For our program to create a word frequency histogram for a text file, we need a way of telling the program which text file to read from. One way is to have the program ask for the file name once started, but that isn't very smooth. A better way is to do what most programs do: accept command line parameters. For example, the program can be started like this:
[D:\]wordhist c:\readme
and then create a word frequency histogram for "c:\readme."
Command line parameters are sent to the "main" function. For "main" to get the parameters, it must be declared a bit differently, though, with a parameter list. The version of "main" to use when accepting command line parameters is:
int main(int argc, char* argv[])
The parameter "argv" may look a little frightening, but it isn't that bad. As I mentioned last month, a string is an array of characters (char stringname[size]), and when passing an array to a function, it's actually a pointer to its first element that is passed. For strings, it's very common to see "char* stringname" in the parameter list of a function. With this little explanation, we can see that "argv" is an array of strings. Also as mentioned last month, you lose the size information when passing an array to a function. For each string, this is not important, since all strings are null terminated (i.e. the last character is (char)0). The number of strings passed, however, is sent in "argc."
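To make the roles of "argc" and "argv" a bit more concrete, here is a minimal sketch of a function that walks the array the same way "main" would. The name "total_arg_length" is made up for this illustration and is not part of the programs in this article.

```c
#include <string.h>  /* strlen */

/* Hypothetical helper: adds up the lengths of all strings in argv.
   argc tells how many strings there are. Each string carries its own
   end marker, the terminating 0, which is what strlen looks for. */
int total_arg_length(int argc, char* argv[])
{
  int total = 0;
  int index;

  for (index = 0; index < argc; ++index)
  {
    total += (int)strlen(argv[index]);
  }
  return total;
}
```

Called from "main" as total_arg_length(argc, argv), the result includes the length of the program name in "argv[0]", in whatever form the shell and compiler happen to deliver it.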
Let's test this with a small program:
#include <stdio.h>

int main(int argc, char* argv[])
{
  int index;
  for (index = 0; index < argc; ++index)
  {
    printf("%2d : %s\n", index, argv[index]);
  }
  return 0;
}
Call this program "argtest.exe" and run it a couple of times. This is how it looks for me:
[d:\tmp]argtest
 0 : D:\TMP\ARGTEST.EXE

[d:\tmp]argtest sdklf wer sdj
 0 : D:\TMP\ARGTEST.EXE
 1 : sdklf
 2 : wer
 3 : sdj

[d:\tmp]argtest.exe 1
 0 : D:\TMP\ARGTEST.EXE
 1 : 1
As you see, there is always at least one argument, "argv[0]," and it is always the name of the program itself. Unfortunately, the exact contents of "argv[0]" might differ from compiler to compiler and also depend on the shell the program is started from. Visual Age C++ always passes the name of the program as sent by the shell. When using 4OS2 as my shell, I get the above result. When using CMD.EXE, I get the name exactly as I type it.
All programs leave something called a "return code" when they terminate. The norm is to return 0 for successful execution, and non-zero for error reporting. The return code is often used in scripts, and to combine programs on the command line. Let's change "argtest" a little bit, and return "argc-1" instead of 0, just to test it.
Running it by itself looks no different from before, but combining it with other programs through "&&" and "||" shows something:
[D:\tmp]argtest 1 && echo OK
 0 : argtest
 1 : 1

[D:\tmp]argtest && echo OK
 0 : argtest
OK

[D:\tmp]argtest 1 ||echo OK
 0 : argtest
 1 : 1
OK

[D:\tmp]argtest ||echo OK
 0 : argtest
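We can see here that the return code and the operator ("&&" or "||") together determine whether the second command runs. With "&&" it runs only if the first program returned 0, and with "||" it runs only if the first program returned non-zero.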
On the command line, and in scripts, you can also use the "if errorlevel n" construct, which executes whatever follows if the return code from the previously run program is greater than, or equal to n. This example shows how it works:
[D:\tmp]argtest
 0 : argtest

[D:\tmp]if errorlevel 1 echo error

[D:\tmp]argtest df
 0 : argtest
 1 : df

[D:\tmp]if errorlevel 1 echo error
error
Maybe you noticed that the "printf" call in the example has a new detail in the formatting string. The detail is the number 2 in:
printf("%2d : %s\n", index, argv[index]);
The interpretation of the above is to print "index" as an integer, just as "%d" usually means, but reserve a width of 2 characters for it. The number 1, for example, will be written as " 1". The number of digits is not limited to 2, though, so it's still possible to print numbers requiring more than 2 digits this way.
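To see the effect of the width without counting spaces on the screen, the same format can be sent to a string with "sprintf", which returns the number of characters it wrote. This is only a small sketch, and "format_width2" is a name invented for it.

```c
#include <stdio.h>  /* sprintf */

/* Hypothetical helper: formats "value" with "%2d" into "out" and
   returns the number of characters written, not counting the
   terminating 0. "out" must have room for the digits, a sign and
   the terminator; 16 characters is plenty for a 32-bit int. */
int format_width2(char* out, int value)
{
  return sprintf(out, "%2d", value);
}
```

With "char text[16]", the call format_width2(text, 1) leaves " 1" in text and returns 2, while format_width2(text, 123) leaves "123" and returns 3. The width is a minimum, not a limit.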
In most programming environments, including C programming, dealing with files resembles dealing with books in real life. You can find out some data about a file by looking at it, but to read from or write to it, it must be opened.
Unlike books, however, you specify your intent when opening a file. You specify that you intend to read it, write to it, or both. In OS/2 and DOS, you must also specify if the file is binary or not. When done, the file must be closed.
All file handling functions and data types are defined in <stdio.h>.
Here is a small example program, opening a file specified in the command line, and printing its contents:
#include <stdio.h>

int main(int argc, char* argv[])
{
  FILE* file;                                       /* 1 */
  char line[1024];

  if (argc != 2)                                    /* 2 */
  {
    printf("Usage: %s filename\n", argv[0]);
    return 1;
  }
  file = fopen(argv[1], "r");                       /* 3 */
  if (file == NULL)
  {
    printf("Failed to open %s\n", argv[1]);
    return 2;
  }
  while (fgets(line, sizeof(line), file) != NULL)   /* 4 */
  {
    printf("%s", line);
  }
  fclose(file);                                     /* 5 */
  return 0;
}
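Now there is a lot to explain:
1. "FILE" is the data type representing an open file. You never use a "FILE" directly, only through a pointer to one.
2. The program requires exactly one parameter, the file name. Since "argv[0]" is the name of the program, "argc" must be 2.
3. "fopen" opens the file named by its first parameter. The second parameter, "r", says that the file is opened for reading, as text. If the file cannot be opened, "fopen" returns NULL.
4. "fgets" reads one line from the file into "line", including the newline character, but never more than sizeof(line)-1 characters, and null terminates what it read. It returns NULL when nothing more can be read.
5. "fclose" closes the file.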
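As a variation on the same open, read, close pattern, here is a sketch that counts the lines of a file instead of printing them. The function name "count_lines" and the chunk size of 256 are my own choices for the illustration, and the "const" in its parameter list is explained later in this article.

```c
#include <stdio.h>   /* FILE, fopen, fgets, fclose */
#include <string.h>  /* strlen */

/* Hypothetical helper: returns the number of lines in the named text
   file, or -1 if the file cannot be opened. */
long count_lines(const char* name)
{
  FILE* file;
  char chunk[256];
  long lines = 0;
  int unfinished = 0;   /* non-zero while a line has not seen its newline yet */
  size_t length;

  file = fopen(name, "r");           /* open for reading as text */
  if (file == NULL)
    return -1;

  while (fgets(chunk, sizeof(chunk), file) != NULL)
  {
    length = strlen(chunk);
    if (length > 0 && chunk[length - 1] == '\n')
    {
      ++lines;                       /* a complete line ended here */
      unfinished = 0;
    }
    else
    {
      unfinished = 1;                /* long line, or last line lacking a newline */
    }
  }
  if (unfinished)
    ++lines;                         /* count a final line without a newline */

  fclose(file);
  return lines;
}
```

Since "fgets" stops after sizeof(chunk)-1 characters, a long line arrives in several pieces. The sketch therefore counts a line only when a piece ends with a newline, plus once more if the file ends without one.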
What we want to do, for the word frequency histogram program, is to read words, and not lines. Unfortunately, there is no function in the ANSI C library that reads words from a file, so we must define our own.
For the wordfile to be useful, it must have a number of characteristics. For example, it must support opening a file, getting the next word from it, and closing it when done.
Preferably, it should also be possible to specify what separates words, since this depends on the context.
Here is what the prototypes might look like:
int wordfile_open(const char* name);
int wordfile_close(void);
size_t wordfile_nextword(char* buffer, size_t buffersize);
Two new things that must be explained just turned up. First, what does "const char*" in the parameter list for "wordfile_open" mean? The type "const char*" is a pointer to a constant character, that is, a pointer to a character which may not change. Well, in fact it may change, but not through the pointer. In other words, the character (or in this case, character string) passed does not need to be a constant. Instead "const" is a promise, saying that this pointer will not be used to change the character. An example will explain this better:
char a = 'a';
char* pa = &a;
const char* pca = &a;
char* pb = pca;    /* this is an error!!! */

printf("%c", a);
*pa = 'p';         /* changes "a", since "pa" points to it. */
printf("%c", a);
*pca = 'c';        /* this is an error!!! */
First "pa" is set to point to "a". This is what was explained last month. That "pca" can point to "a" is not an error. "a" can be changed, either directly or through "pa", but it will not change due to us doing something with "pca," so the promise holds. "pb" cannot get its value from "pca" however, so this line would lead to a compilation error. "pb" is not const, so it promises nothing, meaning it could break the promise "pca" made. Since "pca" has promised not to change whatever it points to, nothing that can change it, can get its value from "pca." The last line in the example results in a compilation error because it is illegal to assign a value to the dereferenced const pointer, since otherwise the promise would be broken.
Returning to our wordfile, const in the parameter list means that "wordfile_open" promises not to alter the string passed as the name.
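Seen from the other side, a function that makes this promise looks like the sketch below. The function "count_char" is invented for the illustration. Note that the string passed to it does not have to be constant itself.

```c
/* Hypothetical helper: counts how often "wanted" occurs in "text".
   The const promises that the function only reads the string. */
int count_char(const char* text, char wanted)
{
  int count = 0;

  while (*text != '\0')
  {
    if (*text == wanted)
      ++count;
    ++text;   /* moving the pointer is fine; only what it points to is read-only */
  }
  return count;
}
```

After "char name[] = "readme";", the call count_char(name, 'e') gives 2, and the caller is still free to change "name" afterwards. An attempt to write "*text = 'x';" inside the function would be rejected by the compiler.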
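The next new thing is "size_t." This is a type, declared in a number of the standard headers, among them <stddef.h>, <stdio.h>, <stdlib.h> and <string.h>. It is an unsigned integer type used for the sizes of things, and it is the type of the value that "sizeof" gives.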
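Since "sizeof" gives a "size_t", a function that takes a buffer size as a "size_t" can be handed sizeof(buffer) directly. The sketch below shows the idea with a made-up function, "copy_truncated", which is not part of the wordfile.

```c
#include <stddef.h>  /* size_t */

/* Hypothetical helper: copies as much of "src" as fits into "dst",
   always leaving room for the terminating 0. "dstsize" must be at
   least 1. Returns the number of characters copied, not counting
   the terminating 0. */
size_t copy_truncated(char* dst, size_t dstsize, const char* src)
{
  size_t copied = 0;

  while (copied + 1 < dstsize && src[copied] != '\0')
  {
    dst[copied] = src[copied];
    ++copied;
  }
  dst[copied] = '\0';
  return copied;
}
```

With "char small[4]", the call copy_truncated(small, sizeof(small), "hello") copies "hel" and returns 3.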
Before beginning to write the code for the wordfile, it's wise to spend a few minutes thinking about how we want it to behave.
What should "wordfile_open" do?
int wordfile_open(const char* name);
The normal operation is of course to just open the file. How do we tell the user of "wordfile_open" if it was successful in opening the file? What parameters are legal? What do we do if the proposed file does not exist, or cannot be opened for reading? Can several wordfiles be opened at the same time?
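To keep things simple for now, I propose the following characteristics for "wordfile_open": Only one wordfile can be open at a time, so it is an error to call "wordfile_open" while a wordfile is open. Passing NULL as the name is also an error. If the file does not exist, or cannot be opened for reading, "wordfile_open" fails and returns 0. If it succeeds, it returns 1.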
Now we do the same for the other functions of the wordfile.
int wordfile_close(void);
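What do we do if the wordfile is not open? What if it is open, but for some reason cannot be closed? Proposal: It is an error to call "wordfile_close" when the wordfile is not open. If the file cannot be closed, "wordfile_close" returns 0, otherwise it returns 1.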
size_t wordfile_nextword(char* buffer, size_t buffersize);
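What values for "buffer" and "buffersize" are legal? What do we do if the wordfile is not open? What do we do if there is not room for the word found in buffer? What do we do if end of file is reached? Proposal: "buffer" must not be NULL, "buffersize" must be at least 2, and the wordfile must be open. As much of the word as there is room for is copied into "buffer", and the rest of a too long word is lost. The length of the copied word is returned, and 0 is returned if the end of the file is reached before any word is found.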
Now we can write the header file "wordfile.h", and document all the above.
/* Usage: */
/* #include <stdio.h> (or <stdlib.h> or <stddef.h>) for size_t */
/* #include "wordfile.h" */

int wordfile_open(const char* name);
/* Open the file with the name passed as a wordfile. If the */
/* file does not exist, open fails. */
/* */
/* Return values: 0 failure to open the file. */
/*                1 succeeded in opening the file. */
/* */
/* Preconditions: */
/* A wordfile must not be open */
/* name != NULL */
/* */
/* Postconditions: */
/* If success, the file is open. */

int wordfile_close(void);
/* Close the open wordfile. */
/* */
/* Return values: 0 failure to close the file. */
/*                1 succeeded in closing the file. */
/* */
/* Preconditions: */
/* The wordfile must be open. */
/* */
/* Postconditions: */
/* If success, the wordfile is closed. */

size_t wordfile_nextword(char* buffer, size_t buffersize);
/* Get the next word from the open wordfile. If no word is */
/* encountered before end of file, 0 is returned. If eof */
/* is reached while reading a word, the word is copied */
/* into buffer and its length returned. Copy as much of */
/* the word as there is room for in buffer. If a word */
/* longer than buffersize is encountered, the remaining */
/* part of the word is lost */
/* */
/* Return values: */
/* The length of the word copied into buffer. If the length */
/* equals buffersize, buffersize-1 characters are copied */
/* into buffer. */
/* */
/* Preconditions: */
/* The wordfile must be open. */
/* buffersize >= 2. */
/* buffer != NULL */
Without mentioning it, I have now explained part of the "programming by contract" concept. For all the functions above, you see a comment part called "Preconditions:" It lists things that must be true when calling the function. For some functions, you also see a "Postconditions:" part, listing things that will be true when the function has returned. The idea behind "programming by contract" is to make clear who is responsible for what. A function with a postcondition says "If you promise [Precondition:] I promise [Postcondition:] will be true when I'm done." If the precondition is violated, the caller of the function is guilty of doing something wrong. If the postcondition is violated, the function has failed to do its job. "wordfile_nextword" should have a postcondition, but it's very difficult to state one that can be checked, since it depends so much on the file.
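When identifying the pre- and postconditions above, I was careful to make sure they were all possible to check for. There is a macro defined in <assert.h>, called "assert," that does the checking: if the expression passed to it is false, the program prints a message telling which condition failed, and where, and then terminates. A small test program: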
#include <assert.h>

void function(int parameter1, int parameter2)
{
  assert(parameter1 > parameter2);
}

int main(void)
{
  function(2,1);
  function(1,2);
  return 0;
}
When I run this program, I get the following result:
[d:\tmp]asserttest
Assertion failed: parameter1 > parameter2, file asserttest.c, line 5

Abnormal program termination
Not too bad? It would of course be better if it somehow could point out the call that violated the condition, but it's as close as you can get with ANSI/ISO C.
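To tie "assert" back to the contract idea, here is a sketch of a function that checks both its preconditions and its postcondition. The function "largest" is made up for the illustration. Checking the postcondition with a second loop is only reasonable because the checks are meant for development.

```c
#include <assert.h>  /* assert macro */
#include <stddef.h>  /* size_t and NULL */

/* Hypothetical example of a contract:               */
/* Preconditions:                                    */
/*   values != NULL                                  */
/*   count >= 1                                      */
/* Postconditions:                                   */
/*   the returned value is >= every element          */
int largest(const int* values, size_t count)
{
  size_t i;
  int result;

  assert(values != NULL);
  assert(count >= 1);

  result = values[0];
  for (i = 1; i < count; ++i)
  {
    if (values[i] > result)
      result = values[i];
  }

  for (i = 0; i < count; ++i)
  {
    assert(result >= values[i]);   /* the function's side of the contract */
  }
  return result;
}
```

A call such as largest(NULL, 3) breaks the caller's side of the contract and stops the program at the first "assert". If one of the checks in the last loop ever failed, the bug would be in "largest" itself.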
The problem with this kind of check is that you usually only want it during development, and maybe beta testing. You don't normally want the checks in the final product, because they aren't supposed to fail, and making them takes time. "assert" handles this by doing nothing at all if the macro "NDEBUG" is defined when compiling. "NDEBUG," unlike "assert," does not behave like a function. Instead its mere presence causes "assert" to do nothing. Most compilers allow defining macros on the command line, and most compilers seem to agree on doing this with the -D flag. An example:
[d:\tmp]gcc -DNDEBUG asserttest.c

[d:\tmp]asserttest

[d:\tmp]
Just by providing the "-DNDEBUG" flag when compiling, the test was removed.
Back to our wordfile. Have you noticed, by the way, that I have so far not mentioned a word about how this should be implemented? This is not because I've forgotten, but because until now it has been unimportant. What should be done is the most important thing. The job itself can be done in many different ways, but someone using the wordfile is not interested in that.
Now, however, we should start thinking of how to implement it, and the skeleton of "wordfile.c" can be written right away, and make use of "assert" to check the conditions.
#include <stdio.h>    /* FILE and size_t types and io functions */
#include <assert.h>   /* assert macro */
#include "wordfile.h" /* The prototypes */

static FILE* file = NULL; /* 1. Explained after the listing */

int wordfile_open(const char* name)
{
  int retval = 0;
  /* Preconditions: */
  /* A wordfile must not be open */
  assert(file == NULL);
  /* name != NULL */
  assert(name != NULL);

  /* Postconditions: */
  /* If success, the file is open. */
  assert(!retval || file != NULL); /* either open failed, or file must be != NULL */
  return retval;
}

int wordfile_close(void)
{
  int retval = 0;
  /* Preconditions: */
  /* The wordfile must be open. */
  assert(file != NULL);

  /* Postconditions: */
  /* If success, the wordfile is closed. */
  assert(!retval || file == NULL);
  return retval;
}

size_t wordfile_nextword(char* buffer, size_t buffersize)
{
  size_t retval = 0;
  /* Preconditions: */
  /* The wordfile must be open. */
  assert(file != NULL);
  /* buffersize >= 2. */
  assert(buffersize >= 2);
  /* buffer != NULL */
  assert(buffer != NULL);

  return retval;
}
Before filling in the blanks there is another C detail that requires an explanation. Near the top, you find a line:
static FILE* file = NULL; /* 1. Explained after the listing */
In this context, the keyword "static" means that the variable "file" is only accessible from this file. It means that if, in another file, an identifier named "file" is referred to, it will not collide with this one. Used like this, static has two advantages: One is that other parts of the program cannot reach the identifier. The other, very similar, is that the global name space is not polluted. If "static" was not available for use like this, you'd have to find some clever name to avoid clashes with names defined in other parts (that perhaps someone else has written), and you'd still not be sure that no one manipulates it without your knowledge. If the variable was not declared "static", someone making use of an identifier named "file" would manipulate this one!
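The same pattern in miniature: a variable that only the functions in its own file can reach. This is just a sketch, and the names "call_count", "counter_next" and "counter_reset" are invented for it.

```c
/* Only visible inside this source file, thanks to "static". */
static int call_count = 0;

/* Hypothetical interface: the only way for other files to affect
   call_count is through these two functions. */
int counter_next(void)
{
  return ++call_count;
}

void counter_reset(void)
{
  call_count = 0;
}
```

Another source file in the same program can define its own variable named "call_count" without any clash, and it cannot change this one except by calling "counter_next" or "counter_reset".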
Now to fill in the blanks.
#include <stdio.h>    /* FILE type and io functions */
#include <ctype.h>    /* is**** functions */
#include <assert.h>   /* assert macro */
#include "wordfile.h" /* The prototypes */

static FILE* file = NULL;

int wordfile_open(const char* name)
{
  int retval = 0;
  /* Preconditions: */
  /* A wordfile must not be open */
  assert(file == NULL);
  /* name != NULL */
  assert(name != NULL);

  file = fopen(name, "r");   /* open for reading as text */
  retval = file != NULL;     /* return error if open failed */

  /* Postconditions: */
  /* If success, the file is open. */
  assert(!retval || file != NULL); /* either open failed, or file must be != NULL */
  return retval;
}

int wordfile_close(void)
{
  int retval = 0;
  /* Preconditions: */
  /* The wordfile must be open. */
  assert(file != NULL);

  if (fclose(file) == 0)     /* did the close succeed? */
  {
    retval = 1;
    file = NULL;             /* mark the file as closed */
  }

  /* Postconditions: */
  /* If success, the wordfile is closed. */
  assert(!retval || file == NULL);
  return retval;
}

size_t wordfile_nextword(char* buffer, size_t buffersize)
{
  size_t retval = 0;
  int c = 0;
  /* Preconditions: */
  /* The wordfile must be open. */
  assert(file != NULL);
  /* buffersize >= 2. */
  assert(buffersize >= 2);
  /* buffer != NULL */
  assert(buffer != NULL);

  while ((c = fgetc(file)) != EOF && !isalnum(c))   /** 1 **/
    ; /* loop until we find an alphanumeric character or EOF */

  while (isalnum(c))                                /** 2 **/
  {
    if (retval < buffersize)
      buffer[retval++] = (char)c;                   /** 3 **/
    c = fgetc(file);
  }
  if (retval < buffersize)
    buffer[retval] = 0;        /* null terminate */
  else
    buffer[buffersize-1] = 0;  /* force null-termination of too long word */
  return retval;               /* return the length of the copied word */
}
At /** 1 **/ some things should be explained. The two lines say:
while ((c = fgetc(file)) != EOF && !isalnum(c))   /** 1 **/
  ; /* loop until we find an alphanumeric character or EOF */
"fgetc" reads a character from the passed file. It returns the character as an "int", though. The reason is that in case end of file has been reached, it returns "EOF", and "EOF" must be outside the valid range for characters (otherwise, what would you do if you read the character that equals EOF?). The "isalnum" function, declared in <ctype.h>, returns non-zero if the character passed to it is a letter or a digit, and 0 otherwise. Together, the two lines read and throw away characters until the first character of a word, or the end of the file, is found.
/** 2 **/ "isalnum" is first called on the last character read from the previous loop. If end of file was reached there, "isalnum" will return 0 since "isalnum" returns 0 for "EOF", and "c" will have the value "EOF" if the end of the file was reached. So, if the end of the file is reached, the loop will not be entered, and the function will report 0 characters copied into buffer.
/** 3 **/ Execution only reaches here if "isalnum" returns true, which it does for all letters in the English alphabet (upper and lower case letters,) and the digits.
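The same combination of "fgetc", "EOF" and "isalnum" can be tried out in isolation. The sketch below counts words in an already opened file without storing them. "count_words" is a name I made up, and the function is not part of the wordfile.

```c
#include <stdio.h>  /* FILE, fgetc, EOF */
#include <ctype.h>  /* isalnum */

/* Hypothetical helper: counts the words in an already opened file,
   where a word is a run of letters and digits. */
long count_words(FILE* in)
{
  long words = 0;
  int inside_word = 0;
  int c;   /* int, not char, so that EOF can be told apart from real characters */

  while ((c = fgetc(in)) != EOF)
  {
    if (isalnum(c))
    {
      if (!inside_word)
        ++words;          /* first character of a new word */
      inside_word = 1;
    }
    else
    {
      inside_word = 0;    /* a separator ends the current word, if any */
    }
  }
  return words;
}
```

For a file containing "Hello, world! 42" the function returns 3, since the comma, the exclamation mark and the spaces all act as separators.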
Now we can use the word file in a small word-reader program:
#include <stdio.h>    /* printf and size_t */
#include "wordfile.h"

int main(int argc, char* argv[])
{
  char word[64]; /* should be large enough, I hope */
  size_t length;

  if (argc != 2)
  {
    printf("Usage: %s filename\n", argv[0]);
    return 1;
  }
  if (!wordfile_open(argv[1]))
  {
    printf("The file %s could not be opened as a wordfile\n", argv[1]);
    return 2;
  }
  for (;;) /* 1 "infinite" loop */
  {
    length = wordfile_nextword(word, sizeof(word));
    if (length == 0)
      break; /* leave the loop, since the end of the file is reached */
    if (length == sizeof(word))
    {
      printf("*** long word, truncated: %s\n", word);
    }
    else
    {
      printf("%s\n", word);
    }
  }
  wordfile_close();
  return 0;
}
Now, save and compile together with wordfile.c as explained in part 5.
As you can see in the small test program, it doesn't need to know anything about how wordfile does its work, only about its interface. It is this technique that is called "encapsulation," since all internals of how the wordfile works are encapsulated by the interface. The good thing about it is that we can make any changes to wordfile.c we like, for readability, for correcting bugs, for improving performance, or for whatever reason. As long as we still follow the contract set up and documented in wordfile.h, any program making use of wordfile can take advantage of the changes by a simple recompile. It also helps troubleshooting. Since no data about the wordfile is visible outside wordfile.c, any error with the wordfile is either a violation of a precondition, or a bug in wordfile.c.