Reading Between the Lines
Today we’ll be talking about caveats when reading lines from a file.
cloder@ recently discovered a series of fgets(3) misuses:
char buf[1024];
fgets(buf, sizeof(buf), fp);
/* Nuke newline. */
buf[strlen(buf) - 1] = '\0';
fgets(3) includes a newline character from each line,
but only if one was found.
This is why most code truncates the string after calling fgets(3).
There are three things wrong with this example:
- fgets(3) can fail and return NULL.
- buf[strlen(buf) - 1] might not be a newline,
truncating a valid character.
This can happen if the file did not end in a newline
or if the buffer was too small to store the line.
- buf might contain binary data and start with a NUL character,
causing strlen(buf) to be zero.
This, in turn, causes an out-of-bounds write.
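The third problem is easy to reproduce. The sketch below is not from the man page; fgets_strlen() is a hypothetical helper that writes raw bytes to a temporary file, reads the first line back with fgets(3), and returns what strlen(3) reports. For a line that starts with a NUL character the result is zero, so strlen(buf) - 1 wraps around to SIZE_MAX.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write n raw bytes to a temporary file, read the
 * first line back with fgets(3), and return what strlen(3) reports.
 * Returns (size_t)-1 if the temporary file, the write, or the read fails.
 */
static size_t
fgets_strlen(const void *data, size_t n)
{
	char buf[1024];
	FILE *fp;
	size_t len = (size_t)-1;

	assert(data != NULL);
	if ((fp = tmpfile()) == NULL)
		return len;
	if (fwrite(data, 1, n, fp) == n) {
		/* Reposition before switching from writing to reading. */
		rewind(fp);
		if (fgets(buf, sizeof(buf), fp) != NULL)
			len = strlen(buf);
	}
	fclose(fp);
	return len;
}
```

Calling fgets_strlen("\0abc\n", 5) should return 0 even though fgets(3) stored five characters, which is the case where the unchecked buf[strlen(buf) - 1] write lands outside the buffer.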
The following example, copied directly from the man page,
checks for all three errors:
char buf[1024];
if (fgets(buf, sizeof(buf), fp) != NULL) {
	if (buf[0] != '\0' && buf[strlen(buf) - 1] == '\n')
		buf[strlen(buf) - 1] = '\0';
}
Note that lines containing NUL characters cannot be read reliably with
fgets(3), since we are using strlen(3) to check the line length.
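An alternative that avoids indexing with strlen(buf) - 1 altogether is to let strcspn(3) find the newline. This is a swapped-in idiom, not the man page example quoted above, and read_line() is a hypothetical wrapper introduced only for illustration. strcspn(buf, "\n") returns the index of the first newline, or the length of the string if there is none, so the assignment either removes the newline or harmlessly overwrites the terminating NUL with another NUL.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read one line with fgets(3) and strip the
 * trailing newline, if there is one.
 * Returns buf, or NULL on end-of-file or error.
 */
static char *
read_line(char *buf, int size, FILE *fp)
{
	assert(buf != NULL && size > 0 && fp != NULL);
	if (fgets(buf, size, fp) == NULL)
		return NULL;
	/* Index of the first '\n', or strlen(buf) if none was stored. */
	buf[strcspn(buf, "\n")] = '\0';
	return buf;
}
```

This handles an empty string and a missing newline without a separate check, but it shares the limitation described above: anything after an embedded NUL character is invisible to strcspn(3).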
Some of these issues can be avoided by using fgetln(3).
fgetln(3) allows for arbitrarily long lines and
returns the number of characters read.
As with fgets(3), newline characters must be removed manually;
this is easier to do with fgetln(3) because the length is returned,
but a check must still be performed.
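Because the length is already known, the newline check needs no strlen(3) call. A minimal sketch, assuming a buffer and length such as those returned by fgetln(3); line_len() is a hypothetical name, and the explicit len > 0 guard is only there to keep the helper safe for arbitrary callers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: given a buffer and its length, return the length
 * of the line without its trailing newline.  The buffer is not modified
 * and does not need to be NUL-terminated.
 */
static size_t
line_len(const char *buf, size_t len)
{
	assert(buf != NULL);
	if (len > 0 && buf[len - 1] == '\n')
		len--;
	return len;
}
```

Since the helper never writes to the buffer, a caller that only wants to print the line could pass the result as a precision, for example printf("%.*s\n", (int)line_len(buf, len), buf), though printf(3) would still stop at an embedded NUL character.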
Here is an example, copied directly from the man page:
char *buf, *lbuf;
size_t len;
lbuf = NULL;
while ((buf = fgetln(fp, &len))) {
	if (buf[len - 1] == '\n')
		buf[len - 1] = '\0';
	else {
		/* EOF without EOL, copy and add the NUL */
		if ((lbuf = malloc(len + 1)) == NULL)
			err(1, NULL);
		memcpy(lbuf, buf, len);
		lbuf[len] = '\0';
		buf = lbuf;
	}
	printf("%s\n", buf);
}
free(lbuf);
A few things to note here:
- There is no strlen(3) call,
so the only thing affected by NUL characters is printf(3).
- fgetln(3) cannot return successfully with a zero-length string,
so the buffer cannot be accessed with a negative index.
- The free(3) is correctly placed,
since it is only possible to call malloc(3) in the last iteration.
For a higher level of abstraction and more knobs, check out fparseln(3),
which takes care of newline characters automatically and can recognize
escape characters, comment characters, and continuation characters.
Unfortunately, fparseln(3) has more overhead, is less portable,
and each line must be freed afterwards.
It is also impossible to check if the file ends in a newline;
to reliably read every character in a file, use fgetln(3).
For more information, read the man pages:
fgets(3),
fgetln(3), and
fparseln(3).
Pay attention to the CAVEATS sections for fgets(3) and fgetln(3).