The blog post grep your way to freedom considers “what happens when you grep a file and append the contents to the same file”.
Before moving on to investigate BSD grep, the author observes that GNU grep will stop if it detects that a grep process starts grepping its own output.
Indeed, if I try:
touch test.txt
grep something test.txt > test.txt
I get:
grep: input file ‘test.txt’ is also the output
I was once caught out by a grep feedback loop with a version which obviously didn’t have this check. A script I wrote to parse some test results overnight caught its own output in the glob. The result was a 1 TB output file, a temporarily unusable development machine and some displeased colleagues.
Because of this mishap I was interested in understanding how the developers of GNU grep now prevent this.
I grabbed the source code for grep 3.1 (the version installed on my system) and began grepping.
Finding the check responsible for the loop detection was painless thanks to the printed warning text. The check can be seen below along with the verbose comment that precedes it.
/* If there is a regular file on stdout and
the current file refers to the same i-node,
we have to report the problem and skip it.
Otherwise when matching lines from some
other input reach the disk before we open
this file, we can end up reading and
matching those lines and appending them to
the file from which we're reading. Then
we'd have what appears to be an infinite
loop that'd terminate only upon filling the
output file system or reaching a quota.
However, there is no risk of an infinite
loop if grep is generating no output, i.e.,
with --silent, --quiet, -q.
Similarly, with any of these:
--max-count=N (-m) (for N >= 2)
--files-with-matches (-l)
--files-without-match (-L)
there is no risk of trouble.
For --max-count=1, grep stops after
printing the first match, so there is no
risk of malfunction. But even
--max-count=2, with input==output, while
there is no risk of infloop, there is a
race condition that could result in
"alternate" output. */
if (!out_quiet &&
list_files == LISTFILES_NONE &&
1 < max_count &&
S_ISREG (st.st_mode) &&
SAME_INODE (st, out_stat))
{
if (! suppress_errors)
{
error (0, 0,
_("input file %s"
"is also the output"),
quote (input_filename ()));
}
errseen = true;
goto closeout;
}
In the if statement this condition is the most important:
SAME_INODE (st, out_stat))
which expands to the following:
#define SAME_INODE(a, b) \
((a).st_ino == (b).st_ino \
&& (a).st_dev == (b).st_dev)
In this scope st
and out_stat
are structures of type struct stat
. They are passed to the fstat()
function and populated for the file descriptors of the file to be searched (st
) and STDOUT_FILENO
(st_out
).
With these (sizeable) breadcrumbs and some manpage searching it was not too difficult to determine how grep detects when there is a risk of a feedback loop.
When a C program runs there are three standard input/output streams that are available via file descriptors. They are:
stdin
stream with file descriptor value STDIN_FILENO
(0
)stdout
stream with file descriptor value STDOUT_FILENO
(1
)stderr
stream with file descriptor value STDERR_FILENO
(2
)Typically these steams all map to the terminal you are running your program in, be it a tty, pty, etc. However this does not always have to be the case.
When using the Bash shell for instance, if you perform a redirection such as ./program 1> stdout.txt
, Bash will prepare the correct file descriptors for you prior to executing your program. If you are interested, there is a summary of how it does this here.
Each file in Linux has an inode (man 7 inode
) associated with it. This is a collection of metadata about the file, such as the user id of the file’s owner and last access time.
This inode information can be obtained from a given file descriptor with the fstat()
system call (see man 2 fstat
) which will populate a structure of type struct stat
.
Importantly for us, the inode / struct stat
also contains:
stuct stat.st_ino
.struct stat.st_dev
.The (stat.st_ino
,stat.st_dev
) pair will uniquely identify a file across all available file systems.
To prevent feedback loops Grep compares the inode information of the STDOUT_FILENO
file descriptor with the inode information of the file to be searched. If they are the same (and certain other conditions are met) it prints an error message and exits.
The following code will print to stderr
the device id and inode number for the STDIN_FILENO
, STDOUT_FILENO
and STDERR_FILENO
file descriptors. Additionally if an optional file is provided as a command line argument it’s device id and inode number will also be printed.
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdbool.h>
#define VERBOSE false
#define PRINT(FMT, ...) \
fprintf(stderr, FMT "\n", __VA_ARGS__)
#define PRINT_STAT(STAT, FMT) \
PRINT(#STAT ": " FMT, buf.STAT)
static int print_fstat(int fd, char *fdname)
{
int rtn = 0;
struct stat buf;
PRINT("==== name: %s fd: %d ====",
fdname, fd);
memset(&buf, 0, sizeof(buf));
if ((rtn = fstat(fd, &buf)))
{
return rtn;
}
PRINT_STAT(st_dev, "%lu");
PRINT_STAT(st_ino, "%lu");
if (!VERBOSE)
{
return rtn;
}
PRINT_STAT(st_mode, "0x%x");
PRINT_STAT(st_nlink, "%lu");
PRINT_STAT(st_uid, "%u");
PRINT_STAT(st_gid, "%u");
PRINT_STAT(st_rdev, "%lu");
PRINT_STAT(st_size, "%ld");
PRINT_STAT(st_blksize, "%ld");
PRINT_STAT(st_blocks, "%ld");
return rtn;
}
int main(int argc, char **argv)
{
int user_file = -1;
print_fstat(STDIN_FILENO, "STDIN_FILENO");
print_fstat(STDOUT_FILENO, "STDOUT_FILENO");
print_fstat(STDERR_FILENO, "STDERR_FILENO");
if (argc != 2)
{
return 0;
}
user_file = open(argv[1],
O_RDONLY | O_CREAT,
S_IRUSR | S_IWUSR);
if (user_file < 0)
{
PRINT("Open of (%s) failed", argv[1]);
return user_file;
}
print_fstat(user_file, argv[1]);
return close(user_file);
}
The following is an example of invoking the above program with the current terminal as the optional file to inspect:
./redirect_test.exe $(tty)
The output of this can be seen below. Note that the (st_dev
, st_ino
) combination for stdin
, stdout
, stderr
and the current terminal are all identical.
==== name: STDIN_FILENO fd: 0 ====
st_dev: 21
st_ino: 10
==== name: STDOUT_FILENO fd: 1 ====
st_dev: 21
st_ino: 10
==== name: STDERR_FILENO fd: 2 ====
st_dev: 21
st_ino: 10
==== name: /dev/pts/7 fd: 3 ====
st_dev: 21
st_ino: 10
The following command redirects stdout
to test.txt
, with test.txt
also being the optional file to inspect:
./redirect_test.exe test.txt > test.txt
The output below shows that stdin
and stderr
reference the same file (the terminal), while stdout
and test.txt
both reference a second file (test.txt
).
==== name: STDIN_FILENO fd: 0 ====
st_dev: 21
st_ino: 10
==== name: STDOUT_FILENO fd: 1 ====
st_dev: 51
st_ino: 8656017
==== name: STDERR_FILENO fd: 2 ====
st_dev: 21
st_ino: 10
==== name: test.txt fd: 3 ====
st_dev: 51
st_ino: 8656017