
Suppose I want to remove all files in a directory except for one named "notes.txt". I would do this with the pipeline ls | grep -v "notes.txt" | xargs rm. Why do I need xargs if the output of the second pipe is the input that rm should use?

For the sake of comparison, the pipeline echo "#include <knowledge.h>" | cat > foo.c writes the echoed text into the file without using xargs. What is the difference between these two pipelines?

seewalker

3 Answers


You are confusing two very different kinds of input: STDIN and arguments. Arguments are a list of strings provided to the command as it starts, usually by specifying them after the command name (e.g. echo these are some arguments or rm file1 file2). STDIN, on the other hand, is a stream of bytes (sometimes text, sometimes not) that the command can (optionally) read after it starts. Here are some examples (note that cat can take either arguments or STDIN, but it does different things with them):

echo file1 file2 | cat    # Prints "file1 file2", since that's the stream of
                          # bytes that echo passed to cat's STDIN
cat file1 file2    # Prints the CONTENTS of file1 and file2
echo file1 file2 | rm    # Prints an error message, since rm expects arguments
                         # and doesn't read from STDIN

xargs can be thought of as converting STDIN-style input to arguments:

echo file1 file2 | cat    # Prints "file1 file2"
echo file1 file2 | xargs cat    # Prints the CONTENTS of file1 and file2

echo actually does more-or-less the opposite: it converts its arguments to STDOUT (which can be piped to some other command's STDIN):

echo file1 file2 | echo    # Prints a blank line, since echo doesn't read from STDIN
echo file1 file2 | xargs echo    # Prints "file1 file2" -- the first echo turns
                                 # them from arguments into STDOUT, xargs turns
                                 # them back into arguments, and the second echo
                                 # turns them back into STDOUT
echo file1 file2 | xargs echo | xargs echo | xargs echo | xargs echo    # Similar,
                                 # except that it converts back and forth between
                                 # args and STDOUT several times before finally
                                 # printing "file1 file2" to STDOUT.
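To see all of this in action on the question's actual pipeline, here is a throwaway demonstration in a scratch directory (the filenames a.txt and b.txt are made up for the demo; note that parsing ls output is fragile when filenames contain spaces or newlines, so this is only a sketch of the mechanism):

```shell
# Set up a scratch directory with some files to delete and one to keep.
mkdir -p demo && cd demo
touch a.txt b.txt notes.txt

# ls writes one filename per line to STDOUT; grep -v drops notes.txt;
# xargs turns the surviving names into arguments for rm.
ls | grep -v "notes.txt" | xargs rm

ls    # only notes.txt remains
```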

cat reads input from STDIN, but rm does not. For commands like rm you need xargs to read STDIN, split it into items, and run the command with those items as command-line arguments.

Alex P.

Understand xargs with a minimal example

Before looking into why xargs is useful, let's first make sure that we understand what xargs does with some minimal examples.

When you do either of:

printf '1 2 3 4' | xargs rm
printf '1\n2\n3\n4' | xargs rm

xargs parses the input string coming from stdin and separates arguments by whitespace, somewhat like Bash does, though the details differ. In particular, spaces and newlines are treated differently if you use xargs -L instead of -n: https://stackoverflow.com/questions/6527004/why-does-xargs-l-yield-the-right-format-while-xargs-n-doesnt/6527308#6527308

Because we are not using -L however, both of the above calls are equivalent, and xargs would parse out four arguments: 1, 2, 3 and 4.
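We can make those four parsed arguments visible without deleting anything by substituting a harmless command for rm (a small sketch: printf reuses its format string for each argument, so every parsed argument is printed bracketed on its own line):

```shell
# Both spaces and newlines act as separators, so xargs parses
# four arguments here and passes them all to printf.
printf '1 2\n3 4' | xargs printf '[%s]\n'
# Output:
# [1]
# [2]
# [3]
# [4]
```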

Then, xargs takes the arguments it parsed and feeds them to the program we gave it; in our case, the executable /usr/bin/rm.

By default, xargs does not guarantee how many arguments it will pass per invocation; unless we pass certain flags, it could be more than one. So the above xargs calls could be equivalent to any of:

rm 1 2 3 4

or:

rm 1 2
rm 3 4

or:

rm 1
rm 2
rm 3
rm 4

and we generally don't know which of those happened. For rm it doesn't matter: the end result is the same either way, since files 1, 2, 3, and 4 all get removed, so we just let xargs do its thing.

It could make a difference for other programs, however, e.g. /usr/bin/echo, which adds a newline for every call.

Control how many arguments are passed at a time

We can control how many arguments xargs passes to the command at a time with certain flags.

The simplest one is -n, which limits the maximum number of arguments to be passed at a time.

Then, we can observe what is going on by using /usr/bin/echo instead of /usr/bin/rm, because echo, unlike rm, treats echo 1 2 differently than echo 1; echo 2: it adds a newline for each call.

With this in mind, if we run:

printf '1 2 3 4' | xargs -n2 echo

it supplies 2 arguments at a time to echo and is equivalent to:

echo 1 2
echo 3 4

which produces:

1 2
3 4

And if we instead run:

printf '1 2 3 4' | xargs -n1 echo

it supplies 1 argument at a time to echo and is equivalent to:

echo 1
echo 2
echo 3
echo 4

which produces:

1
2
3
4

Another way is to use -L instead of -n. -L limits the number of input lines per invocation rather than the number of arguments: arguments within a line are still split on blanks, but a line is never split across calls. See: https://stackoverflow.com/questions/6527004/why-does-xargs-l-yield-the-right-format-while-xargs-n-doesnt/6527308#6527308
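A small sketch of the difference between -L1 and -n1 on the same two-line input:

```shell
# -L1 passes each whole input line per invocation: two echo calls.
printf '1 2\n3 4\n' | xargs -L1 echo
# Output:
# 1 2
# 3 4

# -n1 passes each whitespace-separated token per invocation: four echo calls.
printf '1 2\n3 4\n' | xargs -n1 echo
# Output:
# 1
# 2
# 3
# 4
```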

And another common way to control the number of arguments is -I, which implies -L1 and substitutes each input line into the command at the placeholder you choose, e.g.:

printf '1\n2\n3\n4\n' | xargs -I% echo a % b

is equivalent to:

echo a 1 b
echo a 2 b
echo a 3 b
echo a 4 b

and so produces:

a 1 b
a 2 b
a 3 b
a 4 b

Alternative approaches and why xargs is superior

Now that we understand what xargs does, let's consider the alternatives and why xargs is better.

Suppose we have a file:

notes.txt

1
2
3
4

Instead of:

xargs rm < notes.txt

we might want to use:

rm $(cat notes.txt)

which expands to:

rm 1 2 3 4

However, this is problematic because there is a maximum combined size (ARG_MAX on Linux) for a program's command-line arguments, so the command could fail if notes.txt listed too many files.

xargs knows about this limit, and automatically splits the arguments into batches small enough to stay under it.
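We can watch this batching happen (a small sketch; the exact batch sizes depend on the system and xargs implementation). Each echo invocation prints one line, so counting output lines counts invocations:

```shell
# seq prints the numbers 1..100000, one per line; xargs packs them into
# as few echo invocations as the argument-size limit allows. Each echo
# prints one line, so the line count equals the number of batches.
seq 100000 | xargs echo | wc -l    # a small number, far less than 100000
```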

Streams like stdin, on the other hand, have no maximum size, so this approach works for input of arbitrary length. The reason is that a stream can be read little by little with the read() system call, whereas command-line arguments must all be loaded into the process's memory at once.

Another simple approach you could try would be:

while IFS="" read -r p || [ -n "$p" ]
do
  rm "$p"
done < notes.txt

from: https://stackoverflow.com/questions/1521462/looping-through-the-content-of-a-file-in-bash but this requires a lot of typing, and could be slower because:

  • it calls the /usr/bin/rm executable once for every argument, rather than fewer times with a bunch of arguments
  • more time is spent on the bash while loop, as opposed to the C-coded xargs code

To make xargs even more interesting, the GNU version has a -P option for parallel operation!
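For instance (a sketch using GNU xargs; -P is an extension not guaranteed by POSIX), -P4 runs up to four jobs concurrently:

```shell
# Each job sleeps one second and then echoes its input line; with -P4
# all four run at once, so the whole pipeline takes about one second
# instead of four.
printf '1\n2\n3\n4\n' | xargs -P4 -I% sh -c 'sleep 1; echo %'
```

Note that with parallel jobs the output order may vary between runs.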

Related: https://unix.stackexchange.com/questions/24954/when-is-xargs-needed