
I was just wondering if anybody knows any best practices or if there is any documentation about this topic:

The scenario is searching / grepping in the log files. To make my point I will use ls. So let's say that I run ls to list a series of files within the directory

/var/log/remote/serverX.domain.local/ps/ps2.log.2014-mm-dd.gz

Here mm and dd are the month and day numbers. There is also a whole battery of servers besides serverX; for the example I use 4, 5, 9 and 10 (these are real servers).

I ran ls under time, first with a list of parameters in curly braces and then with an asterisk instead, to see the difference. Of course, I did not expect the asterisk to perform better.

    emartinez@serverlog:~$ time ls /var/log/remote/server{4,5,9,10}.domain.local/ps/ps2.log.2014-10-0{1,2}.gz
    /var/log/remote/server10.domain.local/ps/ps2.log.2014-10-01.gz
    ...
    /var/log/remote/server5.domain.local/ps/ps2.log.2014-10-02.gz

    real    0m0.004s
    user    0m0.010s
    sys     0m0.000s

Then I replace the last curly brace with an asterisk:

    time ls /var/log/remote/server{4,5,9,10}.domain.local/ps/ps2.log.2014-10-0*.gz

And I get the following stats:

    real    0m0.028s
    user    0m0.020s
    sys     0m0.020s

This is a big difference, even though the asterisk can only match two files, since the only available dates are 01 and 02 of October.

I ran the test again, this time replacing the month with the list {1..12}, and the results were consistent:

ps2.log.2014-{1..12}-0{1,2}.gz : real 0m0.010s
ps2.log.2014-{1..12}-0*.gz     : real 0m0.168s

That's a LOT of difference for just one single asterisk!!! It does make sense that this is slower, but are there any benchmarks on how much slower and are there any best practices outlined somewhere?

runlevel0

2 Answers


It may seem like prefix-* should be easy to turn into, for example, prefix-1 prefix-2, since we're used to seeing directory listings sorted. But it turns out that very few filesystems can actually produce sorted filename listings, and furthermore that there is no standard API for asking for sorted filename listings.

If a program -- such as ls, or, for that matter, bash -- needs a list of filenames, it needs to read the whole directory listing, which will be produced in some random order (often the order is related to creation time; sometimes it's based on a hash of the filename; but in pretty well no case is it a simple alphabetic order). So in order to resolve prefix-*, you need to read the entire directory and check every filename against the pattern. Since the most costly part of that procedure is reading the directory, it makes little difference how complex the pattern is or how many filenames match the pattern.
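To make the cost concrete, here is a rough sketch of what resolving prefix-* amounts to. It is not how bash is implemented internally; it only mimics the procedure with ls -f (which lists entries unsorted) and a case pattern. The scratch directory and the file names are made up for the demonstration.

```shell
# Hypothetical scratch directory, created only for this demonstration.
dir=$(mktemp -d)
touch "$dir/prefix-2" "$dir/other" "$dir/prefix-1"

# Emulate resolving prefix-*: visit every directory entry (ls -f returns
# them unsorted, in whatever order the filesystem produces them), test
# each name against the pattern, and sort only the names that matched.
matches=$(
    ls -f "$dir" | while IFS= read -r name; do
        case $name in
            prefix-*) printf '%s\n' "$name" ;;
        esac
    done | sort
)

printf '%s\n' "$matches"
rm -rf "$dir"
```

Every entry is visited once no matter how selective the pattern is, which is why the size of the directory dominates the cost.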

In summary, pathname expansion ("resolving globs") is going to be slow in a large directory. That's a reason to avoid large directories, rather than a reason to avoid globs.

But there's another important data point: prefix-{1,2} is not pathname expansion. It's "brace expansion", and it's an extension to the POSIX shell standard (although almost all shells implement it). There are a number of differences between brace expansion and pathname expansion, but one important and relevant difference is that brace expansion does not depend on the existence of files. Brace expansion is a simple string operation.

Consequently, prefix-{1,2} will always expand to prefix-1 prefix-2, regardless of whether those files exist or not. That means it can be expanded without reading the directory and without calling stat on any file. Clearly, that's going to be fast. But there's a downside: there's no way to tell whether the result corresponds to real files.

Consider the following simple example:

$ mkdir test && cd test
$ touch file1 file2 file4
$ ls file*
file1 file2 file4
$ ls file[1234]
file1 file2 file4
$ ls file{1,2,3,4}
ls: cannot access file3: No such file or directory
file1 file2 file4

Final point: Pathname expansion is done by the shell, not by ls. With pathname expansion, we could just as well use echo:

$ echo file*
file1 file2 file4
$ echo file[1234]
file1 file2 file4

And echo will produce the list somewhat faster, because all echo needs to do is print its arguments, while ls (which receives the same arguments) has to stat each argument in order to verify that it is a file. That stat -- which is not a cheap call -- is entirely redundant in the case of a pathname expansion, because the shell has already used the directory listing in order to filter the file list and therefore every filename passed to ls is known to exist. (Unless the glob didn't match any files at all.)

In addition, echo is a bash built-in, so it can be invoked without creating a child process.
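One way to see that the shell, not the command, does the matching is to capture the expanded words before any command looks at them. The following is a small sketch under that assumption; the scratch directory is hypothetical, and set -- is used only to store the expansion in the positional parameters.

```shell
# Hypothetical scratch directory with the same three files as above.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch file1 file2 file4

# The shell expands file* before anything runs; "set --" stores the
# resulting words as positional parameters so they can be inspected.
set -- file*
count=$#                 # number of words produced by the expansion
listing=$(echo "$@")     # echo only prints the arguments it was given

printf '%s words: %s\n' "$count" "$listing"

cd - >/dev/null
rm -rf "$dir"
```

No stat call is needed at any point here, because the names came out of the directory listing in the first place.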

In the case of brace expansion, though, echo does not produce the same result:

$ echo file{1,2,3,4}
file1 file2 file3 file4

So we could use ls, redirecting its error output to the bit bucket:

$ ls file{1,2,3,4} 2>/dev/null
file1 file2 file4

and in this case, the stat calls are not redundant because the shell never validated the filenames.
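If you would rather not throw away all of ls's error output, the existence check can be made explicit instead. This is only a sketch: existing_only is a hypothetical helper name, and the arguments are written out by hand so that the snippet also runs in shells without brace expansion.

```shell
# existing_only is a hypothetical helper: it prints only those of its
# arguments that name an existing file, one per line.
existing_only() {
    for name in "$@"; do
        # [ -e ] performs the existence check that brace expansion skipped.
        if [ -e "$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}

# Hypothetical scratch directory with the same three files as above.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch file1 file2 file4

# file1 file2 file3 file4 is what file{1,2,3,4} expands to in bash.
found=$(existing_only file1 file2 file3 file4)
printf '%s\n' "$found"

cd - >/dev/null
rm -rf "$dir"
```

In bash the call could equally be written as existing_only file{1,2,3,4}.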

Unless your directories are really huge, none of this will make much difference and the glob will be a lot easier to write. If your directories are really huge, you should consider splitting them into smaller sub-directories.

For example, instead of paths like:

/var/log/remote/serverX.domain.local/ps/ps2.log.2014-mm-dd.gz

you could use:

/var/log/remote/serverX/domain.local/ps/ps2.log.2014-mm-dd.gz

And if you are keeping the logs forever, you might want to extract the year to avoid infinitely increasing directory size:

/var/log/remote/2014/serverX/domain.local/ps/ps2.log.2014-mm-dd.gz

(2014 is deliberately repeated.)

Sharding the directories will usually be a big win because it provides a mechanism to optimize globbing. As mentioned above, the shell cannot optimize

/var/log/remote/server[2357].domain.local/ps/ps2.log.2014-10-*.gz

but it can optimize

/var/log/remote/server[2357]/domain.local/ps/ps2.log.2014-10-*.gz

In the second case, server[2357] only needs to be matched against the directory names, and once that is done, ps2.log.2014-10-*.gz only needs to be matched against the filenames in the matched directories.
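If you do reorganise existing logs, the new location can be derived from the old path with plain parameter expansion. The sketch below assumes the year-first layout suggested above and keeps the file name as is; shard_path is a hypothetical helper name, and it only computes a path without touching any files.

```shell
# shard_path is a hypothetical helper that maps a flat log path to the
# year/server/domain layout. It only builds a string; no files are moved.
shard_path() {
    root=/var/log/remote          # log root used in the question
    rel=${1#"$root"/}             # server4.domain.local/ps/ps2.log.2014-10-01.gz
    host=${rel%%/*}               # server4.domain.local
    rest=${rel#*/}                # ps/ps2.log.2014-10-01.gz
    file=${rest##*/}              # ps2.log.2014-10-01.gz
    year=${file#ps2.log.}         # 2014-10-01.gz
    year=${year%%-*}              # 2014
    # ${host%%.*} is the server name, ${host#*.} is the domain part.
    printf '%s/%s/%s/%s/%s\n' "$root" "$year" "${host%%.*}" "${host#*.}" "$rest"
}

new=$(shard_path /var/log/remote/server4.domain.local/ps/ps2.log.2014-10-01.gz)
printf '%s\n' "$new"
```

The output of such a helper could then be fed to mkdir -p and mv, after checking it against a few real paths.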

rici

Shell expansion is always performed in a particular order; brace expansion is performed first, file name expansion is performed last.

Thus, a command like

echo {1..3}*

first gets expanded to

echo 1* 2* 3*

then, file name expansion is performed for 1*, 2* and 3*. Each expansion involves going through all file names in the directory and comparing them against the pattern.
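The two stages can be observed in a directory where only some of the generated patterns match. This is a minimal sketch; the scratch directory and the file names 1a and 2b are invented for the demonstration, and bash is invoked explicitly because brace expansion is not part of POSIX sh.

```shell
# Hypothetical scratch directory with two files.
dir=$(mktemp -d)
touch "$dir/1a" "$dir/2b"

# Brace expansion first turns {1..3}* into the three words 1* 2* 3*.
# Each word is then matched against the directory on its own; with bash's
# default settings a word that matches nothing is left unchanged.
result=$(cd "$dir" && bash -c 'echo {1..3}*')

printf '%s\n' "$result"   # prints: 1a 2b 3*
rm -rf "$dir"
```

The leftover 3* shows that the third pattern was generated and tried even though no file could match it.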

As the number of words and/or the number of files in the directory grows, this becomes gradually slower. Even in an empty directory,

shopt -s nullglob  # print nothing for non-matching words
echo {1..1000000}* # prints nothing
shopt -u nullglob  # back to the default

takes almost five seconds on my machine. This is not at all surprising if you consider that file name expansion is performed one million times...

A much faster alternative is to avoid combining both types of shell expansion whenever possible.

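The command

echo [1-9]* # also prints nothing

searches for the same file names, but it uses a single pattern. A bracket expression matches exactly one character, so [1-9]* matches every name that starts with a digit from 1 to 9, which is precisely the set covered by 1*, 2*, ..., 1000000*. This takes 33 milliseconds on my machine.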

Using square brackets instead of curly brackets has additional benefits:

$ shopt -s nullglob
$ touch 13
$ echo {1..20}*
13 13
$ echo [1-9]*
13

The first approach found the file twice, since it matches both of the patterns 1* and 13*. This doesn't happen with "pure" file name expansion, where each matching name is listed once.
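The duplication is easy to count. The sketch below compares the number of words produced by the brace-plus-glob form with the number produced by a single bracket pattern such as [1-9]*, which matches any name starting with a digit from 1 to 9. The scratch directory is hypothetical, and bash is invoked explicitly because brace expansion and shopt are bash features.

```shell
# Hypothetical scratch directory containing the single file "13".
dir=$(mktemp -d)
touch "$dir/13"

# {1..20}* becomes twenty separate patterns; both 1* and 13* match "13".
# nullglob drops the patterns that match nothing.
brace_count=$(cd "$dir" && bash -c 'shopt -s nullglob; set -- {1..20}*; echo $#')

# [1-9]* is one pattern, so each matching name appears exactly once.
glob_count=$(cd "$dir" && bash -c 'shopt -s nullglob; set -- [1-9]*; echo $#')

printf 'brace+glob: %s, single glob: %s\n' "$brace_count" "$glob_count"
rm -rf "$dir"
```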

Dennis