
I frequently have to count the number of files in a directory; sometimes the count runs into the millions.

Is there a better way than just enumerating and counting them with find . | wc -l? Is there some kind of filesystem call you can make on ext3/4 that is less I/O-intensive?

MattPark

4 Answers


Not a fundamental speed-up but at least something :)

find . -printf \\n | wc -l

You really do not need to pass the list of file names; the newlines alone suffice. This variant is about 15% faster on my Ubuntu 12.04.3 when the directories are cached in RAM. In addition, this variant counts correctly even when file names contain newlines, which plain find . | wc -l would each count as an extra line.
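
To see the difference for yourself, here is a quick demonstration in a scratch directory (the directory name is arbitrary and the $'...' quoting is bash-specific):

$ mkdir /tmp/count-demo && cd /tmp/count-demo
$ touch $'weird\nname'
$ find . | wc -l        # the embedded newline adds a spurious line
3
$ find . -printf \\n | wc -l
2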

Interestingly, this variant seems to be a little bit slower than the one above:

find . -printf x | wc -c

Special case - but really fast

If the directory is on its own file system, you can simply count the inodes:

df -i .

If the number of directories and files outside the directory being counted does not change much, you can simply subtract that known number from the current df -i result. This way you can count the files and directories very quickly.
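
As a minimal sketch, assuming the filesystem is mounted at /srv/data and that you recorded its baseline inode usage earlier (the mount point and baseline value are hypothetical, and --output is a GNU df extension):

$ baseline=11    # hypothetical: IUsed before the files of interest were created
$ used=$(df --output=iused /srv/data | tail -n 1)
$ echo $(( used - baseline ))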


I have written ffcnt for exactly that purpose. It retrieves the physical offsets of the directories themselves with the fiemap ioctl and then schedules the directory traversal in multiple sequential passes to reduce random access. Whether you actually get a speedup compared to find | wc depends on several factors:

  • filesystem type: filesystems such as ext4 which support the fiemap ioctl will benefit most
  • random access speed: HDDs benefit far more than SSDs
  • directory layout: the higher the number of nested directories, the more optimization potential
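
If you want to try it, something along these lines should work; both the install command and the bare-path invocation are assumptions based on it being a Rust tool, so check the project's README for the actual interface:

$ cargo install ffcnt     # assumption: published as a Rust crate
$ ffcnt /path/to/dir      # assumption: takes the target directory as its argument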

(Re)mounting with relatime or even nodiratime may also improve speed (for all of the methods here) when the accesses would otherwise cause metadata updates.
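
For example, a remount with the weaker atime semantics might look like this (the mount point is hypothetical):

$ sudo mount -o remount,nodiratime /mnt/data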

the8472

Use fd instead. It's a fast alternative to find that traverses folders in parallel. Even after deliberately slowing it down with lots of options and purging the directory cache between benchmark runs, it's still 4-12 times faster than find in my case:

$ time find ~ -type f 2>/dev/null | wc -l
  445705
find ~ -type f 2> /dev/null  0.84s user 13.57s system 51% cpu 28.075 total
wc -l  0.03s user 0.02s system 0% cpu 28.074 total

$ time fd -t f -sHI --show-errors . ~ 2>/dev/null | wc -l
  445705
fd -t f -sHI --show-errors . ~ 2> /dev/null  2.66s user 14.81s system 628% cpu 2.780 total
wc -l  0.05s user 0.05s system 3% cpu 2.779 total

By default fd skips hidden folders and anything matched by .gitignore, and it does not print permission errors, so out of the box it is even faster than shown here. The -sHI --show-errors options were added to match find's default behavior.

Of course you'll need to install it, as it isn't there by default (just like the ffcnt solution above), but installation is trivial on all major platforms. fd is written in Rust, so it's also easy to carry a statically built binary to other PCs where it isn't available.

It's possible to tune this further by printing only a newline instead of piping the whole path. With find you can achieve that with -printf '\n'. This isn't currently supported in fd, but it has been requested as a feature.

Update:

It's now possible to do that with the --format option, until a proper --printf is implemented:

fd --format '' | wc -l

Note that on Ubuntu, due to a name clash, you'll need to use fdfind instead of fd. You can simply alias fd=fdfind to avoid typing the longer name.
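
To make the alias permanent, add it to your shell startup file (the path assumes bash):

$ echo 'alias fd=fdfind' >> ~/.bashrc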

Another nice thing about fd is that when you run it in an interactive terminal you get colorized output, unlike find. It's also a good fit for searching inside git repositories, because objects under .git/ won't be searched.


Another solution is al-dente. It's very limited in features (it just lists all the files), but that's probably one of the reasons it's so fast. You have to compile it from source, though, as it isn't packaged, and you have to specify the number of threads to use. The result is consistently faster than fd:

$ TIMEFORMAT=%R

$ time ./dent ~ $(nproc --all) | wc -l >/dev/null
0.050

$ time fd -sHI --show-errors . ~ 2>/dev/null | wc -l >/dev/null
0.141

$ time find ~ 2>/dev/null | wc -l >/dev/null
0.239

phuclv

Actually, on my system (Arch Linux) this command

   ls -A | wc -l

is faster than all of the above:

   $ time find . | wc -l
   1893

   real    0m0.027s
   user    0m0.004s
   sys     0m0.004s

   $ time find . -printf \\n | wc -l
   1893

   real    0m0.009s
   user    0m0.000s
   sys     0m0.008s

   $ time find . -printf x | wc -c
   1893

   real    0m0.009s
   user    0m0.000s
   sys     0m0.008s

   $ time ls -A | wc -l
   1892

   real    0m0.007s
   user    0m0.000s
   sys     0m0.004s

(The count is 1892 rather than 1893 because ls -A, unlike find ., does not include the directory itself. Keep in mind that ls -A also does not recurse into subdirectories, so the comparison only holds for a flat directory.)

MariusMatutiae