13

Is there a way to limit the number of results returned by the find command on a unix system?

We are having performance issues due to an unusually large number of files in some directories.

I'm trying to do something like:

find /some/log -type f -name *.log -exec rm {} ; | limit 5000
blahdiblah
  • 5,501
lemotdit
  • 275

6 Answers

31

You could try something like find [...] |head -[NUMBER]. Once head has printed its however-many lines and exits, find receives a SIGPIPE the next time it writes to the (now closed) pipe, so it doesn't continue its search.
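For instance, with the paths from the question (a sketch; adjust the name pattern and the count to taste):

# Print at most 5000 matching log file names, then stop searching
find /some/log -type f -name '*.log' | head -n 5000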

Caveat: find outputs files in the order they appear in the directory structure. Most *NIX file systems do not order directories by entry name. This means the results are given in an unpredictable order. find |sort will put the list in the sort order defined by your LC_COLLATE setting -- in most cases, ASCIIbetical order.

Another caveat: It's exceedingly rare to see in the wild, but *NIX filenames can contain newline characters. Many programs get around this by optionally using a NUL byte (\0) as the record separator.

Most *nix text-processing utilities have the option to use a NUL as a record separator instead of a newline. Some examples:

  • grep -z
  • xargs -0
  • find -print0
  • sort -z
  • head -z
  • perl -0

Putting this all together, to safely remove the first 5000 files, in alphabetical order:

find /some/log -type f -name '*.log' -print0 |
sort -z |
head -z -n 5000 |
xargs -0 rm

* The line breaks are added for clarity only; the pipeline works the same either way. You can run it all on one line (foo | bar | baz) as long as you keep the | (vertical pipe) between the commands.

6

It sounds like you're looking for xargs, but don't know it yet.

find /some/log/dir -type f -name "*.log" | xargs rm
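If any of the names might contain spaces, newlines or other odd characters, a NUL-delimited variant of the same idea is safer (a sketch; it relies on find's -print0 and xargs' -0, available in GNU and BSD versions):

# Same idea, robust against whitespace in file names
find /some/log/dir -type f -name "*.log" -print0 | xargs -0 rm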
blahdiblah
  • 5,501
1

If you have a very large number of files in your directories, or if piping is not an option (for instance because xargs would be limited by the number of arguments allowed by your system), another option is to use the exit status of one -exec command as a filter for the next actions, something like:

rm -f /tmp/count ; echo 0 > /tmp/count
find . -type f \
  -exec bash -c 'echo "$(( $(cat /tmp/count) + 1 ))" > /tmp/count' \; \
  -exec bash -c 'test "$(cat /tmp/count)" -lt 5000' \; \
  -exec echo "any command instead of echo of this file: {}" \;

The first exec just increments the counter. The second exec tests the count: if it is less than 5000, it exits with status 0 and the next action is executed. The third exec does the intended work on the file, here a simple echo; you could also use -print, -delete, etc. (I would use -delete instead of -exec rm {} \;, for instance.)

This all relies on the fact that find runs its actions in sequence, each one only if the previous one returned 0.

When using the above example, you'd want to make sure /tmp/count is not used by a concurrent process.
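One way to do that (a sketch, not part of the original command; it assumes mktemp is available and introduces a count_file variable) is to give each run its own private counter file:

# Private counter file, so a concurrent run cannot interfere
count_file=$(mktemp) || exit 1
echo 0 > "$count_file"
find . -type f \
  -exec bash -c 'echo "$(( $(cat "$1") + 1 ))" > "$1"' bash "$count_file" \; \
  -exec bash -c 'test "$(cat "$1")" -lt 5000' bash "$count_file" \; \
  -exec echo "any command instead of echo of this file: {}" \;
rm -f "$count_file"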

[Edits following comments from Scott] Thanks a lot, Scott, for your comments.

Based on them, the number was changed to 5,000 to match the original question.

Also, it is absolutely correct that the /tmp/count file will still be written 42,000 times (once per file browsed), so find will still go through all 42,000 entries but will only execute the command of interest 5,000 times. In other words, this command does not avoid traversing the whole tree; it is simply an alternative to the usual pipes. Hosting the /tmp/count file on a memory-backed temporary directory (e.g. a tmpfs mount) would seem appropriate.

Beyond those comments, some additional notes: pipes would be simpler in most typical cases.

Still, here are some cases where pipes do not apply as easily:

  • when file names contain spaces: a plain find | xargs pipeline splits names on whitespace, whereas -exec passes each {} to the command as a single argument,

  • when the intended command cannot take all the file names at once (in a row), for instance something like: -exec somespecificprogram -i "{}" -o "{}.myoutput" \;

So this example is mainly for those who have run into trouble with pipes and still do not want to resort to a more elaborate script.

wang
  • 11
0

Just |head didn't work for me:

root@static2 [/home/dir]# find . -uid 501 -exec ls -l {} \; | head 2>/dev/null
total 620
-rw-r--r--  1 root   root           55 Sep  8 15:22 08E7384AE2.txt
drwxr-xr-x  3 lamav statlus 4096 Apr 22  2015 1701A_new_email
drwxr-xr-x  3 lamav statlus 4096 Apr 22  2015 1701B_new_email
drwxr-xr-x  3 lamav statlus 4096 May 11  2015 1701C_new_email
drwxr-xr-x  2 lamav statlus 4096 Sep 24 18:58 20150924_test
drwxr-xr-x  3 lamav statlus 4096 Jun  4  2013 23141_welcome_newsletter
drwxr-xr-x  3 lamav statlus 4096 Oct 31  2012 23861_welcome_email
drwxr-xr-x  3 lamav statlus 4096 Sep 19  2013 24176_welco
drwxr-xr-x  3 lamav statlus 4096 Jan 11  2013 24290_convel
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13
find: `ls' terminated by signal 13

(...etc...)

My (definitely not the best) solution:

find . -uid 501 -exec ls -l {} \; 2>/dev/null | head

The disadvantage is that find itself isn't terminated after the required number of lines; it keeps running until it finishes (or until ^C), so better ideas are welcome.

Putnik
  • 945
0
find /some/log -type f -name *.log -exec rm {} ; | limit 5000

Well, the command as quoted will not work, of course (limit isn't even a valid command).

But if you run something similar to the find command above, it's probably a classic problem: you are having performance problems because find runs rm once for every single file.

You want to use xargs: it combines many file names into one command line, so it invokes rm only a limited number of times, each time for many files at once, which is much faster.
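To see the batching effect without deleting anything (a sketch; echo is prepended as a dry run), you can count how many rm invocations xargs would actually make:

# Each output line corresponds to one batched rm invocation
find /some/log -type f -name '*.log' -print0 | xargs -0 echo rm | wc -l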

sleske
  • 23,525
0

Your "performance issues" are probably because find … -exec rm {} \; runs one rm per matching file. find … -exec rm {} + should perform better. If your find supports -delete then find … -delete should perform even better.

But your explicit question is [emphasis mine]:

Is there a way to limit the number of results returned by the find command on a Unix system?

If "returned" means "printed to stdout", then find … -print | head … (which cannot handle arbitrary names well) or find … -print0 | head -z … (which is not portable) is the answer.

Still, you want to do something with the result. Piping to xargs (like in other answers you got) is fully reliable only if you use null-terminated records: find … -print0 | head -z … | xargs -0 …. This is not portable.
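For instance, filled in with the question's paths, the non-portable GNU pipeline could look like this (a sketch; add sort -z after find if you care which 5000 files go first):

find /some/log -type f -name '*.log' -print0 | head -z -n 5000 | xargs -0 rm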

The following code is a portable* way to make find process (in this case: remove) at most 5000 regular files with names matching *.log under /some/log:

while :; do echo; done | head -n 4999 \
| find /some/log -type f -name '*.log' -exec sh -c '
   for pathname do
      </dev/tty rm "$pathname" \
      && { read dummy || { kill -s PIPE "$PPID"; exit 0; } }
   done
' find-sh {} +

This is how the code works:

  • find starts sh and passes possibly many pathnames to it as arguments. There may be more than one sh started one after another, the number doesn't matter.

  • sh attempts to rm files one by one in a loop. After a successful remove operation it tries to read exactly one line from its stdin inherited from find.

  • while … | head -n 4999 (which could be yes | head -n 4999, but yes is not portable) generates exactly 4999 lines. Unless we run out of files first, exactly 4999 reads will succeed. The read after the 5000th successful remove operation will be the first read that fails.

  • A failed read occurs exactly after the 5000th successful remove operation. It causes two things:

    • find ($PPID, the parent process of sh) gets SIGPIPE, so it won't start more sh processes;
    • the current sh exits, so it won't process more pathnames.

Notes:

  • To remove 5000 files you need 4999 in the code.

  • I fixed your flawed -name *.log.

  • find-sh is explained here: What is the second sh in sh -c 'some shell code' sh?

  • The solution runs one rm per matching file. It won't perform better than your original code. It's an answer to your question about limiting the number. You asked for it, you got it.

  • The solution may be adapted to any action, not necessarily rm. In another answer of mine it's mv, but in general it can be anything (possibly in the form of a huge script). To just print, use printf.

  • Anything that uses find … -exec foo … {} … or find … | xargs … foo … is prone to a race condition. Between find finding the file and foo doing something, the path to the file may be manipulated, so foo sees a different file than the one tested by find. E.g. if a rogue party removes the file and places a symlink to another file in its place, then foo will possibly work with the wrong file. In case of rm this means removing the malicious symlink, not its target, so not that bad; but if the rogue plants a symlink in place of a subdirectory then rm may actually remove the wrong file. This is especially relevant when running find as root in a directory where others can create and remove files.

    -delete provided by GNU find removes the race condition where someone may be able to make you remove the wrong files by changing a directory to a symlink in-between the time find finds a file and rm removes it (see info -f find -n 'Security Considerations for find' for details). This is how you can limit the number of files deleted by -delete in a GNU system:

    yes | head -n 4999 \
    | find /some/log -type f -name '*.log' -delete \
      \( -exec sh -c 'read dummy' find-sh \; -o -quit \)
    

    The above code runs one sh per deleted file. The code below is somewhat simpler but noisy; it runs one true per deleted file.

    yes | head -n 4999 \
    | find /some/log -type f -name '*.log' -delete \( -ok true \; -o -quit \)
    

* AFAIK it's portable.