
I have a directory that has 10144911 files in it. So far I've tried the following:

  • for f in `ls`; do sed -i -e 's/blah/blee/g' $f; done

Crashed my shell. The ls should be in backticks, but I can't figure out how to make one display here.

  • ls | xargs -0 sed -i -e 's/blah/blee/g'

Too many args for sed

  • find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} \;

Couldn't fork any more; ran out of memory.

Any other ideas on how to create this kind of command? The files don't need to communicate with each other. ls | wc -l seems to work (very slowly), so it must be possible.

Sandro

5 Answers


Give this a try:

find -name '*.txt' -print0 | xargs -0 -I {} -P 0 sed -i -e 's/blah/blee/g' {}

It will feed only one filename to each invocation of sed, which solves the "too many args for sed" problem. The -P option allows multiple sed processes to be forked at the same time. If 0 doesn't work (it's supposed to run as many as possible), try other numbers (10? 100? the number of cores you have?) to limit the number.
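
If one sed per file turns out to be too slow, a batched variant should also work (a sketch, assuming GNU xargs and GNU sed; the -n 1000 and -P 4 values are just starting points to tune):

find . -name '*.txt' -print0 | xargs -0 -n 1000 -P 4 sed -i -e 's/blah/blee/g'

Here -n replaces -I, so each sed invocation receives up to 1000 filenames instead of exactly one, which greatly reduces the number of forks.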


I've tested this method (and all the others) on 10 million (empty) files, named "hello 00000001" to "hello 10000000" (14 bytes per name).

UPDATE: I've now included a quad-core run of the 'find | xargs' method (still without sed; just echo >/dev/null).

# Step 1. Build an array for 10 million files
#   * RAM usage approx:  1.5 GiB 
#   * Elapsed Time:  2 min 29 sec 
  names=( hello\ * )

# Step 2. Process the array.
#   * Elapsed Time:  7 min 43 sec
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done  
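
For reference, the real run over the same array would just swap the echo for sed (a sketch only; it was not part of the timed tests above, and sed itself will dominate the elapsed time):

# Step 2 (alternative). Run sed over the array instead of echo.
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do sed -i -e 's/blah/blee/g' "${names[ix]}" ; done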

Here is a summary of how the provided answers fared when run against the test data mentioned above. These results involve only the basic overheads; i.e. sed was not called. The sed process will almost certainly be the most time-consuming part, but I thought it would be interesting to see how the bare methods compared.

Dennis's 'find | xargs' method, using a single core, took 4 hours 21 minutes longer than the bash array method on a no-sed run... However, the multi-core option offered by that method should outweigh the time differences shown once sed is actually called to process the files...

           | Time    | RAM GiB | Per loop action(s). / The command line. / Notes
-----------+---------+---------+----------------------------------------------------- 
Dennis     | 271 min | 1.7 GiB | * echo FILENAME >/dev/null
Williamson   cores: 1x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} echo >/dev/null {}
                               | Note: I'm very surprised at how long this took to run the 10 million file gauntlet.
                               |       It started processing almost immediately (because of xargs, I suppose),
                               |       but it ran significantly slower than the only other working answer
                               |       (again, probably because of xargs). If the multi-core feature works,
                               |       and I would think that it does, it could make up the deficit in a 'sed' run.
           |  76 min | 1.7 GiB | * echo FILENAME >/dev/null
             cores: 4x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} -P 0 echo >/dev/null {}
                               |  
-----------+---------+---------+----------------------------------------------------- 
fred.bear  | 10m 12s | 1.5 GiB | * echo FILENAME >/dev/null
                               | $ time names=( hello\ * ) ; time for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
-----------+---------+---------+----------------------------------------------------- 
l0b0       | ?@#!!#  | 1.7 GiB | * echo FILENAME >/dev/null 
                               | $ time  while IFS= read -rd $'\0' path ; do echo "$path" >/dev/null ; done < <( find "$HOME/junkd" -type f -print0 )
                               | Note: It started processing filenames after 7 minutes.. at this point it  
                               |       started lots of disk thrashing.  'find' was using a lot of memory, 
                               |       but in its basic form, there was no obvious advantage... 
                               |       I pulled the plug after 20 minutes.. (my poor disk drive :(
-----------+---------+---------+----------------------------------------------------- 
intuited   | ?@#!!#  |         | * print line (to see when it actually starts processing, but it never got there!)
                               | $ ls -f hello * | xargs python -c '
                               |   import fileinput
                               |   for line in fileinput.input(inplace=True):
                               |       print line ' 
                               | Note: It failed at 11 min and approx 0.9 GiB
                               |       ERROR message: bash: /bin/ls: Argument list too long  
-----------+---------+---------+----------------------------------------------------- 
Reuben L.  | ?@#!!#  |         | * One var assignment per file
                               | $ ls | while read file; do x="$file" ; done 
                               | Note: It bombed out after 6min 44sec and approx 0.8 GiB
                               |       ERROR message: ls: memory exhausted
-----------+---------+---------+----------------------------------------------------- 
Peter.O

Another option, using the completely safe find:

# Read NUL-delimited paths so filenames containing spaces or newlines are safe
while IFS= read -rd $'\0' path
do
    # Canonicalize the path; the extra "x" (stripped on the next line) protects
    # any trailing newlines in the filename from being eaten by $( ... )
    file_path="$(readlink -fn -- "$path"; echo x)"
    file_path="${file_path%x}"
    sed -i -e 's/blah/blee/g' -- "$file_path"
done < <( find "$absolute_dir_path" -type f -print0 )
l0b0

Try:

ls | while IFS= read -r file; do sed -i -e 's/blah/blee/g' "$file"; done
Reuben L.

This is mostly off-topic, but you could use

find -maxdepth 1 -type f -name '*.txt' -print0 | xargs -0 python -c '
import fileinput
for line in fileinput.input(inplace=True):
    print line.replace("blah", "blee"),
'

The main benefit here (over ... xargs ... -I {} ... sed ...) is speed: you avoid invoking sed 10 million times. It would be faster still if you could avoid Python altogether (Python is relatively slow at this), so perl might be a better choice for this task, though I'm not sure of the most convenient way to write the perl equivalent.
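
A rough guess at a perl equivalent (an untested sketch, assuming GNU find plus perl's -i and -p switches):

find -maxdepth 1 -type f -name '*.txt' -print0 | xargs -0 perl -i -pe 's/blah/blee/g'

As with the Python version, xargs packs as many filenames as it can into each perl invocation.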

The way this works is that xargs will invoke Python with as many arguments as it can fit on a single command line, and keep doing that until it runs out of arguments (which are being supplied by find). The number of arguments to each invocation depends on the length of the filenames and on the system's limit on total command-line length. The fileinput.input function yields successive lines from the files named in each invocation's arguments, and the inplace option tells it to magically "catch" the output and use it to replace each line.

Note that Python's string replace method doesn't use regexps; if you need those, you have to import re and use print re.sub("blah", "blee", line),. Python's regexes are Perl-style, sort of heavily fortified versions of the ones you get with sed -r.

edit

As akira mentions in the comments, the original version using a glob (ls -f *.txt) in place of the find command wouldn't work because globs are processed by the shell (bash) itself. This means that before the command is even run, 10 million filenames will be substituted into the command line. This is pretty much guaranteed to exceed the maximum size of a command's argument list. You can use xargs --show-limits for system-specific info on this.

The maximum size of the argument list is also taken into account by xargs, which limits the number of arguments it passes to each invocation of python according to that limit. Since xargs will still have to invoke python quite a few times, akira's suggestion to use os.path.walk to get the file listing will probably save you some time.

intuited