11

I want to process many files and since I've here a bunch of cores I want to do it in parallel:

for i in *.myfiles; do do_something $i `derived_params $i` other_params; done

I know of a Makefile solution but my commands needs the arguments out of the shell globbing list. What I found is:

> function pwait() {
>     while [ $(jobs -p | wc -l) -ge $1 ]; do
>         sleep 1
>     done
> }
>

To use it, all one has to do is put & after the jobs and a pwait call, the parameter gives the number of parallel processes:

> for i in *; do
>     do_something $i &
>     pwait 10
> done

But this doesn't work very well, e.g. I tried it with e.g. a for loop converting many files but giving me error and left jobs undone.

I can't belive that this isn't done yet since the discussion on zsh mailing list is so old by now. So do you know any better?

math
  • 2,693

5 Answers5

15

A makefile is a good solution to your problem. You could program this parallel execution in a shell, but it's hard, as you've noticed. A parallel implementation of make will not only take care of starting jobs and detecting their termination, but also handle load balancing, which is tricky.

The requirement for globbing is not an obstacle: there are make implementations that support it. GNU make, which has wildcard expansion such as $(wildcard *.c) and shell access such as $(shell mycommand) (look up functions in the GNU make manual for more information). It's the default make on Linux, and available on most other systems. Here's a Makefile skeleton that you may be able to adapt to your needs:

sources = $(wildcard *.src)

all: $(sources:.src=.tgt)

%.tgt: %.src do_something $< $$(derived_params $<) >$@

Run something like make -j4 to execute four jobs in parallel, or make -j -l3 to keep the load average around 3.

9

I am not sure what your derived arguments are like. But with GNU Parallel http:// www.gnu.org/software/parallel/ you can do this to run one job per cpu core:

find . | parallel -j+0 'a={}; name=${a##*/}; upper=$(echo "$name" | tr "[:lower:]" "[:upper:]");
   echo "$name - $upper"'

If what you want to derive is simply changing the .extension the {.} may be handy:

parallel -j+0 lame {} -o {.}.mp3 ::: *.wav

Watch the intro video to GNU Parallel at http://www.youtube.com/watch?v=OpaiGYxkSuQ

Ole Tange
  • 446
7

Wouldn't using the shell's wait command work for you?

for i in *
do
    do_something $i &
done
wait

Your loop executes a job then waits for it, then does the next job. If the above doesn't work for you, then yours might work better if you move pwait after done.

3

Why has nobody mentioned xargs yet?

Assuming you have exactly three arguments,

for i in *.myfiles; do echo -n $i `derived_params $i` other_params; done | xargs -n 3 -P $PROCS do_something

Otherwise use a delimiter (null is handy for that):

for i in *.myfiles; do echo -n $i `derived_params $i` other_params; echo -ne "\0"; done | xargs -0 -n 1 -P $PROCS do_something

EDIT: for the above, each parameter should be separated by a null character, and then the number of parameters should be specified with the xargs -n.

0

I tried some of the answers. They make the script a bit more complex than that is needed. Ideally using parallel or xargs would be preferable however if the operations inside the for loop is complicated it could be problematic to create a big and long lines files to supply to parallel. instead we could use source as follows

# Create a test file 
$ cat test.txt
task_test 1
task_test 2

# Create a shell source file 
$ cat task.sh
task_test()
{
    echo $1
}

# use the source under bash -c 
$ cat test.txt | xargs -n1 -I{} bash -c 'source task.sh; {}'
1
2

Thus for your problem solution would look like

for i in *.myfiles; echo " do_something $i `derived_params $i` other_params
" >> commands.txt ; done

define do something as do_something.sh

do_something(){
process $1
echo $2 
whatever $3 

}

execute with xarg or gnu parallel

   cat commands.txt | xargs -n1 -I{} -P8 bash -c 'source do_something.sh; {}'

I assume functional independence of iterations of for is implied.