
I am using curl to get a URL and then writing the output to a file, like this:

urls=( 
  'https://www.example1.com'
  'https://www.example2.com'
 )

for i in ${urls[@]}; do curl $i & done
echo 'stuff'

I have deliberately simplified the code so that the exact problem can be tackled.

Output:

stuff
$curlContents1
$curlContents2

I know why this happens: it's running asynchronously.

What I want to know

  • I want to run this async command but with the output the same as it would be if I had run it synchronously.
  • This is because running it async gives a nice speed boost.

Desired output:

$curlContents1
$curlContents2
stuff

more info

  • my actual problem is a bit different…

What I'm doing is downloading videos, then taking the last part of the URL and using it as the file name. How can I use parallel in this example?

The write happens before the download, as the downloads are the most time-consuming part.

arr=(
  'https://www.example1.com/stccdtu.mp4'
  'https://www.example2.com/dyubdf.mp4'
 )

for i in ${arr[@]}; do curl $i > echo $i | sed s'#https://www.example[0-9].com/##'g & done

Output:

ll

0 stccdtu.mp4
0 dyubdf.mp4

2 Answers


With GNU parallel. Basic example:

parallel -j 40 --group --keep-order curl ::: "${urls[@]}"
echo 'stuff'

-j 40 means we assign 40 job slots, i.e. we allow up to 40 parallel jobs (adjust it to your needs and abilities). If you supply more URLs, then the 41st one will be processed only after some slot becomes available. All URLs will be processed, but at any moment there will be at most 40 jobs running in parallel.
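
As a rough illustration of the job-slot behaviour (a toy example of mine, not from the original answer): with two slots and three one-second sleeps, the third job can only start once one of the first two has finished, so the whole run takes roughly two seconds instead of one.

# three jobs, but at most two may run at any moment
time parallel -j 2 sleep ::: 1 1 1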

Other options used:

--group
Group output. Output from each job is grouped together and is only printed when the command is finished. Stdout (standard output) first followed by stderr (standard error). […]

(source)

which is the default, so usually you don't have to use it explicitly.

--keep-order
-k
Keep sequence of output same as the order of input. Normally the output of a job will be printed as soon as the job completes. […] -k only affects the order in which the output is printed - not the order in which jobs are run.

(source)
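
To see --keep-order in action, here is a small sketch of mine (not part of the original answer): the jobs finish in reverse order because of the different sleep times, yet the output appears in the input order 3, 2, 1.

# job '1' finishes first, but -k prints the results in input order
parallel -j 3 --keep-order 'sleep {}; echo "slept {}"' ::: 3 2 1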

Notes:

  • In my example parallel is not in the background and is run synchronously (so echo runs after it); still, the curls run in parallel, asynchronously.

  • In Debian, GNU parallel is in a package named parallel. The basic variant of the tool (from moreutils, at least in Debian) is less powerful.

  • parallel is an external command. If the array is large enough, then with parallel … ::: "${urls[@]}" you will hit "argument list too long". Use this instead:

    printf '%s\n' "${urls[@]}" | parallel …
    

    It will work because in Bash printf is a builtin and therefore everything before | is handled internally by Bash. A complete pipeline in this form is sketched right after these notes.

  • ${urls[@]} is properly double-quoted (in your code ${urls[@]} and $i are unquoted, which is wrong).
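
Putting that last note into practice, a minimal sketch of the pipe form, assuming the same urls array as above:

# URLs arrive on parallel's stdin instead of on its argument list,
# so the kernel's argument-length limit never comes into play
printf '%s\n' "${urls[@]}" | parallel -j 40 --group --keep-order curl
echo 'stuff'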


GNU parallel can call exported Bash functions. This allows us to solve what you called the actual problem:

getvideo() {
  curl "$1" > "${1##*/}"
}
export -f getvideo

urls=( 'https://www.example1.com/stccdtu.mp4' 'https://www.example2.com/dyubdf.mp4' )

parallel -j 40 --group --keep-order getvideo ::: "${urls[@]}"
echo 'stuff'

If you don't know what ${1##*/} does, read this other answer of mine.
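
As a quick illustration of that expansion (my own toy example): ${1##*/} removes the longest prefix matching */, i.e. everything up to and including the last slash, leaving only the file name.

url='https://www.example1.com/stccdtu.mp4'
echo "${url##*/}"    # prints: stccdtu.mp4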


The Bash shell has the wait command, which pauses the script until background jobs have finished.

Waits for each process identified by an ID, which may be a process ID or a job specification, and reports its termination status. If ID is not given, waits for all currently active child processes, and the return status is zero. If ID is a job specification, waits for all processes in that job's pipeline.

for i in "${urls[@]}"; do
   curl "$i" &
done
wait
echo 'stuff'
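
Applied to the video downloads from the question, a minimal sketch with wait (assuming the same array of .mp4 URLs, here called arr) could look like this:

arr=(
  'https://www.example1.com/stccdtu.mp4'
  'https://www.example2.com/dyubdf.mp4'
)

for i in "${arr[@]}"; do
  # name each file after the last component of its URL
  curl "$i" > "${i##*/}" &
done
wait            # block until every background curl has finished
echo 'stuff'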