
I have a long list of URLs for my own website, listed in a carriage-return-separated text file. So for instance:

  • http://www.mysite.com/url1.html
  • http://www.mysite.com/url2.html
  • http://www.mysite.com/url3.html

I need to spawn a number of parallel wgets to hit each URL twice, check for and retrieve a particular header and then save the results in an array which I want to output in a nice report.

I have part of what I want by using the following xargs command:

xargs -x -P 20 -n 1 wget --server-response -q -O - --delete-after < ./urls.txt 2>&1 | grep Caching

The question is how do I run this command twice and store the following:

  1. The URL hit
  2. The 1st result of the grep against the Caching header
  3. The 2nd result of the grep against the Caching header

So the output should look something like:

=====================================================
http://www.mysite.com/url1.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

=====================================================
http://www.mysite.com/url2.html
=====================================================
First Hit: Caching: MISS
Second Hit: Caching: HIT

And so forth.

The order in which the URLs appear isn't necessarily a concern, as long as the header(s) are associated with the URL.

Because of the number of URLs, I need to hit them in parallel, not serially; otherwise it will take way too long.

The trick is how to get multiple parallel wgets running AND store the results in a meaningful way. I'm not married to using an array if there is a more logical way of doing this (maybe writing to a log file?).

Do any bash gurus have any suggestions for how I might proceed?

Brad

3 Answers


Make a small script that does the right thing given a single url (based on terdon's code):

#!/bin/bash

url="$1"
echo "======================================="
echo "$url"
echo "======================================="
echo -n "First Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
echo -n "Second Hit: Caching: "
if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
echo ""

Then run this script in parallel (say, 500 jobs at a time) using GNU Parallel:

cat urls.txt | parallel -j500 my_script

GNU Parallel will make sure the output from two processes is never mixed together, a guarantee that xargs does not give.
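Since each job's output stays together, you can also redirect the whole run into a single report file, which covers the "maybe writing to a log file" idea from the question. A minimal sketch, assuming the script above has been saved as my_script in the current directory and made executable:

chmod +x my_script
parallel -j500 ./my_script < urls.txt > report.txt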

You can find more about GNU Parallel at: http://www.gnu.org/s/parallel/

You can install GNU Parallel in just 10 seconds with:

wget -O - pi.dk/3 | sh 

Watch the intro video on http://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Ole Tange

One trivial solution would be to log the output of each wget command to a separate file and use cat to merge them afterwards.
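A rough sketch of that idea (hitting each URL once for brevity; the two-hit grep logic from the question would go in place of the single wget, and the tr-based filename mangling is just one possible choice):

mkdir -p logs
while IFS= read -r url; do
  # turn the URL into a safe filename, one log per URL
  fname="logs/$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')"
  # headers go to stderr, so log stderr and discard the page body
  wget --server-response -q -O - "$url" > /dev/null 2> "$fname" &
done < urls.txt
wait
cat logs/* > report.txt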

l0b0

I will assume that your file is newline-separated, not carriage-return-separated, because the command you give will not work with a \r-separated file.

If your file is using \r instead of \n for line endings, change it to using \n by running this:

perl -i -pe 's/\r/\n/g' urls.txt 

If you are using Windows style (\r\n) line endings, use this:

perl -i -pe 's/\r//g' urls.txt 

Now, once you have your file in Unix form, if you don't mind your jobs not being run in parallel, you can do something like this:

while IFS= read -r url; do
  echo "======================================="
  echo "$url"
  echo "======================================="
  echo -n "First Hit: Caching: "
  if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
  echo -n "Second Hit: Caching: "
  if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
  echo ""
done < urls.txt

UPDATE in response to your comment:

If you have 22,000 URLs, I can indeed understand why you want to do this in parallel. One thing you could try is creating tmp files:

(while IFS= read -r url; do
 (
  echo "======================================="
  echo "$url"
  echo "======================================="
  echo -n "First Hit: Caching: "
  if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
  echo -n "Second Hit: Caching: "
  if wget --server-response -q -O - "$url" 2>&1 | grep -q Caching; then echo HIT; else echo MISS; fi
  echo ""
 ) > "$(mktemp urltmpXXX)" 2>/dev/null &
done < urls.txt)

There are two subshells launched there. The first, (while ... < urls.txt), is just there to suppress job-completion messages. The second, ( echo "=== ... ) > "$(mktemp urltmpXXX)", is there to collect all output for a given URL into one file.

The script above will create 22,000 tmp files called urltmpXXX, where the XXX is replaced by random characters. Since each tmp file will contain 6 lines of text once it has finished, you can monitor progress (and optionally delete the files) with this command:

b=$(awk 'END{print NR}' urls.txt)
while true; do
  a=$(wc -l urltmp* | grep total | awk '{print $1}')
  if [ "$a" -eq $((6 * b)) ]; then cat urltmp* > urls.out; break
  else sleep 1; fi
done

Now the other problem is that this will launch 22,000 jobs at once. Depending on your system, this may or may not be a problem. One way around it is to split your input file into smaller chunks and then run the above loop once for each chunk, for example as sketched below.
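A minimal sketch of that batching with the coreutils split command, assuming the per-URL block above has been saved as a standalone script called check_url.sh (a name used here purely for illustration) and that a chunk size of 500 is acceptable:

# split the URLs into chunks of 500 lines each
split -l 500 urls.txt urls_chunk_

# process one chunk at a time, waiting for each batch to finish
for chunk in urls_chunk_*; do
  while IFS= read -r url; do
    ./check_url.sh "$url" > "$(mktemp urltmpXXX)" 2>/dev/null &
  done < "$chunk"
  wait
done

# merge the per-URL files into the final report
cat urltmp* > urls.out

Because wait runs between chunks, all jobs are known to have finished by the time cat runs, so the polling loop above is no longer needed.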

terdon