Mass grabbing part of HTML source code using shell scripts

Question

I want to download all the recorded shows of a radio show listed on this page: http://www.ellinofreneianet.gr/sounds.php?s=0&p=10&o=l

Each show has its own page of this form: http://www.ellinofreneianet.gr/sound.php?id=7101
From each of these roughly 7,000 pages I want to grab line 422 of the source code, where the download link is located.
It does not have to be done by line number; matching the regular expression ".=podcast/." works too.

How can I grab line 422 of every page of that type, OR extract the "=podcast/****.mp3" part, using shell scripts/commands?

Volker Siegel · Accepted Answer · 2014-09-17T16:53:25.317

Something like this?

for i in {7101..7200} ; do  wget -q -O - http://www.ellinofreneianet.gr/sound.php\?id\=$i | grep ".=podcast/." ; done

The wget options are -q (quiet: show no progress output etc.) and -O - (write the downloaded page to stdout instead of a file).
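The question also asks about grabbing line 422 directly. A minimal sketch of that approach, assuming the link really sits on the same line number on every page (which breaks as soon as a page's layout shifts by one line); nth_line is a hypothetical helper name, and seq stands in for the page source so the example runs offline:

```shell
# nth_line N: print only line N of whatever arrives on stdin.
# (Hypothetical helper, not part of the original answer.)
nth_line() {
    sed -n "${1}p"
}

# Offline stand-in for a page: 500 numbered lines.
seq 1 500 | nth_line 422
```

Against the real site this would be used as wget -q -O - "http://www.ellinofreneianet.gr/sound.php?id=$i" | nth_line 422, but matching on the link text as above is likely more robust than relying on a line number.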

Not every page has an mp3 link there. Some even show a page that could be the 404 error page. The pages starting from 0 also seem to be empty.

The empty pages have URLs ending in podcast/", so we can exclude them by matching only strings that do not have a " at that position:

... | grep ".=podcast/[^\"]"
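To see what this filter does without hitting the server, here is a small sketch on two made-up lines. The markup in sample (including the file= parameter name) is an assumption for illustration only; the real pages may look different:

```shell
# Hypothetical sample lines: one with a file name, one "empty" link.
sample='<a href="bitsnbytesplayer.php?file=podcast/show7101.mp3">listen</a>
<a href="bitsnbytesplayer.php?file=podcast/">listen</a>'

# Keep only lines where "=podcast/" is followed by something other than a quote.
printf '%s\n' "$sample" | grep '.=podcast/[^"]'
```

Only the first sample line is printed. Single quotes around the pattern avoid having to escape the " inside the bracket expression.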

To get only the .mp3 URLs, use

... | grep -o 'bitsnbytesplayer.php.*\.mp3'
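A quick offline check of the -o extraction, again on a made-up line (the surrounding markup and the file= parameter are assumptions; extract_mp3 is a hypothetical helper name):

```shell
# extract_mp3: read HTML on stdin, print only the part of each line that
# runs from "bitsnbytesplayer.php" to the last ".mp3" on that line.
extract_mp3() {
    grep -o 'bitsnbytesplayer\.php.*\.mp3'
}

printf '%s\n' '<a href="bitsnbytesplayer.php?file=podcast/show7101.mp3">listen</a>' | extract_mp3
```

Note that .* is greedy: if a single line ever contained two .mp3 links, the match would span from the first bitsnbytesplayer.php to the last .mp3 on that line.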

You found out yourself how to output the page URL before each mp3 URL. Here's an optimized variant of that, using only one HTTP request per page:

for i in {7100..7200} ; do \
    wget -q -O - http://www.ellinofreneianet.gr/sound.php\?id\=$i | \
    grep -o 'bitsnbytesplayer.php.*\.mp3' && \
    echo http://www.ellinofreneianet.gr/sound.php\?id\=$i ; done | sed -n 'h;n;p;g;p'

The && echo ... prints the page URL only if the preceding grep found an mp3 URL. The sed command swaps the two lines of each pair, so that the page URL comes before its mp3 URL.
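The pair swap can be tried in isolation. A small sketch with placeholder lines instead of real URLs (swap_pairs is a hypothetical helper name):

```shell
# swap_pairs: for every two input lines, print the second one first.
#   h  copy the first line of the pair to the hold space
#   n  read the second line (nothing is auto-printed because of -n)
#   p  print the second line
#   g  fetch the first line back from the hold space
#   p  print the first line
swap_pairs() {
    sed -n 'h;n;p;g;p'
}

printf '%s\n' 'mp3-A' 'page-A' 'mp3-B' 'page-B' | swap_pairs
```

This prints page-A, mp3-A, page-B, mp3-B, one per line. It relies on every page contributing exactly two lines; if grep -o ever printed more than one match for a page, the pairing would get out of step.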
