I'm downloading a site with wget, and a lot of the links have query strings attached to them, so when I do this:

wget -nv -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/

I end up with a lot of files like this:

1.mp3?foo=bar
2.mp3?blatz=pow
3.mp3?fizz=buzz

What I'd like to end up with is:

1.mp3
2.mp3
3.mp3

This is all taking place on Ubuntu Linux, and I've got wget 1.10.2.

I know I could rename everything with a script after the download finishes. However, I'd really like a solution from within wget itself, so I can see the correct names as the download is happening.

Can anyone help me unravel this?

9 Answers

If the server is kind, it might be sticking a Content-Disposition header on the download advising your client of the correct filename. Telling wget to listen to that header for the final filename is as simple as:

wget --content-disposition

You'll need a newish version of wget to use this feature.
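
Applied to the command from the question, it would look something like this (a sketch only; it needs a newer wget than the 1.10.2 mentioned in the question, and it only helps if the server actually sends the header):

wget -nv -c -r -H -A mp3 -nd --content-disposition http://url.to.old.podcasts.com/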

I have no idea how well it handles a server claiming a filename of '/etc/passwd'.

Filox

I realized after processing a large batch that I should have instructed wget to ignore the query strings. I did not want to do it over again so I made this script which worked for me:

#!/bin/bash
# Strip the query string from every filename under $1 (defaults to the
# current directory). Note: this loop breaks on filenames containing
# whitespace; see the improved version further down.
for i in `find $1 -type f`
do
    mv $i `echo $i | cut -d? -f1`
done

Put that in a file named, say, rmqstr, and chmod +x rmqstr. Syntax: ./rmqstr <directory> (defaults to the current directory).

It will remove the query strings from all filenames recursively.
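
For example (rmqstr is just a suggested name, and the target directory here is hypothetical):

chmod +x rmqstr
./rmqstr ~/podcasts    # or plain ./rmqstr to clean up the current directory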

jox

I think that in order to get wget to save under a filename different from the one the URL specifies, you need to use the -O filename argument. That only does what you want when you give it a single URL; with multiple URLs, all downloaded content ends up in filename.

But that's really the answer. Instead of trying to do it all in one wget command, use multiple commands. Now your workflow becomes:

  1. Run wget to get the base HTML file(s) containing your links;
  2. Parse for URLs;
  3. Foreach URL ending in mp3,
    1. process URL to get a filename (e.g. turn http://foo/bar/baz.mp3?gargle=blaster into baz.mp3)
    2. (optional) check that filename doesn't exist
    3. run wget <URL> -O <filename>

That solves your problem, but now you need to figure out how to grab the base files to find your mp3 URLs.

Do you have a particular site/base URL in mind? Steps 1 and 3 will be easier to handle with a concrete example, but a rough sketch of the whole workflow follows below.
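
Something like this, perhaps; the base URL is the one from the question, and the grep/sed link extraction is a guess about the page structure (it assumes absolute mp3 URLs in plain href attributes):

#!/bin/bash
base="http://url.to.old.podcasts.com/"
# step 1: grab the base page
wget -q -O index.html "$base"
# step 2: pull the mp3 links out of it
grep -oE 'href="[^"]*\.mp3[^"]*"' index.html |
  sed -e 's/^href="//' -e 's/"$//' |
while read -r url; do
    # step 3.1: strip the query string, keep the basename
    name=$(basename "${url%%\?*}")
    # step 3.2: skip filenames that already exist
    [ -e "$name" ] && continue
    # step 3.3: download under the clean name
    wget -nv "$url" -O "$name"
done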

quack quixote

Look at these two commands I created to clone a site; after the clone is done, you can execute the second one.

The second command looks through the entire clone, finds filenames containing "?", and removes the query string from each of them.

# Clone the entire site.
wget --content-disposition --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com

# Remove the query strings from the downloaded filenames.
for i in `find . -type f -name "*\?*"`; do mv $i `echo $i | cut -d? -f1`; done

(See it in GitHub Gist.)

MarianD

In order to rename the files properly, you have to account for spaces in filenames; they are a real possibility and will break the for loop above.

Here is an improved version:

find . -type f -name "*\?*" -print0 |
while IFS= read -r -d '' file
do
    mv -f "$file" "$(echo "$file" | cut -d'?' -f1)"
done

This ensures that filenames with spaces are handled properly, both by the loop (using \0 as the delimiter) and by the mv command (the double quotes).

There were only a couple of complex cases where it did not work, but otherwise this is the best option.

TrYde

I have a similar approach to @Gregory Wolf's, because his code always created error messages like this:

mv: './file' and './file' are the same file

Thus I first check if there is a query string in the filename before moving the file:

# Note: like the original loop, this splits on whitespace in filenames.
for f in $(find $1 -type f); do
    if [ "$f" = "${f%%\?*}" ]; then continue; fi
    mv "$f" "${f%%\?*}"
done

This will recursively check every file and remove the query string from its filename, if there is one.

KittMedia

This answer isn't intended as a method to rename files after downloading; that's been answered already. Instead, I want to suggest what to do when you realise that the files downloaded with a query string are mere duplicates that you want to exclude. This can happen on, for example, a WordPress site, where paths normally don't include a query string. All you need to do is not attempt to download any link containing a '?', e.g.:

wget -km --reject-regex '.*\?.*' https://tomlehrersongs.com/

so I can see the correct names as the download is happening.

OK. Use wget as you normally do; use the post-wget script that you normally use, but process wget's output so that it's easier on the eyes:

#!/bin/sh
# "$@" (rather than unquoted $*) keeps arguments with spaces intact.
wget --progress=bar:force "$@" 2>&1 |
  perl -pe 'BEGIN { $| = 1 } s,(?<=`)([^\x27?]+),\e[36;1m$1\e[0m, if /^Saving/'
cgi-cut # rename files

This will still show the ?foo=bar as you download, but will display the rest of the name in bright cyan.
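
For instance, saved as wget-color (the name is arbitrary), it drops in for plain wget in the original command; note that the question's -nv flag is left out, since the perl filter keys on wget's normal "Saving to:" lines:

./wget-color -c -r -H -A mp3 -nd http://url.to.old.podcasts.com/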

ayrnieu

Even easier is this: https://unix.stackexchange.com/questions/196253/how-do-you-rename-files-specifically-in-a-list-that-wget-will-use

It suggests a method that essentially uses wget's rename option, -O (which can be altered to include a directory), for multiple files. See the second version proposed there.
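
A minimal sketch of that idea, assuming a list file with one "URL filename" pair per line (the list format and the name urls.txt are assumptions, not what the linked answer prescribes verbatim):

#!/bin/bash
# urls.txt (hypothetical) holds lines like:
#   http://url.to.old.podcasts.com/1.mp3?foo=bar 1.mp3
while read -r url name; do
    wget -nv -c "$url" -O "$name"
done < urls.txt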