
I'd like to preface this by saying that I'm very new to the command prompt; I've only been using it for some wget and youtube-dl, and I'm on a Windows 8 PC.

I'd like to get a bunch of links from an html file. The links all start with

https://s-media-cache-ak0.pinimg.com/originals/

and end with

.jpg

Right now I'm using this:

findstr ^https://s-media-cache-ak0.pinimg.com/originals/.*\.jpg index.html > urls.txt

I did some research and I'm using FINDSTR's regular-expression support, as you can see. But I still get a lot of extra text that I'm not interested in. Is there any way to trim it down?

1 Answer


As this StackOverflow answer states, you really shouldn't attempt to parse [X]HTML with regex. findstr has very limited regex support in any case.

Use a proper HTML scraper/parser like Xidel instead. A command like the following will do what you're looking for:

xidel <URL or HTML file name> -q -e "//a/extract(@href/resolve-uri(.), 'https:\/\/s-media-cache-ak0\.pinimg\.com\/originals\/.*?\.jpg')[. != '']"
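If installing Xidel isn't an option, here is a minimal sketch of the same idea in Python (assuming Python 3 is available on the machine), using the standard library's HTML parser rather than regex over raw markup. The file and attribute names are just illustrative:

```python
import re
from html.parser import HTMLParser

# Pattern for the Pinterest "originals" image URLs described in the question.
PATTERN = re.compile(r'https://s-media-cache-ak0\.pinimg\.com/originals/\S*?\.jpg')

class LinkCollector(HTMLParser):
    """Collects href/src attribute values that fully match PATTERN."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        for name, value in attrs:
            if name in ('href', 'src') and value and PATTERN.fullmatch(value):
                self.links.append(value)

def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links

# Hypothetical usage against a saved page:
# html = open('index.html', encoding='utf-8', errors='replace').read()
# open('urls.txt', 'w').write('\n'.join(extract_links(html)))
```

Unlike the findstr approach, this only ever looks at attribute values, so surrounding markup never leaks into the output.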
Karan