Questions tagged [web-crawler]
77 questions
30
votes
5 answers
Convert web pages to one file for ebook
I want to download HTML pages (example: http://www.brpreiss.com/books/opus6/) and join them into one HTML file or some other format that I can use on an ebook reader. Sites with free books don't have standard paging, and they're not blogs or forums, so I don't know how to…
Hrvoje Hudo
- 582
26
votes
2 answers
How to crawl using wget to download ONLY HTML files (ignore images, css, js)
Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (e.g. imagery, CSS, JS, etc.). I only want the HTML files.
Google searches are completely useless.
Here's a command I've tried:
wget…
Nathan J.B.
- 723
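For the wget question above, a reject list is usually more reliable than `--accept html`, because many pages are served without an `.html` extension and an accept list would skip them. A minimal, untested command sketch (example.com is a placeholder for the target site):

```shell
# Crawl recursively, skipping common asset extensions; HTML pages are still
# fetched for link discovery even when they have no .html suffix.
wget --recursive --level=inf --no-parent \
     --reject 'jpg,jpeg,png,gif,svg,ico,css,js' \
     http://example.com/
```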
21
votes
1 answer
How to save all files/links from a telegram chat/channel?
I want to save ALL http(s) links and/or files posted to some Telegram chat (private or group) or channel (like a mailing list).
I need an analog of TumblOne (for Tumblr), VkOpt (able to save chat history on vk.com), or jDownloader (for file…
WallOfBytes
- 437
17
votes
4 answers
Using Wget to Recursively Crawl a Site and Download Images
How do you instruct wget to recursively crawl a website and only download certain types of images?
I tried using this to crawl a site and only download JPEG images:
wget --no-parent --wait=10 --limit-rate=100K --recursive --accept=jpg,jpeg…
Cerin
- 9,652
14
votes
2 answers
Why is @ in email addresses sometimes written as [at] on webpages?
Why is @ sometimes written as [at] on webpages? Is there a specific reason for it?
Sai
- 167
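The obfuscation exists to defeat address-harvesting crawlers that match plain `@` patterns, and it only stops the simplest of them: reversing it is a single substitution. A minimal sketch with sed, assuming the common spaced `[at]`/`[dot]` spelling:

```shell
# Undo "[at]"/"[dot]"-style obfuscation (assumes the spaced spelling shown)
echo "john [at] example [dot] com" | sed -e 's/ \[at\] /@/' -e 's/ \[dot\] /./g'
# prints: john@example.com
```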
12
votes
4 answers
How "legal" is site-scraping using cURL?
Recently I was experimenting with cURL, and I found that a lot is possible with it. I built a small script that crawls a music site that plays songs online. In the course of my experiments, I found that it is possible to crawl the song source as well…
Chetan Sharma
- 487
7
votes
1 answer
wget: recursively retrieve urls from specific website
I'm trying to recursively retrieve all possible URLs (internal page URLs) from a website.
Can you please help me out with wget? Or is there a better alternative to achieve this?
I do not want to download any content from the website, but just…
abhiomkar
- 171
6
votes
3 answers
Is it possible to discover all the files and sub-directories of a URL?
I wonder if there is software I can use to discover all the files and sub-directories under a given URL?
For example, given www.some-website.com/some-directory/, I would like to find all the files in /some-directory/ directory as well as all…
Mark
- 63
6
votes
4 answers
What do I use to download all PDFs from a website?
I need to download all the PDF files present on a site. Trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of the files. I tried WinHTTrack,…
user385496
4
votes
2 answers
How can I scrape specific data from a website
I'm trying to scrape data from a website for research.
The URLs are nicely organized in an example.com/x format, with x an ascending number, and all of the pages are structured the same way. I just need to grab certain headings and a few…
Stoney
- 244
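With an ascending-number URL scheme like that, a fetch loop plus a text-extraction step is the usual low-tech approach. A sketch, shown against a locally saved stand-in page so the extraction step is concrete; the `example.com/$i` scheme and the `<h1>` pattern are assumptions about the questioner's site:

```shell
# Fetching the numbered pages would look like (not run here):
#   for i in $(seq 1 100); do curl -s "https://example.com/$i" -o "page$i.html"; done
#
# Then pull the headings out of each saved page. Demonstrated on a stand-in
# file; for pages that aren't uniformly structured, use a real HTML parser.
printf '<html><h1>Chapter One</h1><p>text</p></html>' > page1.html
grep -o '<h1>[^<]*</h1>' page1.html | sed 's/<[^>]*>//g'
# prints: Chapter One
```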
4
votes
1 answer
Extract data from an online atlas
There is an online atlas that I would like to extract values from. The atlas provides a tool ('Query') to extract values when you click a location or enclose a region on the map, or you can specify the latitude/longitude of a point where you want…
KAE
- 1,919
4
votes
2 answers
Why can't website copy tools like Cyotek WebCopy and HTTrack find files that search engines like Google can?
I would like to keep the target website private, but here are some details:
It's a personal (as in single-author) public documentation / portfolio / blog sort of website
It seems to be hosted using Apache
The contents are static as far as I can…
Den
- 143
4
votes
2 answers
Tool to recursively convert an HTML file to PDF?
Are there any tools that not only convert an HTML file to PDF but also follow its links, so that in the end I get one(!) PDF file containing all the HTML files?
user27076
- 234
4
votes
1 answer
Finding pages on a webpage that contain a certain link
Google does a good job of finding relevant information.
Say I google: FDA's opinion on ISO-9001
Then it finds a link to a PDF on…
Norfeldt
- 266
3
votes
5 answers
Website crawler/spider to get site map
I need to retrieve a whole website map, in a format like:
http://example.org/
http://example.org/product/
http://example.org/service/
http://example.org/about/
http://example.org/product/viewproduct/
I need it to be link-based (no file or dir…
ack__
- 117
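One low-tech way to build a link-based map is to mirror the site (e.g. with `wget --recursive`) and then collect every `href` from the saved pages. The collection step, demonstrated on a stand-in file; this crude grep/sed pass assumes double-quoted absolute URLs, and quoting variations or relative links would need a real HTML parser:

```shell
# Collect unique hrefs from a saved page (a rough sketch, not a full crawler)
printf '<a href="http://example.org/product/">p</a>\n<a href="http://example.org/about/">a</a>\n<a href="http://example.org/product/">dup</a>\n' > index.html
grep -o 'href="[^"]*"' index.html | sed 's/^href="//;s/"$//' | sort -u
# prints:
# http://example.org/about/
# http://example.org/product/
```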