Questions tagged [web-crawler]

77 questions
30
votes
5 answers

Convert web pages to one file for ebook

I want to download HTML pages (example: http://www.brpreiss.com/books/opus6/) and join them into one HTML file or some other format that I can use on an ebook reader. Sites with free books don't have standard paging; they're not blogs or forums, so I don't know how to…
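A rough sketch of one way to do it, assuming the chapters are plain HTML pages reachable from that index and that pandoc is available (the output name and the alphabetical file order are guesses and may not match the intended chapter order):

    # mirror only the HTML pages of the book
    wget --recursive --no-parent --accept html,htm --convert-links http://www.brpreiss.com/books/opus6/
    # concatenate the mirrored pages into a single EPUB
    pandoc www.brpreiss.com/books/opus6/*.html -o opus6.epub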
26
votes
2 answers

How to crawl using wget to download ONLY HTML files (ignore images, css, js)

Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (e.g. images, CSS, JS). I only want the HTML files. Google searches are completely useless. Here's a command I've tried: wget…
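One hedged starting point is wget's reject list, so the recursive crawl keeps following pages while skipping the usual asset extensions (example.com is a placeholder; extend the list to match the site):

    wget --recursive --no-parent --reject "jpg,jpeg,png,gif,svg,ico,css,js" https://example.com/

Using --accept html,htm instead also works, but it can skip pages that are served without an .html extension.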
21
votes
1 answer

How to save all files/links from a telegram chat/channel?

I want to save ALL http(s) links and/or files posted to some Telegram chat (private or group) or channel (like a mailing list). I need an analog of TumblOne (for Tumblr), VkOpt (able to save chat history in vk.com), or jDownloader (for file…
17
votes
4 answers

Using Wget to Recursively Crawl a Site and Download Images

How do you instruct wget to recursively crawl a website and only download certain types of images? I tried using this to crawl a site and only download JPEG images: wget --no-parent --wait=10 --limit-rate=100K --recursive --accept=jpg,jpeg…
Cerin
  • 9,652
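A sketch of a full command along those lines; the placeholder domain, the robots override, and the wait/rate limits are assumptions to adjust. Note that wget still has to fetch the HTML pages in order to find links, then deletes anything not matching the accept list:

    wget --recursive --no-parent --wait=10 --limit-rate=100K \
         --accept jpg,jpeg --no-directories -e robots=off \
         https://example.com/gallery/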
14
votes
2 answers

Why is @ in email addresses sometimes written as [at] on webpages?

Why is @ on webpages sometimes written as [at]? Is there a specific reason for it?
Sai
  • 167
12
votes
4 answers

How "legal" is site-scraping using cURL?

Recently I was experimenting with cURL, and I found that a lot is possible with it. I built a small script that crawls a music site which plays songs online. In the course of that experiment, I found that it is possible to crawl the song source as well…
7
votes
1 answer

wget: recursively retrieve URLs from a specific website

I'm trying to recursively retrieve all possible URLs (internal page URLs) from a website. Can you please help me out with wget? Or is there a better alternative to achieve this? I do not want to download any content from the website, but just…
abhiomkar
  • 171
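A possible wget-only pipeline: spider mode walks the links without saving pages, and the fetch log can be filtered down to a unique URL list (example.com stands in for the real site):

    wget --spider --recursive --no-parent https://example.com/ 2>&1 \
      | grep -oE 'https?://[^ "]+' | sort -u > urls.txt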
6
votes
3 answers

Is it possible to discover all the files and sub-directories of a URL?

I wonder if there is software I can use to discover all the files and sub-directories of a given URL. For example, given www.some-website.com/some-directory/, I would like to find all the files in the /some-directory/ directory as well as all…
Mark
  • 63
6
votes
4 answers

What do I use to download all PDFs from a website?

I need to download all the PDF files present on a site. Trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of the files. I tried WinHTTrack,…
user385496
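As a hedged alternative to WinHTTrack, wget's recursive mode with an accept filter is often enough; the domain is a placeholder, and wget will still fetch HTML pages to follow their links before discarding them:

    wget --recursive --level=inf --no-parent --accept pdf \
         --no-directories --directory-prefix=pdfs https://example.com/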
4
votes
2 answers

How can I scrape specific data from a website

I'm trying to scrape data from a website for research. The URLs are nicely organized in an example.com/x format, with x as an ascending number, and all of the pages are structured in the same way. I just need to grab certain headings and a few…
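A minimal sketch in the same spirit, assuming the pages run from 1 to 100 and the headings of interest are plain <h2> tags (both assumptions would need adjusting to the real site):

    for i in $(seq 1 100); do
      # fetch each numbered page and keep only the heading markup
      curl -s "https://example.com/$i" \
        | grep -oE '<h2[^>]*>[^<]*</h2>' >> headings.txt
    done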
4
votes
1 answer

Extract data from an online atlas

There is an online atlas that I would like to extract values from. The atlas provides a tool ('Query') to extract values when you click a location or enclose a region on the map, or you can specify the latitude/longitude of a point where you want…
KAE
  • 1,919
4
votes
2 answers

Why can't website copy tools like Cyotek WebCopy and HTTrack find files that search engines like Google can?

I would like to keep the target website private, but here are some details: it's a personal (as in single-author) public documentation/portfolio/blog sort of website; it seems to be hosted using Apache; the contents are static as far as I can…
Den
  • 143
4
votes
2 answers

Tool to recursively convert an HTML file to PDF?

Are there any tools which not only convert an HTML file to PDF but also follow its links, so that in the end I get one (!) PDF file which contains all the HTML files?
user27076
  • 234
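One possible two-step sketch: mirror the pages with wget, then hand every HTML file to wkhtmltopdf, which accepts multiple input pages and writes one combined PDF (the find | sort ordering is a guess and may not match the intended reading order):

    wget --recursive --no-parent --convert-links https://example.com/docs/
    wkhtmltopdf $(find example.com/docs -name '*.html' | sort) combined.pdf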
4
votes
1 answer

Finding pages on a webpage that contain a certain link

Google does a good job of finding relevant information. Say I google: FDA's opinion on ISO-9001. Then it finds a link to a PDF on…
Norfeldt
  • 266
3
votes
5 answers

Website crawler/spider to get site map

I need to retrieve a whole website map, in a format like: http://example.org/ http://example.org/product/ http://example.org/service/ http://example.org/about/ http://example.org/product/viewproduct/ I need it to be link-based (no file or dir…
ack__
  • 117