Questions tagged [web-crawler]

77 questions
30
votes
5 answers

Convert web pages to one file for ebook

I want to download HTML pages (example: http://www.brpreiss.com/books/opus6/) and join them into one HTML file or some other format that I can use on an ebook reader. Sites with free books don't have standard paging; they're not blogs or forums, so I don't know how to…
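A rough sketch of one way to do it, assuming the chapters are plain HTML pages reachable from that index and that pandoc is available (the output name and the alphabetical file order are guesses and may not match the intended chapter order):

    # mirror only the HTML pages of the book
    wget --recursive --no-parent --accept html,htm --convert-links http://www.brpreiss.com/books/opus6/
    # concatenate the mirrored pages into a single EPUB
    pandoc www.brpreiss.com/books/opus6/*.html -o opus6.epub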
26
votes
2 answers

How to crawl using wget to download ONLY HTML files (ignore images, css, js)

Essentially, I want to crawl an entire site with Wget, but I need it to NEVER download other assets (e.g. images, CSS, JS). I only want the HTML files. Google searches are completely useless. Here's a command I've tried: wget…
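One hedged starting point is wget's reject list, so the recursive crawl keeps following pages while skipping the usual asset extensions (example.com is a placeholder; extend the list to match the site):

    wget --recursive --no-parent --reject "jpg,jpeg,png,gif,svg,ico,css,js" https://example.com/

Using --accept html,htm instead also works, but it can skip pages that are served without an .html extension.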
21
votes
1 answer

How to save all files/links from a telegram chat/channel?

I want to save ALL http(s) links and/or files posted to some Telegram chat (private or group) or channel (like a mailing list). I need an analog of TumblOne (for Tumblr), VkOpt (able to save chat history in vk.com), or jDownloader (for file…
17
votes
4 answers

Using Wget to Recursively Crawl a Site and Download Images

How do you instruct wget to recursively crawl a website and only download certain types of images? I tried using this to crawl a site and only download JPEG images: wget --no-parent --wait=10 --limit-rate=100K --recursive --accept=jpg,jpeg…
Cerin
  • 9,652
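A sketch of a full command along those lines; the placeholder domain, the robots override, and the wait/rate limits are assumptions to adjust. Note that wget still has to fetch the HTML pages in order to find links, then deletes anything not matching the accept list:

    wget --recursive --no-parent --wait=10 --limit-rate=100K \
         --accept jpg,jpeg --no-directories -e robots=off \
         https://example.com/gallery/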
14
votes
2 answers

Why is @ in email addresses sometimes written as [at] on webpages?

Why is @ on webpages sometimes written as [at]? Is there a specific reason for it?
Sai
  • 167
12
votes
4 answers

How "legal" is site-scraping using cURL?

Recently I was experimenting with cURL, and I found that a lot is possible with it. I built a small script that crawls a music site which plays songs online. In the course of that experiment, I found that it is possible to crawl the song source as well…
7
votes
1 answer

wget: recursively retrieve URLs from a specific website

I'm trying to recursively retrieve all possible URLs (internal page URLs) from a website. Can you please help me out with wget? Or is there a better alternative to achieve this? I do not want to download any content from the website, but just…
abhiomkar
  • 171
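A possible wget-only pipeline: spider mode walks the links without saving pages, and the fetch log can be filtered down to a unique URL list (example.com stands in for the real site):

    wget --spider --recursive --no-parent https://example.com/ 2>&1 \
      | grep -oE 'https?://[^ "]+' | sort -u > urls.txt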
6
votes
3 answers

Is it possible to discover all the files and sub-directories of a URL?

I wonder if there is software I can use to discover all the files and sub-directories of a given URL. For example, given www.some-website.com/some-directory/, I would like to find all the files in the /some-directory/ directory as well as all…
Mark
  • 63
6
votes
4 answers

What do I use to download all PDFs from a website?

I need to download all the PDF files present on a site. Trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of the files. I tried WinHTTrack,…
user385496
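As a hedged alternative to WinHTTrack, wget's recursive mode with an accept filter is often enough; the domain is a placeholder, and wget will still fetch HTML pages to follow their links before discarding them:

    wget --recursive --level=inf --no-parent --accept pdf \
         --no-directories --directory-prefix=pdfs https://example.com/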
4
votes
2 answers

How can I scrape specific data from a website

I'm trying to scrape data from a website for research. The URLs are nicely organized in an example.com/x format, with x as an ascending number, and all of the pages are structured in the same way. I just need to grab certain headings and a few…
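A minimal sketch in the same spirit, assuming the pages run from 1 to 100 and the headings of interest are plain <h2> tags (both assumptions would need adjusting to the real site):

    for i in $(seq 1 100); do
      # fetch each numbered page and keep only the heading markup
      curl -s "https://example.com/$i" \
        | grep -oE '<h2[^>]*>[^<]*</h2>' >> headings.txt
    done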
4
votes
1 answer

Extract data from an online atlas

There is an online atlas that I would like to extract values from. The atlas provides a tool ('Query') to extract values when you click a location or enclose a region on the map, or you can specify the latitude/longitude of a point where you want…
KAE
  • 1,919
4
votes
2 answers

Why can't website copy tools like Cyotek WebCopy and HTTrack find files that search engines like Google can?

I would like to keep the target website private, but here are some details: it's a personal (as in single-author) public documentation/portfolio/blog sort of website; it seems to be hosted using Apache; the contents are static as far as I can…
Den
  • 143
4
votes
2 answers

Tool to recursively convert an HTML file to PDF?

Are there any tools which not only convert an HTML file to PDF but also follow its links, so that in the end I get one (!) PDF file which contains all the HTML files?
user27076
  • 234
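One possible two-step sketch: mirror the pages with wget, then hand every HTML file to wkhtmltopdf, which accepts multiple input pages and writes one combined PDF (the find | sort ordering is a guess and may not match the intended reading order):

    wget --recursive --no-parent --convert-links https://example.com/docs/
    wkhtmltopdf $(find example.com/docs -name '*.html' | sort) combined.pdf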
4
votes
1 answer

Finding pages on a webpage that contain a certain link

Google does a good job of finding relevant information. Say I google: FDA's opinion on ISO-9001. Then it finds a link to a PDF on…
Norfeldt
  • 266
3
votes
5 answers

Website crawler/spider to get site map

I need to retrieve a whole website map, in a format like: http://example.org/ http://example.org/product/ http://example.org/service/ http://example.org/about/ http://example.org/product/viewproduct/ I need it to be link-based (no file or dir…
ack__
  • 117