
Is it possible (are there any tools out there) to crawl the pages (not the content, just the URLs) that are behind a login? We are looking to create a new site, and we need to index each page on the old site in order to capture all the content, content types, map all URLs to the new site, etc. I have a login, and I'm not looking to add this to Google or anything.

Screaming Frog won't do it, and I can't involve the current site's dev team, so putting a script on the server won't work either. Is there any other way to do this?

Anne Stahl

3 Answers


Yes, you can. Integrate your crawler with Selenium: provide your login credentials and you can get your work done. A few good links that may help:

How to use Selenium with Python?

http://www.quora.com/Is-it-possible-to-write-a-Python-script-for-opening-a-browser-and-logging-into-a-website-How-could-you-do-it

https://selenium-python.readthedocs.org/en/latest/getting-started.html

It may take some time and research, but it can be done. Take care to avoid the logout link while crawling, or you'll end your own session.
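
To give a sense of what that integration looks like, here is a minimal, untested sketch: the domain, login path, form field names, submit selector, and credentials are placeholder assumptions you would replace with the old site's real values. It logs in once, then walks every on-site link and collects URLs only.

    import time
    from urllib.parse import urljoin, urlparse

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    BASE = "https://oldsite.example.com"  # placeholder domain

    driver = webdriver.Firefox()

    # Log in once; Selenium keeps the session cookies for the rest of the crawl.
    driver.get(urljoin(BASE, "/login"))  # placeholder login path
    driver.find_element(By.NAME, "username").send_keys("me@example.com")
    driver.find_element(By.NAME, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "form [type=submit]").click()
    time.sleep(2)  # crude wait for the redirect; a WebDriverWait is more robust

    # Walk every on-site link, collecting URLs only (no content).
    seen, queue = set(), [BASE]
    while queue:
        url = queue.pop()
        if url in seen or urlparse(url).netloc != urlparse(BASE).netloc:
            continue
        seen.add(url)
        driver.get(url)
        for a in driver.find_elements(By.TAG_NAME, "a"):
            href = (a.get_attribute("href") or "").split("#")[0]
            if href and "logout" not in href:  # don't crawl yourself out
                queue.append(href)

    driver.quit()
    print("\n".join(sorted(seen)))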

Community

A good option to explore is Scrapy. It's a Python-based framework for extracting the data you need from websites, and it can log in to a site and download the relevant data.

You can define and control what data you extract and how it is processed. It's also fast: by default it crawls and extracts data from 16 pages or more in parallel.
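
As a rough, untested sketch of how the login step might look (the domain, form field names, and credentials below are placeholders, not anything from the real site), a spider can submit the login form and then follow every on-site link, yielding only the URLs:

    import scrapy

    class UrlMapSpider(scrapy.Spider):
        name = "urlmap"
        allowed_domains = ["oldsite.example.com"]   # placeholder domain
        start_urls = ["https://oldsite.example.com/login"]

        def parse(self, response):
            # Submit the login form; Scrapy carries the session cookie forward.
            return scrapy.FormRequest.from_response(
                response,
                formdata={"username": "me@example.com", "password": "secret"},
                callback=self.after_login,
            )

        def after_login(self, response):
            # Record the URL only, then follow every on-site link. Scrapy's
            # built-in dupe filter keeps it from revisiting pages.
            yield {"url": response.url}
            for href in response.css("a::attr(href)").getall():
                if "logout" not in href:  # stay logged in
                    yield response.follow(href, callback=self.after_login)

Run it with something like: scrapy runspider urlmap_spider.py -o urls.csv, and the whole URL list lands in one file.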

Vidhu

Well, there is a workaround. You can use ExpertRec's custom search engine and set it up to crawl behind login pages. Here's a blog post with instructions: https://blog.expertrec.com/crawling-behind-login-authenticated-pages/

Though it's meant for building custom search engines, they offer a free trial, so you can set it up for free. And here's the workaround part: once the crawl is complete, they let you export all the indexed URLs, and boom, there's your list of all the pages behind the login.