2

I notice that in some cases paywalled news articles seem to have been indexed by Google, because excerpts from the story appear in the search hit.

However, when I visit these websites using a Googlebot (robot) identity, the article content is not there to crawl. This would seem to suggest that the publisher is somehow submitting its paywalled articles (and associated URLs) to Google rather than having them crawled. Obviously such a submission would be non-trivial, because it would have to include both the content of the article and various metadata about it, such as the URL where it is located and its expiration date.

Does such a mechanism exist? If so, can an ordinary webmaster such as myself use it?

harrymc
Tyler Durden

2 Answers

1

Yes, it is possible

Google has a page called Get your content on Google, which, as of today, 21 May 2018, is a comprehensive reference for how to get your content indexed by Google. It contains various links that you might want to try, including:

  • Add your URL
  • App crawling
  • Search Console
  • Search Engine Optimization (SEO) Starter Guide

This answer was posted as a comment by @acejavelin two years and one month ago. Perhaps the linked page was not as comprehensive then as it is today; otherwise I don't see why he/she didn't post it as a full answer. I also see that the OP deemed this page "too meta" at the time, but today it is exactly what he/she wants.

Websites can detect bogus Googlebots

Websites sometimes prevent their content from being fetched by web browsers that use bogus Googlebot user agent strings. You can find more information about this subject on the Panopticlick website of the Electronic Frontier Foundation. In short, Googlebot has other identifying features besides its user agent string.

0

The fact that the company's webserver has returned the infamous HTTP error 404 for a URL does not mean that the resource does not exist. It only means that the webserver has decided that, for you, this resource does not exist.

The webserver can identify you as a paying customer by many methods, chief among them an identifying HTTP cookie stored in your browser. When the cookie is not found, the webserver will usually ask you to log in and, if the login succeeds, will then return that cookie.
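As a rough illustration of the cookie check described above, here is a minimal Python sketch. The cookie name `session_id`, the in-memory `ACTIVE_SESSIONS` store, and the function `handle_article_request` are all hypothetical names chosen for the example; a real site would use its own session mechanism and would often redirect to a login page rather than answer 404.

```python
from http.cookies import SimpleCookie

# Hypothetical in-memory session store: session id -> subscriber name.
ACTIVE_SESSIONS = {"abc123": "alice"}


def handle_article_request(cookie_header, article_body):
    """Return (status, body) for a request to a paywalled article.

    cookie_header is the raw value of the request's Cookie header
    (an empty string or None when the browser sent no cookies).
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header or "")
    morsel = cookies.get("session_id")  # assumed cookie name
    if morsel is not None and morsel.value in ACTIVE_SESSIONS:
        # Known paying customer: serve the article.
        return 200, article_body
    # Unknown visitor: the article exists, but the server acts as if it does not.
    return 404, "Not Found"
```

Called with `"session_id=abc123"` the sketch serves the article; called with no cookie, or with an unknown session id, it answers 404 even though the article is there.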

The question is then: why is Googlebot allowed access, but you are not?

Googlebot will eventually discover almost any website, but the webmaster can request an early visit by using the tools contained in Get your content on Google. The webmaster can also control which folders the bot may crawl by using a robots.txt file.

An example of such a file is:

User-agent: googlebot
User-agent: google
User-agent: bingbot
User-agent: bing
Disallow: /bedven/bedrijf/
Crawl-delay: 10

User-agent: *
Disallow: /
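To see how a well-behaved crawler might read the rules above, one option is Python's standard `urllib.robotparser` module. The sketch below feeds it the same file; the helper `may_crawl` and the test paths such as `/news/story.html` are made up for illustration, and other crawlers may interpret details (such as `Crawl-delay`) differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this answer, verbatim.
ROBOTS_TXT = """\
User-agent: googlebot
User-agent: google
User-agent: bingbot
User-agent: bing
Disallow: /bedven/bedrijf/
Crawl-delay: 10

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(agent, path):
    """Return True if this parser thinks `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Listed bots are kept out of one folder only; all other bots are kept out entirely.
    for agent in ("Googlebot", "SomeOtherBot"):
        for path in ("/bedven/bedrijf/page.html", "/news/story.html"):
            print(agent, path, may_crawl(agent, path))
```

Keep in mind that robots.txt is only a request to crawlers; it does not by itself stop anyone from fetching a page.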

The bot identifies itself by sending a User-Agent header in the HTTP request, for example one containing Googlebot.

However, assuming the identity of Googlebot is not an easy matter. The website can easily verify the bot's identity by doing a reverse DNS lookup on the accessing IP address. The returned host name must in that case end in googlebot.com or google.com, which is something that you yourself cannot fake.
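The reverse DNS check can be sketched in Python roughly as follows. The function name `is_verified_googlebot` and its injectable lookup parameters are assumptions made for the example (they let the logic be tried without network access). The sketch also adds a forward lookup to confirm that the returned host name resolves back to the same IP address, a step Google's own guidance recommends but which is not described above; `socket.gethostbyname_ex` only covers IPv4.

```python
import socket

# Domains the reverse lookup is expected to end in, per the text above.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse plus forward DNS check."""
    # Default to real DNS lookups; tests can inject fakes instead.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        hostname = reverse_lookup(ip)
    except OSError:  # no PTR record or lookup failure
        return False

    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(GOOGLE_CRAWLER_DOMAINS):
        return False

    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False

    # The name must resolve back to the address that made the request.
    return ip in addresses
```

A server would call this with the remote address of any request whose User-Agent claims to be Googlebot, and treat the request as an ordinary visitor when the check fails.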

If you fully control your webserver, for example via PHP, you can duplicate this mechanism and create what is called a "membership website". Such software is called Membership Software.

If you are not a PHP programmer, or are unwilling to invest the time, there are many open-source alternatives, as well as lots of commercial products that will compete for your business. Be very critical if you decide to choose one, and check it thoroughly on the web for reviews.

For more information see these resources that I found via a search (not necessarily the best ones, and some are quite commercial in nature, but they will get you started):

harrymc