What is a Googlebot and How Does it Work?


Defining Googlebot

Googlebot is Google’s Web-crawling software, used to search the Internet: it scans pages to find, add, and index new Web pages. In other words, “Googlebot is the name of the search engine spider for Google. Googlebot will visit sites which have been submitted to the index every once in a while to update its index.”

Note: Googlebot only follows HREF (“Hypertext Reference”) links, which indicate the URL being linked to, and SRC (“Source”) links. Starting from a list of webpage URLs, Googlebot’s Web-crawling robots collect information from each page to build a searchable index for Google’s indexer.
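
As a rough illustration of that process, here is a minimal sketch (in Python, using only the standard library) of how a crawler collects HREF and SRC links from a page. It is a toy, not Google’s actual crawler, and the example.com URL is just a placeholder:

    import urllib.request
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        """Collects href and src attribute values, the two link types Googlebot follows."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ("href", "src") and value:
                    self.links.append(value)

    # Placeholder URL; substitute a page you are allowed to fetch.
    with urllib.request.urlopen("https://www.example.com/") as response:
        page = response.read().decode("utf-8", errors="replace")

    collector = LinkCollector()
    collector.feed(page)
    print(collector.links)  # the URLs a crawler would queue for its next pass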

The function of Googlebot

Googlebot functions as a search bot that crawls the content of a site and interprets the site owner’s robots.txt file (e.g., www.myhost.com/robots.txt). These search bots (robots) work by reading Web pages; they then make the contents of those pages available to all Google services (via Google’s caching proxy).

Note: Googlebot’s requests to Web servers carry a user-agent string containing “Googlebot,” and the host names of the addresses it crawls from contain “googlebot.com.”

Benefits: Googlebot finds Web pages across the Internet. By default, search robots will access any file in the root directory and all of its subdirectories. Of course, users can write a robots.txt file to control search engine spiders (programs that travel the Web) and allow or disallow them from retrieving particular pages of a Web site.

How to use Googlebot

Current version: Googlebot 2.1

Tag: Googlebot/2.1 (+https://www.googlebot.com/bot.html)

Switching your User-Agent to Googlebot: use a Firefox extension (User Agent Switcher).
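
If you prefer a script to a browser extension, the same user-agent switch can be done in a few lines. This is a minimal sketch using Python’s standard library; the example.com URL is a placeholder, and the user-agent value is the tag quoted above:

    import urllib.request

    # Send a request that identifies itself with Googlebot's user-agent tag.
    request = urllib.request.Request(
        "https://www.example.com/",
        headers={"User-Agent": "Googlebot/2.1 (+https://www.googlebot.com/bot.html)"},
    )
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8", errors="replace")[:500])  # first 500 characters

Many servers vary their response by user-agent, which is exactly why this trick is useful for checking what a crawler is served.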

Verifying Googlebot

IP address range:

  • from 66.249.64.0 to 66.249.95.255 (googlebot.com)
    (as of May 2008)
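
Because a user-agent string can be spoofed (as shown above), a more reliable check is DNS-based: do a reverse lookup on the requesting IP address, confirm the host name ends in googlebot.com, then do a forward lookup to confirm the name maps back to the same IP. A minimal sketch, assuming Python’s standard library:

    import socket

    def is_googlebot(ip_address):
        """Verify a visitor claiming to be Googlebot via reverse + forward DNS."""
        try:
            host, _, _ = socket.gethostbyaddr(ip_address)  # reverse lookup
        except socket.herror:
            return False
        if not host.endswith("googlebot.com"):
            return False
        try:
            # Forward lookup must resolve back to the original address.
            return socket.gethostbyname(host) == ip_address
        except socket.gaierror:
            return False

    # An address from the range listed above; the result depends on live DNS.
    print(is_googlebot("66.249.66.1"))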

Tips: For Googlebot to function fully, allow the bots (spiders) all the access they need.

Reminders: Ensure the Prevent Spiders option is set to true in your admin session settings.

Updates/changes to Googlebot: check your .txt file (such as “robots.txt”) to confirm its contents are still correct.

How to Allow/Disallow Googlebot (manually):

  • To allow Googlebot:
    User-agent: Googlebot
    Allow: / (or list a directory or page that you want to allow)
  • To block Googlebot:
    User-agent: Googlebot
    Disallow: / (or list a directory or page that you want to disallow)
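
Before deploying rules like these, you can check how they will be interpreted. Python’s standard-library urllib.robotparser evaluates robots.txt rules against a given user-agent; this sketch feeds it the block-one-directory variant of the rules above (the /private/ path is a made-up example):

    from urllib.robotparser import RobotFileParser

    # Rules in the format shown above: block Googlebot from one directory.
    rules = """
    User-agent: Googlebot
    Disallow: /private/
    """

    parser = RobotFileParser()
    parser.parse(rules.splitlines())

    # can_fetch() reports whether the named user-agent may retrieve a URL.
    print(parser.can_fetch("Googlebot", "https://www.example.com/private/page.html"))  # False
    print(parser.can_fetch("Googlebot", "https://www.example.com/index.html"))         # True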

How to create a robots.txt file using the Generate robots.txt tool (in 5 steps):

(1) Go to the Webmaster Tools Home page and click the site you want.

(2) Under Site configuration, click Crawler access.

(3) Click the Generate robots.txt tab to allow robot access, or, in the Action list, select “Disallow” to block Googlebot from all files and directories on your site.

(4) In the Files or directories box, type /. Click Add. The code for your robots.txt file will be generated automatically.

(5) Save your robots.txt file (Note: It must reside in the root of the domain and must be named “robots.txt”.)

To ensure the robots.txt file is working properly, test it! Here’s how:

(1) Go to the Webmaster Tools Home page and click the site you want.

(2) Under Site configuration, click Crawler access. If it’s not already selected, click the Test robots.txt tab.

(3) Copy the content of your robots.txt file and paste it into the first box. In the URLs box, list the site URLs to test it against.

Tips for creating and saving the robots.txt file: In the Robot list, be sure to click Googlebot, and in the User-agents list, select the user-agents you want. “To save any changes, you’ll need to copy the contents and paste them into your robots.txt file.”

“Writing a robots.txt file is, as you have seen, a relatively simple matter. However it is important to bear in mind that it is not a security method. It may stop your specified pages from appearing in search engines, but it will not make them unavailable. There are many hundreds of bots and spiders crawling the Internet now and while most will respect your robot.txt file, some will not and there are even some designed specifically to visit the very pages you are specifying as being out of bounds.”

Note: Through Googlebot, users can check out their own Web site as seen by Google. See how it works by clicking on this link: View a Web Page as ‘Googlebot’.

Pros and Cons of Googlebot

Pros:

- It can quickly build a list of links that come from the Web.

- It recrawls popular frequently-changing web pages to keep the index current.

Cons:

- It only follows HREF links and SRC links.

- It takes up an enormous amount of bandwidth.

- Some pages may take longer to find, so crawling may occur once a month instead of daily.

- It must be set up/programmed to function properly.

Other Googlebot Options

  • Googlebot-Mobile – crawls pages for Google’s mobile index

  • Googlebot-Image – crawls pages for Google’s image index

  • Mediapartners-Google – crawls pages for AdSense content/ads

  • Adsbot-Google – crawls pages to check for Google AdWords
