What is a Googlebot and How Does it Work?

Quick Take

Are you looking for a tool that can crawl the Web to search, find, and fetch Web pages? If so, why not consider Googlebot? In this article, you will learn more about this Google user-agent as well as understand what functions and options it has for users.

Defining Googlebot

Googlebot is used to search the Internet. It is Google's Web crawling software, which scans the Web to find, add, and index new Web pages. In other words, “Googlebot is the name of the search engine spider for Google. Googlebot will visit sites which have been submitted to the index every once in a while to update its index.”

Note: Googlebot only follows HREF (“Hypertext Reference”) links, which indicate the URL being linked to, and SRC (“Source”) links. Starting from a list of webpage URLs, Googlebot uses Web-crawling robots to collect information and build a searchable index for Google’s Indexer.
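To make the idea of HREF and SRC links concrete, here is a minimal Python sketch of how a crawler might collect them from a page. It is an illustration only, not Google's implementation; the names LinkCollector and extract_links are made up for this example, and the host www.myhost.com is the placeholder used elsewhere in this article.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the values of href and src attributes found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Only HREF and SRC attributes are treated as links to follow.
            if name in ("href", "src") and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every HREF/SRC target in the HTML as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A crawler built this way would add each returned URL to its list of pages to visit, then repeat the process on every new page it fetches.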

The function of Googlebot

Googlebot functions as a search bot: it crawls the content on a site and interprets the robots.txt file the site owner has created (e.g., www.myhost.com/robots.txt). The search bots (robots) work by reading Web pages; they then make the content of those pages available to all Google services (done by Google’s caching proxy).

Note: Googlebot’s requests to Web servers carry a user-agent string containing “Googlebot,” and the host name of the requesting address contains “googlebot.com.”
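The two signals in the note above can be checked in a few lines of Python. This is a rough sketch, not an official verification routine: the function names are invented for this example, and the forward lookup in verify_by_reverse_dns is an extra precaution that this article does not describe. That last function needs a working network connection and DNS.

```python
import socket


def claims_to_be_googlebot(user_agent):
    """True if the user-agent string contains "Googlebot" (easy to fake)."""
    return "googlebot" in user_agent.lower()


def hostname_is_googlebot(hostname):
    """True if the host name is googlebot.com or one of its subdomains."""
    host = hostname.rstrip(".").lower()
    return host == "googlebot.com" or host.endswith(".googlebot.com")


def verify_by_reverse_dns(ip_address):
    """Look up the host name for an IP address and check that it matches.

    Assumption: the forward lookup below (host name back to IP) is an
    added safeguard and is not part of the article's description.
    """
    try:
        hostname = socket.gethostbyaddr(ip_address)[0]
        if not hostname_is_googlebot(hostname):
            return False
        return ip_address in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        # Covers failed lookups (socket.herror and socket.gaierror).
        return False
```

Because anyone can send a user-agent string containing “Googlebot,” the host name check is the more trustworthy of the two.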

Benefits: Googlebot finds Web pages on the Internet. By default, search robots will access any file in the root directory and all its subdirectories. Site owners can use a robots.txt file to control search engine spiders (programs that travel the Web), allowing or disallowing them from retrieving the pages of a Web site.

How to use Googlebot

Current version: Googlebot 2.1

Tag: Googlebot/2.1 (+https://www.googlebot.com/bot.html)

Switching User-Agent to Googlebot: Firefox extension (User-agent switcher)

Verifying Googlebot

IP address range:

from 66.249.64.0 to 66.249.95.255 (googlebot.com)
(as of May 2008)
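A quick way to test whether a visitor's address falls inside the range listed above is Python's standard ipaddress module. This is only a sketch: the function names are made up for this example, and the range is the May 2008 figure quoted in this article, which may no longer be complete or current.

```python
import ipaddress

# Range quoted in this article (as of May 2008); treat it as an assumption.
FIRST = ipaddress.ip_address("66.249.64.0")
LAST = ipaddress.ip_address("66.249.95.255")


def in_listed_range(ip):
    """True if the IPv4 address lies between FIRST and LAST inclusive."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IP address at all
    return addr.version == 4 and FIRST <= addr <= LAST


def range_as_networks():
    """Express the same range in CIDR notation, e.g. for firewall rules."""
    return [str(net) for net in ipaddress.summarize_address_range(FIRST, LAST)]
```

An address check like this is best combined with the host name check, since address ranges change over time.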

Tips: For Googlebot to function fully, allow the bots (spiders) all the access they need.

Reminders: Ensure the Prevent Spiders option is set to true in your admin sessions settings.

Updates/changes to Googlebot: check the .txt file (such as “robots.txt”) for content.

How to Allow/Disallow Googlebot (manually):

To Allow Googlebot

User-agent: Googlebot
Allow: / (or list a directory or page that you want to allow)

To Block Googlebot

User-agent: Googlebot
Disallow: / (or list a directory or page that you want to disallow)

How to create a robots.txt file using the Generate robots.txt tool (in 5 steps):

(1) Users must go to the Webmaster Tools Home page and click the site they want.

(2) Under Site configuration, click Crawler access.

(3) Click the Generate robots.txt tab to allow robot access, or in the Action list, select “Disallow” to block Googlebot from all files and directories on your site.

(4) In the Files or directories box, type /. Click Add. This will allow your robots.txt file to be automatically generated.

(5) Save your robots.txt file (Note: It must reside in the root of the domain and must be named “robots.txt”.)
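If you prefer to build the file yourself, the output of these steps can be approximated with a short Python function. This is a sketch of the file format only, not the Webmaster Tools generator; build_robots_txt and its parameters are names invented for this example.

```python
def build_robots_txt(user_agent, action, paths):
    """Return robots.txt text with one Allow/Disallow line per path."""
    if action not in ("Allow", "Disallow"):
        raise ValueError("action must be 'Allow' or 'Disallow'")
    lines = [f"User-agent: {user_agent}"]
    lines += [f"{action}: {path}" for path in paths]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Block Googlebot from every file and directory on the site.
    print(build_robots_txt("Googlebot", "Disallow", ["/"]), end="")
```

Save the resulting text as robots.txt in the root of the domain, as described in step 5.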

To ensure the robots.txt tool is working properly, test it! Here’s how:

(1) Go to the Webmaster Tools Home page and click the site you want.

(2) Under Site configuration, click Crawler access. If it’s not already selected, click the Test robots.txt tab.

(3) Copy the content of your robots.txt file and paste it into the first box. In the URLs box, list the site URLs to test it against.
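The same kind of test can be run locally with Python's standard urllib.robotparser module, which reads robots.txt rules and reports whether a given user-agent may fetch a URL. This is a sketch of the idea, not the Webmaster Tools tester, and Python's parser may not match Googlebot's behavior in every edge case; check_urls is a name invented for this example.

```python
from urllib.robotparser import RobotFileParser


def check_urls(robots_txt, user_agent, urls):
    """Map each URL to True (allowed) or False (blocked) for the user-agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(user_agent, url) for url in urls}


if __name__ == "__main__":
    rules = "User-agent: Googlebot\nDisallow: /private/\n"
    pages = [
        "http://www.myhost.com/index.html",
        "http://www.myhost.com/private/notes.html",
    ]
    for url, allowed in check_urls(rules, "Googlebot", pages).items():
        print("allowed" if allowed else "blocked", url)
```

Paste in your own robots.txt content and a handful of URLs from your site to confirm the rules do what you expect before uploading the file.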

Creating the robots.txt file and saving tips: In the Robot list, be sure to click Googlebot, and in the User-agents list, select the user-agents you want. “To save any changes, you’ll need to copy the contents and paste them into your robots.txt file.”

“Writing a robots.txt file is, as you have seen, a relatively simple matter. However it is important to bear in mind that it is not a security method. It may stop your specified pages from appearing in search engines, but it will not make them unavailable. There are many hundreds of bots and spiders crawling the Internet now and while most will respect your robot.txt file, some will not and there are even some designed specifically to visit the very pages you are specifying as being out of bounds.”

Note: Through Googlebot, users can check out their own Web site as seen by Google. See how it works by clicking on this link: View a Web Page as ‘Googlebot’.

Pros and Cons of Googlebot

Pros:

- It can quickly build a list of links that come from the Web.

- It recrawls popular frequently-changing web pages to keep the index current.

Cons:

- It only follows HREF links and SRC links.

- It takes up an enormous amount of bandwidth.

- Some pages may take longer to find, so crawling may occur once a month instead of daily.

- It must be set up/programmed to function properly.

Other Googlebot Options

Googlebot-Mobile

- crawls pages for Google’s mobile index

Googlebot-Image

- crawls pages for Google’s image index

Mediapartners-Google

- crawls pages for AdSense content/ads

Adsbot-Google

- crawls landing pages to check their quality for Google AdWords


Image: Tech-FAQ - Googlebot Image
