written by: Daniel Brecht•edited by: Brian Nelson•updated: 10/26/2011
Are you looking for a tool that can crawl the Web to search, find, and fetch Web pages? If so, why not consider Googlebot? In this article, you will learn more about this Google user-agent as well as understand what functions and options it has for users.
Googlebot is Google's Web-crawling software; it scans the Web to find, fetch, and index new and updated pages. In other words, "Googlebot is the name of the search engine spider for Google. Googlebot will visit sites which have been submitted to the index every once in a while to update its index."
Note: Googlebot only follows HREF ("Hypertext Reference") links, which indicate the URL being linked to, and SRC ("Source") links. Starting from a list of webpage URLs, Googlebot collects the information used to build a searchable index for Google's indexer.
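To make the note above concrete, here is a minimal sketch of how a crawler could collect HREF and SRC link targets from a fetched page, using only Python's standard library. The sample HTML and base URL are illustrative, not anything Googlebot actually uses.

```python
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect HREF and SRC targets, the two link types Googlebot follows."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # href appears on <a> and <link>; src on <img>, <script>, <iframe>, etc.
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

html = '<a href="/about.html">About</a> <img src="logo.png">'
collector = LinkCollector("http://www.example.com/")
collector.feed(html)
print(collector.links)
# ['http://www.example.com/about.html', 'http://www.example.com/logo.png']
```

A real crawler would then fetch each collected URL in turn, repeating the process to build its index.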
The function of Googlebot
Googlebot functions as a search bot: it crawls the content of a site and interprets the site owner's robots.txt file (e.g., www.myhost.com/robots.txt). The crawler works by reading Web pages and then making their content available to all Google services (through Google's caching proxy).
Note: Googlebot identifies itself to Web servers with a user-agent string containing "Googlebot," and its requests come from host addresses containing "googlebot.com."
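As a sketch of how a site owner might spot Googlebot in a server access log, the snippet below checks the user-agent field of a log line. The log line itself is a made-up example in the common combined log format; note that a user-agent string can be spoofed, so a stricter check would also confirm that the client IP's reverse-DNS name falls under googlebot.com, as the note above suggests.

```python
# Hypothetical access-log line (combined log format); IP, URL and UA are examples.
log_line = ('66.249.66.1 - - [26/Oct/2011:10:00:00 +0000] "GET /robots.txt HTTP/1.1" '
            '200 124 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')

def is_googlebot_request(line: str) -> bool:
    """Cheap first-pass check: does the user-agent field name Googlebot?

    This only inspects the UA string, which anyone can fake; a thorough
    check also verifies the client IP's reverse-DNS hostname.
    """
    # In the combined log format the user-agent is the last quoted field.
    user_agent = line.rsplit('"', 2)[-2]
    return "Googlebot" in user_agent

print(is_googlebot_request(log_line))  # True
```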
Benefits: Googlebot finds Web pages across the Internet. By default, search robots will access any file in the root directory and all of its subdirectories. Of course, users can create a robots.txt file to control search engine spiders (programs that travel the Web) and decide whether they may retrieve every page from a Web site. To generate one in Webmaster Tools:
(1) Users must go to the Webmaster Tools Home page and click the site they want.
(2) Under Site configuration, click Crawler access.
(3) Click the Generate robots.txt tab to allow robot access, or, in the Action list, select "Disallow" to block Googlebot from all files and directories on your site.
(4) In the Files or directories box, type /, then click Add. Your robots.txt file will be generated automatically.
(5) Save your robots.txt file (Note: It must reside in the root of the domain and must be named "robots.txt".)
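The file the steps above generate can also be written by hand in any text editor. A minimal example matching step (4), which blocks Googlebot from the whole site while leaving other crawlers unrestricted (the rules are illustrative):

```
# Block Googlebot from all files and directories
User-agent: Googlebot
Disallow: /

# All other crawlers may access everything
User-agent: *
Disallow:
```

Either way, the finished file must be saved as "robots.txt" in the root of the domain, exactly as step (5) notes.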
To ensure the robots.txt file is working properly, test it! Here's how:
(1) Go to the Webmaster Tools Home page and click the site you want.
(2) Under Site configuration, click Crawler access. If it's not already selected, click the Test robots.txt tab.
(3) Copy the content of your robots.txt file and paste it into the first box. In the URLs box, list the URLs of the site to test it against.
Tips for creating and saving the robots.txt file: In the Robot list, be sure to click Googlebot, and in the User-agents list, select the user-agents you want. "To save any changes, you'll need to copy the contents and paste them into your robots.txt file."
"Writing a robots.txt file is, as you have seen, a relatively simple matter. However it is important to bear in mind that it is not a security method. It may stop your specified pages from appearing in search engines, but it will not make them unavailable. There are many hundreds of bots and spiders crawling the Internet now and while most will respect your robot.txt file, some will not and there are even some designed specifically to visit the very pages you are specifying as being out of bounds."
Note: Through Googlebot, users can check out their own Web site as seen by Google. See how it works by clicking this link: View a Web Page as 'Googlebot'.
Pros and Cons of Googlebot
Pros:
- It can quickly build a list of links from across the Web.
- It recrawls popular, frequently changing Web pages to keep the index current.

Cons:
- It only follows HREF links and SRC links.
- It takes up an enormous amount of bandwidth.
- Some pages may take longer to find, so crawling may occur once a month instead of daily.
- It must be set up and configured to function properly.