11 Feb 2010

How the Googlebot Sees and Ranks Different File Formats

What does Googlebot see?

Something that can be very helpful when you are designing and refining your website is knowing what it “looks like” to the bots that crawl the web and index your pages. If your site doesn’t give the bots the information they need to understand your content and graphics, they can’t do a very good job of indexing your pages.

If you use Firefox, you can download and install the “User Agent Switcher” extension. You’ll have to restart Firefox once it’s installed. Then, in Firefox, go to Tools, then User Agent Switcher, then Options, then Options again. In the User Agent Switcher window that comes up, select User Agents and click “Add.”

In the Description box, type something like “Google Bot” and in the User Agent box, paste this:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

In the App Name box type Googlebot, then click OK. Now, any time you want to view one of your pages as if you were the Google bot, you go to Tools, User Agent Switcher, Googlebot.

You might have to block cookies to view some sites, and you can do this in Tools, Options, Privacy, Exceptions (then add the URL).
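If you’d rather test from the command line than through a browser extension, here’s a minimal sketch in Python that sends the same Googlebot user agent string (the example.com URL is just a placeholder for a page on your own site):

import urllib.request

# The same user agent string used in the User Agent Switcher setup above
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Hypothetical URL -- replace with a page on your own site
url = "http://www.example.com/"

request = urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
with urllib.request.urlopen(request) as response:
    html = response.read().decode("utf-8", errors="replace")

# Print the start of the raw HTML served when the request claims to be Googlebot
print(html[:500])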

Another thing you can do is use a text browser like Lynx to get a rough idea of how your site looks to Google. Google Webmaster Tools also has a feature that can help. On the Webmaster Tools dashboard, click the “+” sign next to the “Labs” link in the left-hand column. You’ll see an option called “Fetch as Googlebot,” as in the first screenshot. Click on it, and it will fetch your site (or whatever URL you enter) as Googlebot sees it.

[Screenshot: the “Fetch as Googlebot” option under Labs in Google Webmaster Tools]

As in the second screenshot, you’ll see the HTML source, just like what you get when you click “View Source” in your browser. You’ll get a response code, like 200, which means everything is peachy, or 301, which means “permanent redirect.” You’ll also see what kind of server your website is on and any CSS files or scripts the page calls.

[Screenshot: the Fetch as Googlebot view of a page’s HTML source]
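If you want to check those response codes on your own, here’s a rough sketch in Python (the example.com URLs are hypothetical) that asks the server for a page’s status without following any redirects, so a 301 shows up as a 301:

import http.client
from urllib.parse import urlparse

def fetch_status(url):
    # Return the raw HTTP status for a URL without following redirects
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc)  # use HTTPSConnection for https URLs
    conn.request("HEAD", parts.path or "/")
    response = conn.getresponse()
    conn.close()
    return response.status, response.reason

# Hypothetical URLs -- a 200 means the page is served directly,
# a 301 means it permanently redirects somewhere else
print(fetch_status("http://www.example.com/"))          # e.g. (200, 'OK')
print(fetch_status("http://www.example.com/old-page"))  # e.g. (301, 'Moved Permanently')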

One caveat, however, is that it doesn’t always work with PDF files. Google insists it’s working on fixing the problem, and if your site looks OK in your browser, chances are it looks OK to Googlebot (even if it’s a PDF).

If you run a lot of scripts or have lots of layers on your site, this can be particularly handy. If your site is mostly simple HTML, your normal web browser will give you a pretty good idea of what Google sees on your site.

What Googlebot Sees as It Crawls Your Site

When Googlebot crawls your site, it uses computer algorithms to determine which sites to crawl, how often to crawl them, and how many pages to fetch from each site. It starts with a list of URLs from earlier crawls and with sitemap data. The bot notes changes to existing sites, new sites, and dead links for the Google index. As Googlebot processes each page, it takes in content tags and things like ALT attributes and title tags. Googlebot can process a lot of content types, but not all: it cannot read the contents of some dynamic pages or rich media files.
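To get a feel for what that leaves a crawler to work with, here’s a small sketch in Python (the sample page is made up) that pulls out the title text and image ALT attributes the way a simple parser might; the Flash object on the page contributes nothing it can read:

from html.parser import HTMLParser

class TextSignals(HTMLParser):
    # Collect the title text and image ALT attributes a crawler can read
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# A tiny made-up page: the title and ALT attribute are readable text,
# but nothing inside the Flash object is.
page = """
<html><head><title>Handmade Engagement Rings</title></head>
<body>
  <img src="ring.jpg" alt="platinum engagement ring with round diamond">
  <object type="application/x-shockwave-flash" data="ring-designer.swf"></object>
</body></html>
"""

parser = TextSignals()
parser.feed(page)
print("Title:", parser.title.strip())
print("ALT text:", parser.alts)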

There has been plenty of talk about how to handle Flash on your site. Googlebot doesn’t cope well with Flash content and links that are contained within Flash elements. Google has made no secret about its dislike for Flash content, saying that it is too user-unfriendly and doesn’t render on devices like PDAs and phones.

You do have some options, however, such as replacing Flash elements with something more accessible like CSS/DHTML. Web design using “progressive enhancement,” where the site’s design is built up in layers, allows all users, including search bots, to access content and functions. Amazon has a “Create your Own Ring” tool for designing engagement rings that is a good example of this type of functionality. There is also sIFR, or Scalable Inman Flash Replacement, an image replacement technique that uses CSS, JavaScript, and Flash to display any font in existence, even if it isn’t on the user’s computer, as long as the user can display Flash. sIFR is now officially approved by Google.

Google says that the bottom line is to show your users and Googlebot the same thing. Otherwise your site could look suspicious to the search algorithms. This rule takes care of a lot of potential problems, like the use of JavaScript redirects, cloaking, doorway pages, and hidden text, which Google strongly dislikes.
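A rough way to sanity-check that rule is to request the same page with a regular browser user agent and with Googlebot’s, then compare what comes back. Here’s a sketch in Python (the URL is hypothetical and the browser user agent string is just an example); keep in mind that ads, session IDs, and other dynamic content can make two requests differ for innocent reasons:

import urllib.request

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Gecko/20100115 Firefox/3.6"

def fetch(url, user_agent):
    # Fetch a URL with a specific user agent and return the response body
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()

# Hypothetical URL -- replace with a page on your own site
url = "http://www.example.com/"

as_browser = fetch(url, BROWSER_UA)
as_googlebot = fetch(url, GOOGLEBOT_UA)

# If the two responses differ, the server is sending Googlebot something
# other than what regular visitors get -- worth a closer look
if as_browser == as_googlebot:
    print("Same content served to browsers and to Googlebot.")
else:
    print("Different content! Check for accidental cloaking.")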

Google support engineers say that Googlebot looks at the content inside “noscript” tags, but that content should accurately reflect the Flash content it replaces, or else Googlebot may think you’re cloaking.

According to Google engineer Matt Cutts, it’s difficult to pull text out of a Flash file, but Google can do a fair job of it. It uses the Search Engine SDK tool from Adobe / Macromedia, and most search engines are expected to make that tool the standard for pulling text out of Flash. People who regularly use Flash might consider getting the tool as well and seeing for themselves what kind of text it pulls out of their graphics. In fact, Google may work with Adobe on updates to the tool.