
01 Nov 2003

More On The Robots Exclusion Protocol

Robots Exclusion Protocol
Back in the spring of 2003, I wrote an article on the Robots exclusion protocol that generated a lot of emails and many questions. I am still getting them, so it seems that a more extended article is warranted. Often referred to as the ‘Robots.txt file’, the Robots exclusion protocol can be a very important part of your search engine optimization program, but it needs to be carefully implemented to be successful.

If used incorrectly, this small and ‘innocent looking’ text file can cause a lot of problems. It can even cause your site to be excluded from the search engines’ databases if it’s not written correctly. In this extended article, I will show you how to write it correctly and make the Robots exclusion protocol an important part of your SEO efforts in attaining good visibility in the major search engines.

How The Robots Exclusion Protocol Works
Some of you may ask: what is it, and why do we need it? In a nutshell, as its name implies, the Robots exclusion protocol is used by Webmasters and site owners to prevent search engine crawlers (or spiders) from indexing certain parts of their websites. This could be for a number of reasons: sensitive corporate information, semi-confidential data, information that needs to stay private, or programs and scripts that should not be indexed.

A search engine crawler or spider is a Web ‘robot’ and will normally follow the robots.txt file (the Robots exclusion protocol) if it is present in the root directory of a website. The robots.txt exclusion protocol was developed in 1994 and remains, to this day, the Internet’s standard for controlling how search engine spiders access a particular website.
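
To make the mechanics concrete, here is a minimal sketch in Python (using the standard library’s urllib.robotparser module) of the check a compliant crawler performs before fetching a page: it first retrieves robots.txt from the site root, then tests its own user-agent against the rules. The domain and path below are hypothetical placeholders.

from urllib.robotparser import RobotFileParser

# Fetch and parse the robots.txt file at the site root (hypothetical domain).
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# A compliant crawler asks this question before retrieving any page.
url = "https://www.example.com/private_files/report.html"
if rp.can_fetch("Googlebot", url):
    print("robots.txt allows this URL to be crawled")
else:
    print("robots.txt disallows this URL")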

While the robots.txt file can be used to prevent access to certain parts of a website, if incorrectly implemented it can also prevent access to the whole site! On more than one occasion, I have found the robots exclusion protocol (the Robots.txt file) to be the main culprit behind a site not being listed in certain search engines. If it isn’t written correctly, it can cause all kinds of problems and, worst of all, you will probably never find out about it just by looking at your HTML code.

When a client asks me to analyse a website that has been online for about a year and still isn’t listed in certain engines, the first place I look is the robots.txt file. Once I have corrected and rewritten it for the website, and once the most important keywords have been optimized, the rankings will usually go up within the next thirty days or so.

How To Correctly Write The Robots.txt File
As the name implies, the ‘Disallow’ command in a robots.txt file instructs a search engine’s robots to “disallow reading”, but that certainly does not mean “disallow indexing”. In other words, a disallowed resource may still be listed in a search engine’s index, even if the search engine follows the protocol. On the other hand, an allowed resource, such as many of the public (HTML) files of a website, can be prevented from being indexed if the Robots.txt file isn’t carefully written for the search engines to understand.
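
For illustration, here is what a single rule looks like; the /reports/ directory is a hypothetical example. A record like this tells every compliant robot not to read anything under that path, although, as explained above, the URLs themselves may still surface in an index through external links:

User-agent: *
Disallow: /reports/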

The most obvious demonstration of this is the Google search engine. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file.

In so doing, it is not violating the robots.txt protocol, because it is not reading any disallowed resources; it is simply reading other websites’ links to those resources, which Google constantly uses for its PageRank algorithm, among other things.

Contrary to popular belief, a website does not necessarily need to be ‘read’ by a robot in order to be indexed. So how can the robots.txt file be used to prevent a search engine from listing a particular resource in its index? In practice, most search engines have placed their own, stricter interpretation on the robots.txt file, which allows it to be used to keep disallowed files out of their index entirely.

Most modern search engines interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index. Conversely, if it is already in their index, placed there by previous crawling activity, they will normally remove it. This last point is important: disallowing a resource can not only keep it out of an index, but also get it removed once it is there.

The inadequacies and limitations of the robots exclusion protocol are indicative of what can sometimes be a bigger problem. It is impossible to prevent any directly accessible resource on a site from being linked to by external sites, be they partner sites, affiliates, competitors’ websites or search engines.

Even where a robots.txt file exists, there is no legal or technical reason why its rules must be obeyed, least of all by humans creating links, for whom the standard was never written. In itself, this may not seem like a big problem, but there are many instances when a site owner would rather exclude a particular page from the Web altogether. If such is the case, the robots.txt file will, to a certain degree, help the site owner achieve his or her goals.

What Is Recommended
Since most websites change often, and new content is constantly created or updated, it is strongly recommended that you re-evaluate the Robots.txt file on your website at least once a month. If necessary, it only takes a minute or two to edit this small file and make the required changes. Never assume that ‘it must be OK, so I don’t need to bother with it’. Take a few minutes and look at the way it’s written (a quick way to pull up the live file for review is sketched after the list below). Ask yourself these questions:

1. Did I add some sensitive files recently?
2. Are there new sections I don’t want indexed?
3. Is there a section I want indexed but isn’t?
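
To answer these questions, you first need to see the live file. If you don’t have direct access to the server, a quick way to review it is to fetch it over HTTP; a browser pointed at /robots.txt does the same job. Here is a minimal Python sketch, with a hypothetical domain standing in for your own:

from urllib.request import urlopen

# Hypothetical domain; substitute your own site's address.
with urlopen("https://www.example.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))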

As a rule of thumb, before adding a file or a group of files containing sensitive information that you don’t want indexed by the search engines, you should edit your Robots.txt file first, then upload those files to your server. Make sure you place them in a separate directory; you could name it private_files or private_content, for instance. Then add each of those directories to your Robots exclusion file to prevent the spiders from indexing them, as shown below.
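
Using the hypothetical directory names mentioned above, the entries to add would look like this (the trailing slashes restrict each rule to the contents of that directory):

User-agent: *
Disallow: /private_files/
Disallow: /private_content/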

Conversely, if you have public files in a separate directory that you do want indexed, and they have been on your server for more than a month without being picked up, have a look at your Robots.txt file to make certain there are no errors in any of its commands.

Examples Of A Properly Written Robots.txt File
In the following example, I will show you how to properly write or edit a Robots.txt file. First, never use a word processor to write or edit these files. Today’s modern word processors insert special formatting and characters that will not be understood by the search robots and could lead to problems, or worse, cause them to ignore the Robots file completely.

Use a simple ‘pure vanilla’ ASCII text editor, or any text editor of the Unix variety. Personally, I always use the Notepad editor that comes with every Windows operating system. Make certain you save the file as ‘robots.txt’ (all in lower case). Remember that most Web servers today run Unix or Linux, whose file systems are case sensitive.

Here is a carefully written Robots.txt file:

User-agent: Titan
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: EmailWolf
Disallow: /

User-agent: ExtractorPro
Disallow: /

The user-agent is the name of the robot you want to disallow. In this example, I have chosen to disallow Titan, EmailCollector, EmailSiphon, EmailWolf and ExtractorPro. Note that many of these robots belong to spam operations that attempt to collect email addresses from websites, addresses that will probably be used in spam. Those unwanted robots take up unnecessary Internet bandwidth and slow down your Web server in the process. (Now you know where and how they usually get your email address.) It is my experience that most of those email collectors usually obey the Robots.txt protocol.
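
Note that ‘Disallow: /’ shuts the named robot out of the entire site, and a robots.txt file written this way restricts only the robots it explicitly names; every other robot remains free to crawl everything. If you want to state that explicitly, the protocol allows a catch-all record with an empty Disallow line, which disallows nothing:

User-agent: *
Disallow: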

Conclusion
Properly implementing the Robots exclusion protocol is a simple process and takes very little time. When used as intended, it helps ensure that the files you want indexed in your website will be indexed. It will also tell the search robots where they are not welcome, so you can concentrate on managing your online business in the safest way possible, away from ‘inquisitive minds’.

Author:
Serge Thibodeau of Rank For Sales