05 Oct 2003

Insurance To Protect A Site Against A Potential Penalty

Introduction
Used correctly, the Robots.txt protocol gives a webmaster or web site owner real protection against search engine penalties. Domain names are plentiful on the Internet today, and there are sites on just about any subject imaginable. Most offer genuine content that is of value to visitors and can help answer almost any query. However, as in the real world, what you see is not always what you get.

There are many sites out there spamming the engines. Spam is best defined as search engine results that have nothing to do with the keywords or key phrases used in the search. Visit any good SEO forum today and most of the daily spam threads point to hidden text, keyword stuffing in the meta tags, doorway pages and cloaking. Thanks to newer and more powerful search engine algorithms, domain networks that spam the engines are increasingly being penalized or banned altogether.

The inherent risk of getting a web site banned for spam increases proportionately if it appears to have duplicate listings or duplicate content. Rank for $ales does not recommend machine-generated pages, because such pages have a tendency to generate spam. Most of those so-called ‘page generators’ were simply not designed with search engines in mind.

One major drawback of these ‘machines’ is that once a page is ‘optimized’ for a single keyword or key phrase, first-level and at times second-level keywords tend to flood the results with listings that will most assuredly look like 100% spam. Stay away from those so-called ‘automated page generators’. A good optimization process starts with content written entirely by a human! That way, you can be certain that each page of your site ends up being absolutely unique.

How Do Search Engines Deal With Duplicate Content?
Modern crawler-based search engines have sophisticated and powerful algorithms specifically designed to catch sites that spam the engines, especially those that make use of duplicate domains. To be sure, some sites run duplicate domains for perfectly legitimate reasons, and their situation can be instructive. However, as the following example clearly demonstrates, that is not always the case.

Take this practical example: three identical web sites, all owned and operated by the same company, where the use of duplicate content is evident. Google, AltaVista and most other crawler-based search engines have noticed and indexed all three domains. In this scenario, the right thing to do is to use individual IP addresses and implement a server redirect command (a 301 redirect). An alternative would be to at least place each site in a unique folder or sub-directory and use the Robots.txt exclusion protocol to disallow two of the three affected domains.
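As a sketch of that second option, assuming each of the two duplicate domains can serve its own Robots.txt file from its root (rather than all three sharing one file), a minimal file on each duplicate might look like this, telling all compliant crawlers to stay out of the entire site:

    # Robots.txt placed at the root of each duplicate domain (hypothetical sketch)
    # Tells all compliant robots not to crawl anything on this domain
    User-agent: *
    Disallow: /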

That way the search engines wouldn’t index the two duplicate sites. In such cases, the Robots.txt exclusion protocol should always be used; it is in fact your best ‘insurance’ against getting your site penalized or banned. In the above example, since that was not done, we will look at the duplicate content and assess where the risk of a penalty is highest. We will refer to the three sites as site one (the main, primary domain), site two and site three.

The four major crawler-based engines analyzed were Google, Teoma, FAST and AltaVista. All three domain names point to the same IP address, which made it simpler to use FAST’s Internet Protocol filter to confirm that there were no more than three affected domains in this example. However, all three web sites point to the same IP address AND the same content folder! That makes them exact duplicates, raising all the duplicate content flags in all four engines analyzed.

Even though all three sites share the same Robots.txt file, the hosting arrangement and the syntax in that file do nothing effective to address this duplicate content problem. Major spider-based search engines, which today rely heavily on hypertext to compute relevancy and importance, are very good at discovering and dealing with sites that run into duplicate content issues. As a direct result, a webmaster in this situation runs a large risk in these engines, because their algorithms make it a simple task to analyse, sort out and finally reject duplicate content web sites.

If a ‘spam technician’ discovers duplicate listings, chances are very good they will take action against the offending sites. The chances increase when a person, often a competitor, files a complaint that a certain site is spamming or ‘spam-dexing’ the engines. To be sure, pages created through duplicate content can improperly ‘populate’ a search query, with the end result of unfairly dominating the search results.

Marketing Analysis And PPC “Landing” Pages
In order to better analyse specific online marketing campaigns or surveys, some companies at times run duplicate sites or operate PPC (Pay-per-Click) landing pages. In such cases it is important not to neglect the Robots.txt exclusion protocol to manage your duplicate sites. Disallow spiders from crawling the duplicates by using the right syntax in the Robots.txt file, as shown in the sketch below. Your index count will certainly decrease, but that is the right thing to do, and you are actually doing the search engines a service. In such a case, a webmaster need not worry about impending penalties from the engines.
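For instance, if the PPC landing pages are kept in their own directory (the /landing/ folder name here is purely hypothetical), a minimal Robots.txt entry on that domain might read:

    # Keep compliant crawlers out of the PPC landing pages (hypothetical folder name)
    User-agent: *
    Disallow: /landing/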

If these businesses or their marketing departments are running marketing tests or surveys, there is usually more than one domain that could potentially appear in the results pages of the engines. In such cases, I strongly recommend rewriting all the content and making certain that no real duplicate content gets indexed. One way to achieve that is to use some form of meta refresh tag or JavaScript solution to direct visitors to the most recent versions of the pages while their webmasters get the Robots.txt exclusion protocol written correctly.
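As a rough sketch of that stopgap, the head section of an outdated page on a duplicate domain could carry something like the following (the destination URL is hypothetical); the meta refresh sends visitors straight to the current version, and the script does the same in browsers running JavaScript:

    <!-- Redirect visitors to the most recent version of this page (hypothetical URL) -->
    <meta http-equiv="refresh" content="0; url=http://www.primary-domain.com/current-page.html">
    <script type="text/javascript">
      window.location.replace("http://www.primary-domain.com/current-page.html");
    </script>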

The JavaScript effectively indicates where the redirect is intended to go, ensuring the visitor lands on the final document in its proper place. A ‘301 server redirect’ is always the best thing to use in these cases and constitutes the best insurance against any penalties, as it informs the search engines that the affected document(s) have moved permanently.
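On an Apache server, for example, a permanent redirect for a duplicate domain can be set up with a single directive in its .htaccess file (the primary domain name below is hypothetical); every request then answers with a 301 status pointing at the main site:

    # .htaccess on the duplicate domain (hypothetical sketch)
    # Answer every request with a 301 pointing to the primary domain
    Redirect permanent / http://www.primary-domain.com/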

Author:
Serge Thibodeau of Rank For Sales