
02 Dec 2003

The High & Low On Bayesian Spam Filters

Bayesian Spam Filter & Google
On November 26, Stephen Lynch, a journalist at the New York Post, telephoned me for an interview about an article I had written the previous day. That article dealt with the current November Google update “dance”, dubbed “Florida”.

The following day, Mr. Lynch published his own article in the New York Post, offering his comments and explaining, in non-technical terms, some of the negative effects such an update can have on the average site owner or webmaster.

As the latest “Florida” monthly Google update ‘dance’ has shown us, a high ranking on the Internet’s number one search engine can disappear without warning; when search rankings drop as precipitously as some did, the result can be a devastating blow to online stores and other commercial websites.

In the last 10 days, many articles have also been written by colleagues, some in the SEO field and some, like Seth Finkelstein, who approach the issue more from the standpoint of the free flow of information that the Internet can provide.

In this article, I will attempt to describe some of the spam-filtering techniques that Google is reported to be using during this Florida “dance”. This spam-filtering technology is based on the Bayesian algorithm.

The Inner-Workings of a Spam Filter for a Search Engine
For quite a long time now, Google’s search results have been under attack by search-engine spammers who continuously attempt to manipulate the results, cluttering the search engine’s database with irrelevant information.

With the ever-growing popularity of Google, and as it handles more and more of the searching done across the Web, the temptation to foul its search results has become attractive to certain spammers, leading to substantial degradation in the quality and relevance of Google’s results. Since Google is mostly concerned with quality search results that are relevant, it is now cracking down on these unscrupulous spammers with new spam-filtering algorithms, using Bayesian filtering technology.

At the end of October 2003, Google deployed its new Bayesian anti-spamming algorithm, which appeared to cause the search results to crash whenever a previously identified spam site would normally have been displayed. In fact, the search results were aborted entirely when such a spam-intended site was encountered. See “Google Spam Filtering Gone Bad” by Seth Finkelstein for more technical information on how this spam-elimination algorithm works at Google.

The First Shoe That Fell
On or around November 5th, this spam problem was in fact reduced significantly as these new Bayesian anti-spam filters “kicked in”. Although not perfect, the new Bayesian spam-filtering technology seemed to work, albeit with crashes in some cases.

On or about November 15th, 2003, Google, as it does every month, started “dancing”, performing its extensive monthly deep crawl of the Web and indexing more than 3.5 billion pages. This update produced some rather strange results, reminding some observers of a previous major algorithm change in April 2003, dubbed update “Dominic”, where similarly unpredictable results could be noted across the Web.

It was generally observed that many ‘old’, high-ranking sites, some of which were highly regarded as ‘authoritative’ and were certainly not spammers in any way, fell sharply in the rankings or disappeared entirely from Google’s search results.

Since then, there have been many explanations, some not very scientific, that attempt to account for an event some have categorized as “serious”. For one of the better explanations, read Barry Lloyd’s article: “Been Gazumped by Google? Trying to Make Sense of the ‘Florida’ Update!”.

More on the Bayesian Spam Filter
Part of my research and observations in this matter point to the Bayesian spam filter that Google began implementing in late October. A “Bayesian spam filter” is a complex algorithm used to estimate the probability, or likelihood, that certain content or material detected by Google is in fact spam. In its most basic form, the Bayesian spam filter determines whether something “looks spammy” or whether, on the other hand, it is relevant content that will truly help the user.
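Google has never published the details of its filter, so the sketch below should be read only as an illustration of the general idea, in the spirit of Paul Graham’s token-based approach cited in the references: each token carries a learned spam probability, and those probabilities are combined with Bayes’ rule to score a page. All token lists and numbers here are invented purely for demonstration.

```python
# Minimal sketch of token-based Bayesian spam scoring (illustration only).
# The per-token probabilities below are invented; a real filter would learn
# them from large sets of known-spam and known-legitimate pages.
import math

# Hypothetical P(spam | token) values, e.g. from token counts in each class.
TOKEN_SPAM_PROB = {
    "cheap": 0.92,
    "guaranteed": 0.90,
    "casino": 0.97,
    "research": 0.10,
    "bayesian": 0.05,
}

def spam_probability(tokens, unknown=0.4):
    """Combine per-token probabilities under the naive assumption that
    tokens are independent given the class, with equal priors."""
    log_spam = 0.0   # log P(tokens | spam)
    log_ham = 0.0    # log P(tokens | not spam)
    for t in tokens:
        p = TOKEN_SPAM_PROB.get(t.lower(), unknown)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Bayes' rule with equal priors:
    # P(spam | tokens) = P(tokens|spam) / (P(tokens|spam) + P(tokens|ham))
    return math.exp(log_spam) / (math.exp(log_spam) + math.exp(log_ham))

print(spam_probability("cheap guaranteed casino".split()))  # close to 1.0
print(spam_probability("bayesian research".split()))        # close to 0.0
```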

To a certain degree, the Bayesian algorithm has proven effective in the war against spam in the search engines. Having been ‘bombarded’ by spam as much as Google has for the past couple of years, it has had no choice but to implement such anti-spam safeguards to protect the quality and relevancy of its search results.

However, the general feeling in the SEO community is that, unfortunately, the current Bayesian implementation has extreme and unpredictable consequences that were practically impossible to anticipate beforehand.

At the outset, one of the problems with estimating the probability or likelihood that certain content contains spam is that, on very large datasets such as the entire Web, many “false success stories” (cases where the filter wrongly flags legitimate content) can and will occur. It is exactly these false positives that are at the centre of the current problem.
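A quick back-of-the-envelope calculation shows why scale matters here; the 0.1% error rate below is an assumed figure chosen purely for illustration, not a measurement of Google’s filter.

```python
# Illustration only: even a tiny misclassification rate, applied across an
# index of roughly 3.5 billion pages (the figure cited above), translates
# into millions of legitimate pages being wrongly filtered.
index_size = 3_500_000_000        # pages in Google's index (from the article)
false_positive_rate = 0.001       # assumed: 0.1% of legitimate pages misflagged

print(f"{index_size * false_positive_rate:,.0f} pages wrongly flagged")
# -> 3,500,000 pages wrongly flagged
```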

Since this whole event began to unfold, many people have noted in tests and evaluations that making a search more selective, for instance by excluding an irrelevant string from the query, tends to deactivate the new search-results algorithm, which in turn effectively shuts down the newly implemented Bayesian anti-spam solution at Google.
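For readers who want to try this kind of comparison themselves, the snippet below simply builds the two query variants observers were comparing at the time. The example query is invented; the “-” prefix is Google’s standard exclusion operator, and the nonsense token is arbitrary.

```python
# Build the two query variants compared during the Florida update: the plain
# query, and the same query with an irrelevant string excluded via Google's
# "-" operator. The nonsense token appears in no real documents.
base_query = "discount office furniture"
modified_query = base_query + " -qwertyuiopnonsense"

print("plain:   ", base_query)
print("modified:", modified_query)
# Reports at the time suggested the modified form often returned results
# resembling the pre-Florida rankings, implying the new filter was bypassed.
```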

One More Observation
While we are still on the subject of the new filter, and as a side note away from the spam-related issues, while doing some testing with the new Florida update I noticed that Google is now ‘stemming’. To my knowledge, this is the first time Google has offered such an important search feature. How does stemming work? Well, for example, if you search for ‘reliability testing in appliances’, Google would also suggest ‘reliable testing in appliances’.

To a certain degree, variants of your search terms will be highlighted in the snippet of text that Google provides with each result. The new stemming feature will certainly help a lot of people searching for information. Again, Google tries to make its searches as relevant as they can be, and this new stemming feature seems to be a continuation of those efforts.
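Google has not published how its stemming works; the deliberately crude suffix-stripping rules below are only meant to illustrate the general idea of collapsing word variants onto a common root so that they can match one another.

```python
# Crude illustration of stemming: strip common suffixes so that variants such
# as "reliability", "reliable" and "reliably" map to the same root. A real
# stemmer (and whatever Google actually uses) is far more sophisticated.
SUFFIXES = ["ability", "ably", "able", "ing", "ly", "es", "s"]

def crude_stem(word):
    w = word.lower()
    for suffix in SUFFIXES:          # longest, most specific suffixes first
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

for term in ["reliability", "reliable", "reliably", "testing", "tests"]:
    print(term, "->", crude_stem(term))
# reliability -> reli, reliable -> reli, reliably -> reli,
# testing -> test, tests -> test
```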

Conclusion
In retrospect, and in re-evaluating all the events that have happened during this major dance, it is clear that Google is still experimenting with its newly implemented algorithm and that many important adjustments will need to be made to make it more effective.

With spam a growing problem day by day, today’s modern search engines have no choice but to implement better and more ‘intelligent’ spam-filtering algorithms that can make the distinction between what is considered spam and what isn’t.

The next 30 days can be viewed by some as critical for the proper ‘fine-tuning’ and deployment of this new breed of application in the war against spam. How the major search engines handle it will be crucial for the commercial websites and online storefronts that rely solely on their Google rankings for the bulk of their sales.

In light of all this, perhaps some companies in this position would be well advised to evaluate alternatives such as PPC and paid-inclusion marketing programs as complements. At any rate, it is my guess that search will continue to be an important and growing part of online marketing, locally, nationally and globally.

______________
References:

1) “Google Spam Filtering Gone Bad” (an anticensorware investigation), by Seth Finkelstein
http://sethf.com/anticensorware/general/google-spam.php

2) “Better Bayesian Filtering”, by Paul Graham
http://www.paulgraham.com/better.html

Author:
Serge Thibodeau of Rank For Sales