Optimizing Dynamic Web Pages For Search Engines
Dynamically Generated Web Pages
The contents of dynamically generated web pages are often invisible to search engine spiders, which is why these pages are rarely indexed. You can get your dynamically generated web site listed in the search engine results by making its content visible to the spiders.
Generally, a dynamic web page is a template that displays distinct information in response to queries made by visitors. A major part of the page content comes from the database connected to the web site. Visitors find dynamically generated web pages very impressive, as they get instant access to the data they want. Also, these sites are easy to update. If there is a new product or some modifications to the price, the web master just has to edit the database, instead of editing hundreds of individual static web pages.
Usually, dynamic web pages are created using technologies like CGI, ASP, Cold Fusion etc. As we just mentioned, this works well from a user’s standpoint, but from a search engine’s standpoint it can be difficult.
The Problems With The Dynamic Pages
The problem is that these dynamic web pages don’t actually exist until a visitor supplies values for the variables in the form of a query. That query can either be typed into a search form by the visitor, or already be coded into a link on the home page, which makes the link a pre-defined search of the site’s catalogue. Search engine spiders do not select variables and insert values in them. Dynamic scripts often need certain information before they can return the page content: cookie data, a session ID or a query string are common requirements.
But a search engine spider doesn’t know to use your search function, or what questions to ask. Spiders usually stop indexing a dynamic site, simply because they can’t answer the question.
Apart from this, the URLs of these dynamically generated pages contain a question mark (?), and often other symbols such as &, %, + and $. The part of the URL that follows the question mark is called the “query string”. However, most spiders cannot read any character beyond the question mark in a dynamic URL. A sample URL:
http://www.americanbooks.com/cgi-bin/items.cgi?name=naturaldiet
Most spiders will not read beyond the ? symbol in this URL; the ? sign acts as a stop symbol for them. So they will retrieve a URL that looks like this:
http://www.americanbooks.com/cgi-bin/items.cgi
As this is not an actual page, nothing gets indexed. So if your web site or parts of it are dynamically generated, you have to implement certain changes in order to make your content easily accessible to the spiders.
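The truncation described above can be illustrated with a short Python sketch. The function name is hypothetical, and the code only mimics the behaviour this article attributes to cautious spiders; it is not taken from any real crawler.

```python
from urllib.parse import urlsplit, urlunsplit


def url_as_seen_by_cautious_spider(url: str) -> str:
    """Return the URL with everything from the '?' onwards dropped."""
    parts = urlsplit(url)
    # Keep scheme, host and path; discard the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


dynamic_url = "http://www.americanbooks.com/cgi-bin/items.cgi?name=naturaldiet"
print(url_as_seen_by_cautious_spider(dynamic_url))
# prints http://www.americanbooks.com/cgi-bin/items.cgi
```

The value that selects the content (name=naturaldiet) is exactly the part that is lost, which is why the spider ends up requesting a page that has nothing to show.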
Some search engines also avoid indexing the URLs of static pages that appear to be within a cgi-bin or cgi directory, such as:
http://www.americanbooks.com/cgi-bin/items.html
http://www.americanbooks.com/cgi/items.html
Why Do Search Engines Not Read Beyond ‘?’ Symbol
Spiders prefer not to read within the cgi-bin directory, or to read URLs that contain a ‘?’ symbol, because a CGI script can supply an ‘infinite’ number of URLs. The spider would then keep crawling those pages and fall into an infinite loop from which it cannot escape. Such situations are called spider traps. Database programs may create a similar situation for the spider.
So, to avoid possible traps, spiders do not read the characters of a URL beyond the ? symbol.
A spider trapped inside your server is bad not just for the spider: its repeated requests for pages can also crash the server.
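To make the idea of a spider trap concrete, here is a minimal Python sketch. The calendar script, its URLs and the function names are all invented for illustration. The safeguard shown, a limit on the number of pages fetched, is a different precaution from ignoring the query string, and it is used here only because it keeps the example short.

```python
from collections import deque


def fake_site_links(url: str) -> list[str]:
    """Hypothetical dynamic script: every page links to a 'next' page, forever."""
    page = int(url.rsplit("=", 1)[1])
    return [f"/cgi-bin/calendar.cgi?page={page + 1}"]


def crawl(start: str, max_pages: int) -> list[str]:
    """Breadth-first crawl that gives up after max_pages fetches."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue and len(fetched) < max_pages:
        url = queue.popleft()
        fetched.append(url)
        for link in fake_site_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched


pages = crawl("/cgi-bin/calendar.cgi?page=1", max_pages=5)
print(len(pages))  # prints 5
```

Without the max_pages limit the loop would never end, because the script always produces one more URL that the spider has not yet seen.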
Dynamic Page Optimization Solutions
http://www.americanbooks.com/cgi-bin/items.cgi?name=naturaldiet
The above URL says that americanbooks.com has some content on natural diets. But this content cannot be indexed by the search engines, as the spiders cannot read it. If a competitor of americanbooks.com has a static page on natural diets, that page can get listed in the search engines, and the competitor will receive all the traffic from people searching with the keyword “natural diet”. Despite having similar or more useful content, americanbooks.com cannot get listed in the search engine results, and it will lose sales to its competitors.
So if the owner of americanbooks.com changes the URLs of the dynamic pages so that they no longer contain ?, = and similar symbols and look like static URLs, the pages can be indexed and the company’s sales can improve.
Solutions For Dynamic Pages
CGI/ Perl
If you have used CGI or Perl in your web site, the solution is a script that reads its parameters from the URL path instead of from a query string. The script picks up the characters of the URL that follow the script name and assigns them to a variable, which it then uses in place of the query string values.
PATH_INFO is the CGI environment variable that holds the extra path information following the script name, while SCRIPT_NAME holds the path of the script itself; neither includes the query string. So, for a URL of the form http://www.americanbooks.com/cgi-bin/items.cgi/naturaldiet, the script can extract the characters after the script name from the PATH_INFO variable and assign them to the variable it would otherwise have read from the query string.
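As a rough illustration of this approach, the Python sketch below reads parameters from PATH_INFO. In the CGI standard, PATH_INFO carries the part of the URL path that follows the script name. The /key/value layout of the path and the function name are assumptions made for this example, not a convention required by CGI.

```python
def params_from_path_info(environ: dict) -> dict:
    """Read key/value pairs from PATH_INFO, e.g. '/name/naturaldiet'."""
    path_info = environ.get("PATH_INFO", "")
    segments = [s for s in path_info.split("/") if s]
    # Pair up segments: /key1/value1/key2/value2 -> {key1: value1, key2: value2}
    return dict(zip(segments[0::2], segments[1::2]))


# Simulated CGI environment for a request to the hypothetical URL
# http://www.americanbooks.com/cgi-bin/items.cgi/name/naturaldiet
environ = {"SCRIPT_NAME": "/cgi-bin/items.cgi", "PATH_INFO": "/name/naturaldiet"}
print(params_from_path_info(environ))  # prints {'name': 'naturaldiet'}
```

In a real CGI script the dictionary would be os.environ rather than a hand-built one; a plain dictionary is used here so that the example runs without a web server.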
Note that the major search engines have no problems indexing pages that are built in part with SSI (server-side include) content. It makes no difference if the pages end in the .shtml extension that some people use. However, there may be a problem if the pages use the cgi-bin path in their URLs.
ASP
What will you do if you have used ASP?
Active Server Pages (ASP) are used on Microsoft-based web servers. Web pages that use ASP usually have a .asp extension. It is arguably the most widely used scripting style for large Internet sites. Most of the major search engines will index these pages if you avoid using the ? symbol.
Exception Digital Enterprise Solutions has created a solution for this, called xqasp, an add-on application that allows the “?” within URLs to be converted to “/” by the web server. Further information on the usage and implementation of this product is available at:
http://www.xde.net/products/product_xqasp.htm
This is a somewhat costly option, priced at around $250, but it is worth every cent that you pay for it.
Some low cost solutions are:
ASPSpiderBait
http://www.webanalyst.com.au/Products/ASPSpiderBait.htm
A shareware product that enables dynamic content in sites using Active Server Pages to be included in the major search engine indexes by converting the PATH_INFO part of an HTTP header.
PortalPageFilter
http://www.alphasierrapapa.com/products/portalpagefilter/
It removes the ? symbols from ASP services. But it can be a barrier to some search engine spiders.
Cold Fusion
If you have used Cold Fusion in your web site, you’ll need to reconfigure it on your server. Web pages developed using Cold Fusion usually end with the .cfm extension. Normally, such pages use a ? symbol in the URL to retrieve content from the database, but there are workarounds that will make your pages accessible. Reconfigure your Cold Fusion setup to replace the “?” in a query string with a “/” and pass the value as part of the URL path. The result looks like the URL of a static page.
Instead of http://www.americanbooks.com/items.cfm?item_id=11667, you get a string like this: http://www.americanbooks.com/items.cfm/11667.
So when the search engine spider comes to index the page, it will not encounter the “?”, and it will go ahead and index the complete dynamic page.
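The URL transformation itself can be sketched in a few lines of Python. This is not Cold Fusion configuration; it only shows how links of the first form could be rewritten into the second when pages are generated, and the function name is hypothetical. It assumes the script on the server knows which value belongs to which parameter from its position in the path.

```python
from urllib.parse import parse_qsl, quote, urlsplit


def to_static_looking_url(url: str) -> str:
    """Move query string values into the path: items.cfm?item_id=1 -> items.cfm/1."""
    parts = urlsplit(url)
    # Keep only the values, in order, and percent-encode them for use in a path.
    values = [quote(value, safe="") for _key, value in parse_qsl(parts.query)]
    path = "/".join([parts.path.rstrip("/")] + values)
    return f"{parts.scheme}://{parts.netloc}{path}"


print(to_static_looking_url("http://www.americanbooks.com/items.cfm?item_id=11667"))
# prints http://www.americanbooks.com/items.cfm/11667
```

Because the parameter names are dropped, this form works best for pages that take one parameter or a small, fixed sequence of them.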
Apache Server
Apache is a popular web server. It has a rewrite module that enables you to turn URLs containing query strings into URLs that search engines can index. This module, called mod_rewrite, isn’t installed with the Apache software by default, so you should check with your web hosting company to see whether it’s available for your server.
It takes a URL which initially looked like this:
http://www.americanbooks.com/items.htm?cat=natural_diet,
and makes it available in this format:
http://www.americanbooks.com/natural_diet/index.htm
Further information on this module is available at:
http://httpd.apache.org/docs/mod/mod_rewrite.html
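The mapping that such a rewrite rule performs can be imitated in Python with a regular expression. The sketch below is not Apache configuration and does not use mod_rewrite; the pattern and function name are assumptions chosen to match the example URLs above. It shows the direction the server works in: the spider requests the static-looking path, and the server translates it internally into the query string form that the script understands.

```python
import re

# Hypothetical rule: /<category>/index.htm is served by items.htm?cat=<category>
PATTERN = re.compile(r"^/([A-Za-z0-9_]+)/index\.htm$")


def rewrite(path: str) -> str:
    """Translate a static-looking path into the internal dynamic URL."""
    match = PATTERN.match(path)
    if match is None:
        return path  # no rule applies; leave the path unchanged
    return f"/items.htm?cat={match.group(1)}"


print(rewrite("/natural_diet/index.htm"))  # prints /items.htm?cat=natural_diet
```

The spider only ever sees the left-hand form, so the question mark never appears in the URLs it collects.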
Directly submit dynamic pages through paid inclusion programs
Directly submitting specific dynamic web pages to AltaVista increases the chances of getting picked up by that search engine. If you submit your dynamic pages to AltaVista or Inktomi via their paid inclusion programs, it definitely ensures they will get in.
File Extensions
These days it does not matter much to the search engines how your file names end. Even if your pages don’t end in .html or .htm, they’ll probably still get indexed if you’ve dealt with the ? symbol. Northern Light is among the more flexible engines and can index any page with the .html, .htm, .shtml, .stm, .asp, .phtml, .cfm, .php3, .php, .jsp, .jhtml, .asc, .text, and .txt extensions. But it will not index pages ending in .cgi. It will inform you during the submission steps if any other extensions are invalid.
Finale
Ask yourself whether you really need to generate dynamic web pages. Often, the database is used only as a page creation tool. In that case you can use it to create static pages, especially for sections of your site that do not change frequently. Also, consider creating static mirror pages of your dynamic content, so that the search engines can spider them.
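Generating static pages from a database can be sketched as follows in Python. The table layout, the column names and the sample row are invented for illustration, and an in-memory SQLite database stands in for whatever database the site really uses.

```python
import sqlite3
from html import escape


def build_static_pages(conn: sqlite3.Connection) -> dict[str, str]:
    """Return a mapping of file name -> HTML, one static page per database row."""
    pages = {}
    rows = conn.execute("SELECT slug, title, price FROM items ORDER BY slug")
    for slug, title, price in rows:
        pages[f"{slug}.html"] = (
            f"<html><head><title>{escape(title)}</title></head>"
            f"<body><h1>{escape(title)}</h1><p>Price: ${price:.2f}</p></body></html>"
        )
    return pages


# Hypothetical catalogue with a single sample item.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (slug TEXT, title TEXT, price REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [("naturaldiet", "Natural Diet", 12.5)])

pages = build_static_pages(conn)
print(sorted(pages))  # prints ['naturaldiet.html']
```

Writing each entry of the returned mapping to a file, and re-running the script whenever the database changes, gives static pages that spiders can read without any query string.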
Until a few months ago, most engines did not index dynamically generated pages, and for many of them this is still true. That is because a dynamically generated page can cause the spider to get trapped in an infinite loop from which it cannot escape. But the situation is slowly changing. The first search engine to index dynamic web pages was Google, which began indexing dynamic pages (including the question mark) at the end of 2000. You can also submit a dynamically generated page to HotBot and some others.
These search engines will not follow links from a dynamically generated page, to ensure that their spiders do not get stuck in a loop. So the advice is to spend a little of your time on your dynamically generated pages and make sure each of them gets indexed by the search engines. It will be a great investment of your time, and it will also result in more traffic for your web site.
Author:
Search Engine Ethics