Executive Overview
Creating an effective Web presence involves more than just creating beautiful web pages. As with any business, advertising a site's existence to potential customers is vital to its success. The most basic form of advertising on the Web is the search engine. Search engines are the “Yellow Pages” of the Web, and it is important that we are properly represented in them so that customers who are looking for us can find us.
This document is a summary of popular search engines and the techniques involved in creating a presence in them. It finishes with a proposal for the presentation of web content to the search engines.
How Search Engines Work
The term "search engine" is often used generically to describe both true search engines and directories. They are not the same. The difference is how listings are compiled.
Search Engines Vs. Directories
Search Engines: Search engines, such as HotBot, create their listings automatically. Search engines crawl the web, then people search through what they have found.
If you change your web pages, search engines eventually find these changes, and that can affect how you are listed. Page titles, body copy and other elements all play a role.
Directories: A directory such as Yahoo depends on humans for its listings. You submit a short description to the directory for your entire site, or editors write one for sites they review. A search looks for matches only in the descriptions submitted.
Changing your web pages has no effect on your listing. Things that are useful for improving a listing with a search engine have nothing to do with improving a listing in a directory. The only exception is that a good site, with good content, might be more likely to get reviewed than a poor site.
Hybrid Search Engines: Some search engines maintain an associated directory. Being included in a search engine's directory is usually a combination of luck and quality. Sometimes you can "submit" your site for review, but there is no guarantee that it will be included. Reviewers often keep an eye on sites submitted to announcement places, then choose to add those that look appealing.
The Parts Of A Search Engine
Search engines have three major elements. First is the spider, also called the crawler. The spider visits a web page, reads it, and then follows links to other pages within the site. This is what it means when someone refers to a site being "spidered" or "crawled." The spider returns to the site on a regular basis, such as every month or two, to look for changes.
Everything the spider finds goes into the second part of a search engine, the index. The index, sometimes called the catalog, is like a giant book containing a copy of every web page that the spider finds. If a web page changes, then this book is updated with the new information.
Search engine software is the third part of a search engine. This is the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant.
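The three parts described above can be pictured with a small sketch. It is only an illustration, not how any particular search engine is built: the class name Index, the word-to-URL lookup table, and the example.com addresses are all assumptions made for this example. The add method stands in for a spider visit, and search stands in for the search engine software (without any ranking).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class Index:
    """Toy index: keeps a copy of each page plus a word -> URLs lookup table."""

    def __init__(self):
        self.pages = {}                   # url -> latest copy of the page text
        self.postings = defaultdict(set)  # word -> set of urls containing it

    def add(self, url, text):
        """Store or refresh a page, as a spider visit or revisit would."""
        if url in self.pages:
            # Drop stale entries before re-indexing a changed page.
            for word in set(tokenize(self.pages[url])):
                self.postings[word].discard(url)
        self.pages[url] = text
        for word in set(tokenize(text)):
            self.postings[word].add(url)

    def search(self, query):
        """Return the URLs of pages that contain every word of the query."""
        words = tokenize(query)
        if not words:
            return set()
        results = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            results &= self.postings.get(word, set())
        return results


index = Index()
index.add("http://example.com/amps", "Signetics amplifiers and op amps")
index.add("http://example.com/jobs", "Job listings for engineers")
print(index.search("signetics amplifiers"))
```

Calling add again for a URL that is already stored replaces the old copy, which mirrors the spider returning to a changed page.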
Some search engines can have a tough time with web pages that use frames. Using frames either prevents them from finding pages within a web site because they are unable to follow frame links, or it causes them to send visitors into a site without the proper frame "context" being established. In most cases both problems can be corrected with judicious use of the noframes tag.
Directories
Web directories present users with a menu of broad subjects. Users select the subject category that covers their particular area of interest, and then select links to sub-menus closer to their topics until they find links to relevant resources.
Menu-driven Web directories, like Yahoo! and The Argus Clearinghouse, have the advantage over search engines that they are assembled by teams of editors who specialize in the available subjects. Where keyword searches can sometimes make the user sort through thousands of links, with a very low ratio of results close to what the user is actually looking for, Web directories can often point users to relevant information with just a few menu selections. Though there is no guarantee that every site linked to a Web directory will be accurate and relevant, the odds of finding irrelevant information in a Web directory are much lower than in the results of a Web search engine.
The distinction between Web directories and search engines is becoming weaker all the time, however. Yahoo!, one of the first directories on the Web, has always offered users the option of searching by keyword. Web search engines like Lycos and Infoseek have now retrofitted their sites to give users the choice to browse links by subject. Excite blurs the categories even further by offering topical searches and links to "related" pages with its search results.
Regionalism
Many major search engines maintain both a "world" index plus a "country" index, which provides greater regional coverage. Some of the same pages may be listed in both, but the country index may have greater depth.
Services that rely on human editors, such as Yahoo, LookSmart and the Open Directory, create regional listings through classification. Sites relevant to a particular country are listed within a country-specific edition of the directory.
Many of the major crawler-based search engines now have their spiders check pages for common words and markers of specific languages. When these are found, the page is "tagged" internally as being in that language. When users search in a regional version of that search engine, they get only these pages.
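Language tagging of this kind could be sketched as follows. This is a simplified illustration and not the method of any named search engine: the word lists in COMMON_WORDS and the function name tag_language are assumptions, and a real spider would use far larger lists and additional markers.

```python
import re

# Assumed, deliberately tiny lists of very common words per language.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "is"},
    "german": {"der", "die", "das", "und", "ist"},
}


def tag_language(text):
    """Return the language whose common words occur most often, or None."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    counts = {
        language: sum(1 for word in words if word in markers)
        for language, markers in COMMON_WORDS.items()
    }
    best = max(counts, key=counts.get)
    # A page with no marker words at all is left untagged.
    return best if counts[best] > 0 else None


german_tag = tag_language("Der Verstärker ist das beste Produkt und die Preise sind gut")
english_tag = tag_language("The price of the amplifier is low")
```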
In domain filtering, a search engine's "world" index is filtered so that only sites from a particular country’s domain will appear.
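Domain filtering can be illustrated with a few lines of code. The function name filter_by_country_domain and the example.* addresses are hypothetical; the sketch simply assumes that a country is identified by the last label of the host name, such as "de" for Germany.

```python
from urllib.parse import urlparse


def filter_by_country_domain(urls, country_code):
    """Keep only URLs whose host name ends in the given country-code domain."""
    suffix = "." + country_code.lower()
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if host.endswith(suffix):
            kept.append(url)
    return kept


# Hypothetical "world" index entries.
world_index = [
    "http://www.example.de/produkte.html",
    "http://www.example.com/products.html",
    "http://shop.example.co.uk/",
]
german_results = filter_by_country_domain(world_index, "de")
```

A filter like this would miss a German-language site hosted under .com, which is one reason language tagging and domain filtering are complementary.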
Any company with a world presence should make the effort to maintain a presence in at least some regions, such as China and Germany, that have very strong regional directory services. However, this task may be left to overseas franchises.
Ranking
All search engines, including directories, score the relevancy of web pages through the use of a ranking algorithm. The purpose of this is to deliver links to the web pages most relevant to each search phrase. Ranking is important if we want our company to appear in the first page or two of any search.
When a searcher types in a phrase on a search engine and hits the "Search" button, the ranking algorithm jumps into action. Say, for example, that a surfer types in "signetics amplifiers" as their search phrase. The algorithm then looks in its database, searching for occurrences of the entire search phrase, or for occurrences of the individual key words "signetics", or "amplifiers" (extremely common words like "in" are usually ignored).
Each ranking algorithm assigns different weights to different occurrences of the key words, depending on where and in what form these matches are found. Taking all these factors into account, these algorithms generate relevancy scores. The scores are sorted and the corresponding web pages are listed in order with informative summary information from the database. We naturally want our name to appear early in the list of searches involving electronic components or in lists of electronic suppliers.
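The scoring process described above might look something like the sketch below. The weights (3 for a title match, 1 for a body match, a bonus of 5 for the whole phrase), the stop-word list, and the sample pages are all assumptions chosen for illustration; real ranking algorithms are proprietary and differ from engine to engine.

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # assumed list
TITLE_WEIGHT = 3.0   # assumed: a match in the title counts more
BODY_WEIGHT = 1.0    # assumed: a match in the body text
PHRASE_BONUS = 5.0   # assumed: bonus when the whole phrase appears in the body


def tokenize(text):
    """Lowercase a text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevancy_score(query, title, body):
    """Score one page against a search phrase using the assumed weights."""
    keywords = [w for w in tokenize(query) if w not in STOP_WORDS]
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for word in keywords:
        score += TITLE_WEIGHT * title_words.count(word)
        score += BODY_WEIGHT * body_words.count(word)
    if len(keywords) > 1:
        phrase = " " + " ".join(keywords) + " "
        if phrase in " " + " ".join(body_words) + " ":
            score += PHRASE_BONUS
    return score


def rank(query, pages):
    """Return the URLs of matching pages, most relevant first."""
    scored = [(relevancy_score(query, p["title"], p["body"]), p["url"]) for p in pages]
    matches = [entry for entry in scored if entry[0] > 0]
    return [url for _, url in sorted(matches, key=lambda entry: entry[0], reverse=True)]


# Hypothetical pages in the index.
pages = [
    {"url": "http://example.com/audio", "title": "Audio parts",
     "body": "We sell amplifiers and speakers."},
    {"url": "http://example.com/signetics-amps", "title": "Signetics Amplifiers",
     "body": "Data sheets for Signetics amplifiers in stock."},
    {"url": "http://example.com/jobs", "title": "Job listings",
     "body": "Openings for engineers."},
]
results = rank("signetics amplifiers", pages)
```

With these assumed weights the page that uses both key words in its title and repeats the phrase in its body comes first, and the page that matches neither word is left out.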
In the case of directories, human catalogers create the entries. They contain the descriptive information provided during the submission process (typically a title, description and category). They do not contain any of the elements of web pages that search engine spiders index, such as META tags, keyword density, etc., and the individual cataloger subjectively sets the rank. We don’t have a great deal of control in our placement within directories. The best that we can do is to write an appealing description and submit to as many directories, under as many categories as possible.
With search engines, the decision to include or exclude a specific web page is made by the search engine’s spider. Every submitted web page is automatically accepted by the spider, provided it does not employ any obviously underhanded tricks to artificially boost its rating. Pages that use known rate-boosting techniques are either excluded from the index entirely or penalized later by the ranking algorithm. One common rate-boosting technique is to repeat keywords in the meta tag, in the title, or in comments embedded at the start of the body of the page. Another technique is to populate the empty spaces in the page with invisible keywords that are the same color as the page’s background. Because of these practices, pages can get penalized for having invisible text or titles over 20 words, even if none of the words in the title repeat.
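A page can be checked for some of these warning signs before it is submitted. The sketch below is only a rough self-check: the 20-word title limit comes from the text above, while the repeat threshold of 3 and the function name spam_warnings are assumptions, and no search engine publishes its exact rules.

```python
import re
from collections import Counter

MAX_TITLE_WORDS = 20     # limit mentioned in the text above
MAX_KEYWORD_REPEATS = 3  # assumed threshold; real engines do not publish theirs


def spam_warnings(title, meta_keywords):
    """Return a list of warnings for traits that may be treated as spamming."""
    warnings = []
    if len(title.split()) > MAX_TITLE_WORDS:
        warnings.append("title longer than 20 words")
    counts = Counter(re.findall(r"[a-z0-9]+", meta_keywords.lower()))
    repeated = sorted(word for word, n in counts.items() if n > MAX_KEYWORD_REPEATS)
    if repeated:
        warnings.append("keywords repeated too often: " + ", ".join(repeated))
    return warnings


clean = spam_warnings("Welcome to ElectronicsNow!", "Semiconductors, IC, microprocessors")
suspect = spam_warnings("amplifiers " * 21, "amp, amp, amp, amp, amp")
```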
Each search engine has its own rating algorithm, and no single technique will raise a page in the search lists for all search engines. Making sure that keywords appear early in the web page’s text or title can help. Keywords that appear more than once in the document text can also help, but only as long as they don’t appear too often. How often other pages link to a page can be a factor. The proximity of search terms to one another within the early part of the page text can be another factor. This means it’s best to actually use the keywords you choose for the meta tag in the first two paragraphs of the page text.
Meta tags are what many web designers mistakenly assume are the "secret" to propelling their web pages to the top of the rankings. HotBot and Infoseek do give a boost to pages with keywords in their meta tags. But Lycos doesn't read them at all, and there are plenty of examples where pages without meta tags still get highly ranked in common searches. AOL Search, Google, Lycos, and Northern Light are some of the search engines that do not recognize the meta tag.
An example of well-formed meta tags might look like this:
<HTML>
<HEAD>
<TITLE>Welcome to ElectronicsNow!</TITLE>
<META NAME="Keywords" CONTENT="Semiconductors, IC, microprocessors,
    active components, microcontrollers, integrated circuit, design,
    analog, digital circuits, components, engineer, engineering,
    hardware, electronics, reference designs, wireless,
    telecommunications, telecom">
<META NAME="Description" CONTENT="ElectronicsNow!
    Where electronics engineers find, evaluate, and
    compare the latest in technology. We carry everything for
    the electronics engineer, from IC's to potato chips! Order online
    without leaving your workstation. Office delivery straight to you.">
</HEAD>
Doorways And Page Cloaking
When a web site is spidered, every anchor and in some cases, every frame link is followed to every page in the site. Each page encountered is added to the catalogue database, which may not be a desirable outcome. Some pages may not be worth cataloging and will just clutter the search list by competing with the site’s main pages for search rank, pushing important pages down the search list. More important still, when pages such as job listings and articles are deleted, they will become dead links in the search databases. This can confuse and annoy potential customers.
One way to solve this problem is to create a doorway page. Doorway pages are designed just for search engine consumption. In a way they are like your site’s calling card. They contain the site’s meta tags, title, and descriptive text. They also contain a redirect to the real site. Without anchors to follow, the search engines stop there, leaving the rest of the site untouched and cataloging only the data left for them in the doorway page.
Unfortunately, in order to accomplish this the doorway page must contain a redirect. This can take the form of either a meta redirect or a JavaScript redirect. Because of spamming problems, some search engines will not accept pages with meta redirects, and JavaScript may not be an acceptable alternative if you want to support older browsers.
Another solution is to cloak the main page. Spiders are essentially modified browsers, each reporting its own unique browser type that identifies the spider and who sent it. When the site’s main page encounters a spider from a search engine on our list of recognized search engines, the page outputs the same information as the doorway page instead of the usual main page. This is functionally the same as a doorway page, but avoids the uncomfortable meta tag or JavaScript redirect.
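The selection logic behind cloaking can be sketched as below. The agent-name fragments are taken from the list in Appendix B; the function names is_known_spider and select_page, and the idea of passing the two page bodies in as strings, are assumptions made for this illustration rather than a description of any particular web server.

```python
# Lowercased fragments of the spider agent names listed in Appendix B.
KNOWN_SPIDER_AGENTS = (
    "scooter",
    "architextspider",
    "slurp",
    "infoseek sidewinder",
    "lycos_spider",
    "gulliver",
)


def is_known_spider(user_agent):
    """Return True if the browser type string matches a recognized spider."""
    agent = user_agent.lower()
    return any(name in agent for name in KNOWN_SPIDER_AGENTS)


def select_page(user_agent, doorway_html, main_html):
    """Serve the doorway content to recognized spiders, the main page to everyone else."""
    return doorway_html if is_known_spider(user_agent) else main_html


spider_view = select_page("Scooter/2.0 G.R.A.B. X2.0", "DOORWAY", "MAIN")
visitor_view = select_page("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)", "DOORWAY", "MAIN")
```

Some spiders report an ordinary browser type (Appendix B lists the Infoseek instant spider as Mozilla/3.01), so a check on the agent name alone would not catch them; those would have to be recognized by host name instead.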
Not all spiders will be on the list of recognized search engines. Because of this, the first page in the site will still need to display the keywords and description meta tags. To exclude unrecognized spiders from the rest of the site, the following meta tag will need to be added to the rest of the site’s pages:
<META NAME="ROBOTS" CONTENT="NOINDEX">
Once the site is protected from the attention of search engines, it becomes possible to use more advanced features such as frames and database-driven active content without fear of confusing the search engines.
Directory Dependencies
It’s not practical, and possibly not even desirable, to try to maintain entries in all the search engines that are available on the Web. There are thousands of them worldwide, but by signing up with a few key ones we can gain entries in many more.
Many search engines make use of other search engines' databases to complement their own collections. Some are completely dependent, having no databases of their own. For instance, MSN Search uses the LookSmart search engine exclusively for its searches and instead provides extended search aids and capabilities to attract its customers. LookSmart itself depends on Open Directory for additional data when its own database falls short. Yahoo uses Google’s search engine to extend its own capabilities, sometimes to the exclusion of its own search engine. In many cases we can gain entries in other search engines by maintaining our entries with their root providers.
The diagram below shows the 24 most popular search engines as of June 1st, 2000. The databases shaded in dark gray show an absolute minimal database registration set. It should be possible to find our site in the other, unshaded search engines through this set, although we would probably end up very low in the ranking, since search engines tend to rank their own data above data from outside databases. A more realistic set to maintain would include the light gray sites too.
These would be the ten basic search engines and would give any site good visibility in the top search engine sites.
The directory list is shorter, just seven entries: Ask Jeeves, AOL Search, Excite, LookSmart, Lycos, Snap, and Yahoo. Since these entries are static, once you have signed up for them you can leave them alone until you change the structure of your site, at which point you will have to resubmit them. If you choose to set up doorways as described above, you will never have to resubmit them.
Sign Up And Maintenance
Some search engines, like Inktomi, will not allow you to sign up. They acquire their entries from other channels. Inktomi gets its entries from its client sites. Other search engines will send your sign-up submission not just to their own search engine, but to others as well. Open Directory, for instance, forwards submissions on to Google and FastSearch as well as to its own directory.
Presenting A Web Presence
From the client's viewpoint, page cloaking is by far the most universally acceptable method of presenting a site to the world’s search engines. It imposes no limitations on the clients that view your web pages, such as requiring JavaScript support or the use of possibly unacceptable meta tags. Unlike doorway pages, it is also transparent. Doorway pages can be visible to users under conditions of heavy web server load or over slow user connections. You may have seen doorway pages on web sites in the past. Usually they appear briefly as blank pages with a message saying something like “click here if the site doesn’t appear”. Cloaked pages tailor their content to the viewer. They output one page for search engine spiders and another for regular visitors. Visitors see the opening page of the site without delays.
Appendix A: Search Engine Inventory
The list below is a summary of the top 24 search engines based on popularity as of June 1st, 2000. These are probably the search engines and directories that you should concentrate your effort on for American content.
http://search.aol.com/
AOL Search allows its members to search across the web and AOL's own content from one place. The "external" version, listed above, does not list AOL content. The main listings for categories and web sites come from the Open Directory (see below). Inktomi (see below) also provides crawler-based results, as backup to the directory information. Before the launch of AOL Search in October 1999, the AOL search service was Excite-powered AOL NetFind.
http://www.altavista.com/
AltaVista is consistently one of the largest search engines on the web, in terms of pages indexed. Its comprehensive coverage and wide range of power searching commands make it a particular favorite among researchers. It also offers a number of features designed to appeal to basic users, such as "Ask AltaVista" results, which come from Ask Jeeves (see below), and directory listings from the Open Directory and LookSmart. AltaVista opened in December 1995. It was owned by Digital, then run by Compaq (which purchased Digital in 1998), then spun off into a separate company, which is now controlled by CMGI. AltaVista also operates the Raging Search service, below.
http://www.askjeeves.com/
Ask Jeeves is a human-powered search service that aims to direct you to the exact page that answers your question. If it fails to find a match within its own database, then it will provide matching web pages from various search engines. The service went into beta in mid-April 1997 and opened fully on June 1, 1997. Some results from Ask Jeeves also appear within AltaVista.
http://www.directhit.com/
Direct Hit measures what people click on in the search results presented at its own site and at its partner sites, such as HotBot. Sites that get clicked on more than others rise higher in Direct Hit's rankings. Thus, the service dubs itself a "popularity engine." Aside from running its own web site, Direct Hit provides the main results that appear at HotBot (see below) and is available as an option to searchers at MSN Search. Direct Hit is owned by Ask Jeeves (above). See the Using Direct Hit Results page to learn more about Direct Hit.
http://www.excite.com/
Excite is one of the more popular search services on the web. It offers a fairly large index and integrates non-web material such as company information and sports scores into its results, when appropriate. Excite was launched in late 1995. It grew quickly in prominence and consumed two of its competitors, Magellan in July 1996, and WebCrawler in November 1996. These continue to run as separate services.
http://www.alltheweb.com/
Formerly called All The Web, FAST Search aims to index the entire web. It was the first search engine to break the 200 million web page index milestone and consistently has one of the largest indexes of the web. The Norwegian company behind FAST Search also powers some of the results that appear at Lycos (see below). FAST Search launched in May 1999.
http://www.go.com/
Go is a portal site produced by Infoseek and Disney. It offers portal features such as personalization and free e-mail, plus the search capabilities of the former Infoseek search service, which has now been folded into Go. Searchers will find that Go consistently provides quality results in response to many general and broad searches, thanks to its ESP search algorithm. It also has an impressive human-compiled directory of web sites. Go officially launched in January 1999. It is not related to GoTo, below. The former Infoseek service launched in early 1995.
http://www.goto.com/
Unlike the other major search engines, GoTo sells its main listings. Companies can pay money to be placed higher in the search results, which GoTo feels improves relevancy. Non-paid results come from Inktomi. GoTo launched in 1997 and incorporated the former University of Colorado-based World Wide Web Worm. In February 1998, it shifted to its current pay-for-placement model and soon after replaced the WWW Worm with Inktomi for its non-paid listings. GoTo is not related to Go (Infoseek).
http://www.google.com/
Google is a search engine that makes heavy use of link popularity as a primary way to rank web sites. This can be especially helpful in finding good sites in response to general searches such as "cars" and "travel," because users across the web have in essence voted for good sites by linking to them. The system works so well that Google has gained wide-spread praise for its high relevancy. Google also has a huge index of the web and provides some results to Yahoo and Netscape Search.
http://www.hotbot.com/
HotBot is a favorite among researchers due to its many power searching features. In most cases, HotBot's first page of results comes from the Direct Hit service (see above), and then secondary results come from the Inktomi search engine, which is also used by other services. It gets its directory information from the Open Directory project (see below). HotBot launched in May 1996 as Wired Digital's entry into the search engine market. Lycos purchased Wired Digital in October 1998 and continues to run HotBot as a separate search service.
http://www.iwon.com
Backed by US television network CBS, iWon has a directory of web sites generated automatically by Inktomi, which also provides its more traditional crawler-based results. iWon gives away daily, weekly and monthly prizes in a marketing model unique among the major services. It launched in Fall 1999.
http://www.inktomi.com/
Originally, there was an Inktomi search engine at UC Berkeley. The creators then formed their own company with the same name and created a new Inktomi index, which was first used to power HotBot. Now the Inktomi index also powers several other services. All of them tap into the same index, though results may be slightly different. This is because Inktomi provides ways for its partners to use a common index yet distinguish themselves. There is no way to query the Inktomi index directly, as it is only made available through Inktomi's partners with whatever filters and ranking tweaks they may apply.
http://www.looksmart.com/
LookSmart is a human-compiled directory of web sites. In addition to being a stand-alone service, LookSmart provides directory results to MSN Search, Excite and many other partners. Inktomi provides LookSmart with search results when a search fails to find a match from among LookSmart's reviews. LookSmart launched independently in October 1996, was backed by Reader's Digest for about a year, and then company executives bought back control of the service.
http://www.lycos.com/
Lycos started out as a search engine, depending on listings that came from spidering the web. In April 1999, it shifted to a directory model similar to Yahoo. Its main listings come from the Open Directory project, and then secondary results come from the FAST Search engine. Some Direct Hit results are also used. In October 1998, Lycos acquired the competing HotBot search service, which continues to be run separately.
http://search.msn.com/
Microsoft's MSN Search service is a LookSmart-powered directory of web sites, with secondary results that come from Inktomi. RealNames and Direct Hit data is also made available. MSN Search also offers a unique way for Internet Explorer 5 users to save past searches.
http://search.netscape.com/
Netscape Search's results come primarily from the Open Directory and Netscape's own "Smart Browsing" database, which does an excellent job of listing "official" web sites. Secondary results come from Google. At the Netscape Netcenter portal site, other search engines are also featured.
http://www.northernlight.com/
Northern Light is another favorite search engine among researchers. It features a large index of the web, along with the ability to cluster documents by topic. Northern Light also has a set of "special collection" documents that are not readily accessible to search engine spiders. There are documents from thousands of sources, including newswires, magazines and databases. Searching these documents is free, but there is a charge of up to $4 to view them. There is no charge to view documents on the public web -- only for those within the special collection. Northern Light opened to general use in August 1997.
http://dmoz.org/
The Open Directory uses volunteer editors to catalog the web. Formerly known as NewHoo, it was launched in June 1998. It was acquired by Netscape in November 1998, and the company pledged that anyone would be able to use information from the directory through an open license arrangement. Netscape itself was the first licensee. Lycos and AOL Search also make heavy use of Open Directory data, while AltaVista and HotBot prominently feature Open Directory categories within their results pages.
http://www.raging.com/
Operated by AltaVista, Raging Search uses the same core index as AltaVista and virtually the same ranking algorithms. Why use it? AltaVista offers it for those who want fast search results, with no portal features getting in the way.
http://www.realnames.com/
The RealNames system is meant to be an easier-to-use alternative to the current web site addressing system. Those with RealNames-enabled browsers can enter a word like "Nike" to reach the Nike web site. To date, RealNames has had its biggest success through search engine partnerships. See the Using RealNames Links page for more information about RealNames.
http://www.snap.com/
Snap is a human-compiled directory of web sites, supplemented by search results from Inktomi. Like LookSmart, it aims to challenge Yahoo as the champion of categorizing the web. Snap launched in late 1997 and is backed by Cnet and NBC.
http://www.webcrawler.com/
WebCrawler has the smallest index of any major search engine on the web -- think of it as Excite Lite. The small index means WebCrawler is not the place to go when seeking obscure or unusual material. However, some people may feel that by having indexed fewer pages, WebCrawler provides less overwhelming results in response to general searches. WebCrawler opened to the public on April 20, 1994. It was started as a research project at the University of Washington. America Online purchased it in March 1995, and it was the online service's preferred search engine until Nov. 1996. That was when Excite, a WebCrawler competitor, acquired the service. Excite continues to run WebCrawler as an independent search engine.
http://www.yahoo.com/
Yahoo is the web's most popular search service and has a well-deserved reputation for helping people find information easily. The secret to Yahoo's success is human beings. It is the largest human-compiled guide to the web, employing about 150 editors in an effort to categorize the web. Yahoo has over 1 million sites listed. Yahoo also supplements its results with those from Google (beginning in July 2000, when Google takes over from Inktomi). If a search fails to find a match within Yahoo's own listings, then matches from Google are displayed. Google matches also appear after all Yahoo matches have first been shown. Yahoo is the oldest major web site directory, having launched in late 1994.
http://www.webtop.com/
WebTop is a crawler-based search engine that claims an extremely large index. In addition to listing web pages, WebTop also provides information from news sources, company information and WAP-related content in its search results. The company also offers the WebCheck tool (formerly called k-check), which is an Alexa-like search and discovery tool. WebTop is backed by Bright Station, the company that acquired some search technology and other resources from the former Dialog Corporation. The Dialog search service itself is now owned by a different company, the Thomson Corporation.
Appendix B: Search Engine Agent And Host Names
| Search Engine | Agent Names | Host Names |
|---|---|---|
| AltaVista (normal spider) | Scooter/2.0 G.R.A.B. X2.0; Scooter/1.0 scooter@pa.dec.com | scooter.pa-x.dec.com; scooter*.av.pa-x.dec.com such as: scooter3.av.pa-x.dec.com |
| AltaVista (instant spider) | Scooter/1.0 | add-url.altavista.digital.com; ww2.altavista.digital.com |
| Euroseek | Arachnoidea (arachnoidea@euroseek.com) | *.euroseek.net such as: infra.euroseek.net |
| Excite (mega spider) | ArchitextSpider | crawl*.atext.com such as: crawl2.atext.com |
| Excite (fresh spider) | ArchitextSpider | crimpshrine.atext.com |
| Fireball (German search engine) | KIT-Fireball/2.0 | heavymetal.fireball.de |
| Google (Experimental search engine) | BackRub/2.1 backrub@google.stanford.edu http://google.stanford.edu/ | *.stanford.edu such as: hake.stanford.edu |
| Inktomi (powers HotBot, others) | Slurp/2.0 (slurp@inktomi.com; http://www.inktomi.com/slurp.html) | *.inktomi.com such as: j2001.inktomi.com or j10.inktomi.com |
| Infoseek (normal spider) | InfoSeek Sidewinder/0.9 | *.infoseek.com such as: wilbur-bbn.infoseek.com or IP number such as: 204.162.98.90 |
| Infoseek (instant spider) | Mozilla/3.01 (Win95; I) | as above |
| Lycos (regular spider) | Lycos_Spider_(T-Rex) | lycosidae.lycos.com or *.pgh.lycos.com such as: spider3.srv.pgh.lycos.com |
| Lycos (Add URL spider) | Lycos_Spider_(T-Rex) | *.sjc.lycos.com such as: sjc-fe4-1.sjc.lycos.com |
| Northern Light | Gulliver/1.2 | taz.northernlight.com |
| WebCrawler | Served by Excite spiders | Served by Excite spiders |
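
The host name patterns in the table above could be used to recognize a visiting spider, roughly as in the sketch below. The dictionary layout and the function name identify_spider_host are assumptions for illustration, only some of the table's rows are included, and the lookup of a visitor's host name from its IP address is left out.

```python
from fnmatch import fnmatchcase

# A subset of the host name patterns from the table above ("*" matches any text).
SPIDER_HOST_PATTERNS = {
    "AltaVista": ["scooter.pa-x.dec.com", "scooter*.av.pa-x.dec.com"],
    "Excite": ["crawl*.atext.com", "crimpshrine.atext.com"],
    "Inktomi": ["*.inktomi.com"],
    "Infoseek": ["*.infoseek.com"],
    "Lycos": ["lycosidae.lycos.com", "*.pgh.lycos.com", "*.sjc.lycos.com"],
    "Northern Light": ["taz.northernlight.com"],
}


def identify_spider_host(host_name):
    """Return the search engine whose host pattern matches, or None."""
    host = host_name.lower()
    for engine, patterns in SPIDER_HOST_PATTERNS.items():
        if any(fnmatchcase(host, pattern) for pattern in patterns):
            return engine
    return None


altavista = identify_spider_host("scooter3.av.pa-x.dec.com")
unknown = identify_spider_host("www.example.com")
```

Host names change as search engines reorganize their networks, so a list like this would need to be reviewed regularly.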