How to Mine Information on the Web

Any serious discussion of how to use the web effectively will inevitably center on search tools. Quite simply, one's success in finding information on the web is closely related to one's understanding of search methods. This article discusses search techniques that we think will lead to more successful searches.

Of course, in a commercial vein, many readers of this article might be interested in how to get the search tools to find them. We can help with that, too! Just contact us.

Five Classes of Search Tools

Search tools may be grouped into the five categories that follow. Of these, the two most commonly used are search engines and subject directories. In the past these two approaches tended to be separate and distinct, but as the web has evolved, some directories have begun to supplement their offerings with search engine results, and vice versa.

Search Engines are large databases built by robots (automated programs) that seek out web pages and index the words in those documents. They index text from the sites visited, sometimes crawling to pages deep within a site (a practice known as deep crawling). The user may search the engine by keyword, but there are no subject categories to browse. The size of the search database varies considerably from engine to engine. Each engine has its own characteristics, both in how it searches and in how it ranks and indexes sites. Examples are Google (our favorite), Alta Vista, Northern Light, Infoseek, Fast Search, Hotbot, Lycos, and Excite.
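
To make the idea of "indexing the words on those documents" concrete, here is a minimal sketch in Python of a word index and a keyword search over it. The page URLs and text are invented for illustration, and the all-keywords-must-match rule is an assumption; real engines crawl and rank in ways that differ from engine to engine and are not described here.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for documents a robot has already fetched.
PAGES = {
    "http://example.com/a": "Search engines index the words on web pages",
    "http://example.com/b": "Subject directories are organized by human editors",
    "http://example.com/c": "Robots crawl web pages and follow links",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return, in sorted order, the URLs that contain every keyword."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


index = build_index(PAGES)
print(search(index, "web pages"))
```

Note that the search looks only at words, never at subjects, which is why an engine of this kind offers keyword queries but nothing to browse.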

Subject directories are categorizations of web pages built and evaluated by people employed by the search directory company. The user is able to browse by subject, and usually can 'drill down' to more specific information. They are hierarchical collections organized and sometimes annotated by editors. They do not provide full search capabilities, so the subject of a site is the only criterion for finding it. Examples are Yahoo!, Librarians' Index, Infomine, Britannica's Internet Guide, Galaxy, Scout Report Signpost, Looksmart, Infoseek, and Excite. Most university libraries maintain subject directories.
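
The hierarchical, browse-by-subject structure described above can be pictured as a nested set of headings. The following Python sketch is illustrative only: the categories, the URLs, and the function names drill_down and subcategories are all hypothetical and do not reflect how any particular directory is built.

```python
# Hypothetical directory: categories nest, and leaves hold hand-picked sites.
DIRECTORY = {
    "Science": {
        "Biology": {
            "Genetics": ["http://example.org/genetics-primer"],
        },
        "Physics": ["http://example.org/physics-faq"],
    },
    "Reference": {
        "Libraries": ["http://example.org/library-catalogs"],
    },
}


def drill_down(directory, path):
    """Follow a list of category names and return what is filed there."""
    node = directory
    for category in path:
        if not isinstance(node, dict) or category not in node:
            raise KeyError(f"no category {category!r} under this heading")
        node = node[category]
    return node


def subcategories(directory, path):
    """List the headings a user could pick next, or [] at a list of sites."""
    node = drill_down(directory, path)
    return sorted(node) if isinstance(node, dict) else []


print(subcategories(DIRECTORY, ["Science"]))
print(drill_down(DIRECTORY, ["Science", "Biology", "Genetics"]))
```

A site can be reached here only by walking the headings an editor filed it under, which mirrors the point that the subject is the only criterion.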

Directories of searchable databases are also organized into subject categories by humans. They are typically collections of specialized databases which store elements of the 'invisible' web (not normally indexed by search engines). They are sources of information for in-depth studies of data not typically available from the traditional channels of search. Examples are Search.com, Beaucoup!, and Mama.

Directories of Gateway Pages are collections of web pages dedicated to a specific subject, usually compiled by an expert in that field. These are web pages that contain hyperlinks to other sources on a given subject. See WWW Virtual Library, Guide to Guides, and Argus Clearinghouse.

MetaSearch Engines are search engines of search engines. Their results are usually not as targeted as those of a well-planned search on the appropriate search tool. They make fast but superficial searches, and the results are returned in a convenient format. However, they return only about 10% of the results from the engines they draw on. They may be useful as a first pass. Examples include Metacrawler, Inference Find, Dogpile, and Metafind.
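
As a rough sketch of why metasearch results are fast but shallow, the Python below sends one query to several engines, keeps only the top few hits from each, and merges them without duplicates. The two engine functions are stand-ins that return fixed, invented lists, and the merge rule (best position wins, ties broken by URL) is an assumption rather than the method of any named metasearch engine.

```python
def engine_a(query):
    """Stand-in for a real engine: returns a fixed, ranked list of URLs."""
    return ["http://example.com/1", "http://example.com/2", "http://example.com/3"]


def engine_b(query):
    """A second stand-in engine with partly overlapping results."""
    return ["http://example.com/2", "http://example.com/4", "http://example.com/5"]


def metasearch(engines, query, per_engine=2):
    """Query every engine, keep its top few hits, and merge without duplicates."""
    best_rank = {}
    for engine in engines:
        # Only the first per_engine results of each engine are ever examined.
        for rank, url in enumerate(engine(query)[:per_engine]):
            if url not in best_rank or rank < best_rank[url]:
                best_rank[url] = rank
    # Order by the best position any engine gave the page, then by URL.
    return sorted(best_rank, key=lambda url: (best_rank[url], url))


print(metasearch([engine_a, engine_b], "search tools"))
```

Anything ranked below the per_engine cutoff by every engine never appears in the merged list, which is the trade-off the paragraph above describes.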