The Internet is a vast and overwhelming collection of information on
any subject that can be imagined. To provide structure to this huge amount
of information, search engines allow users to search for specific pieces of
information. Search engines such as Google and Yahoo are technically known as
information retrieval (IR) systems (Liddy, 2001). These search engines work
from prebuilt indexes, which are matched against
queries entered by users. Indexes are built from the words in documents,
together with pointers to where those words occur. An IR system that builds
and uses such an index is structured around four elements: a document
processor, a query processor, a search and matching function, and a ranking
capability (Liddy, 2001).
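
To make the idea of an index matched against queries concrete, the following is a minimal sketch in Python, not a description of how any particular search engine is implemented. The function names `build_index` and `search`, the sample documents, and the scoring rule (a simple count of query-word occurrences) are all assumptions chosen for illustration.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the documents and positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)  # pointer into the document
    return index

def search(index, query):
    """Return ids of documents containing every query word, best match first."""
    words = query.lower().split()
    if not words:
        return []
    # Matching: keep only documents that contain all query words.
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    # Ranking (assumed rule): total number of query-word occurrences.
    def score(doc_id):
        return sum(len(index[word][doc_id]) for word in words)
    return sorted(candidates, key=lambda d: (-score(d), d))

# Hypothetical sample collection.
documents = {
    "d1": "search engines build an index",
    "d2": "an index maps words to documents",
    "d3": "index index search",
}
index = build_index(documents)
print(search(index, "index search"))  # prints ['d3', 'd1']
```

Here the index is built once, ahead of time, and each query is answered by looking words up in it rather than by scanning the documents themselves.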
The document processor prepares, processes, and inputs the documents
that users search against (Liddy, 2001). Several
functions are inherent in this process, including normalizing the document
stream, breaking it into retrievable units, metatagging subdocument pieces,
and identifying indexable elements. The first three functions are known
as pre-processing; their main aim is to standardize multiple input formats.
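
The three pre-processing functions can be sketched roughly as follows. This is an illustrative simplification under stated assumptions: the input is assumed to be HTML-like text, paragraphs are assumed to be the retrievable units, and the helper names `normalize`, `split_into_units`, and `tag_units` are hypothetical.

```python
import re

def normalize(raw):
    """Reduce a raw document to plain text in one standard form."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop HTML-style tags (assumed input format)
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    return text.strip()

def split_into_units(text):
    """Break normalized text into retrievable units, here paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def tag_units(doc_id, units):
    """Attach simple metadata (metatags) to each subdocument piece."""
    return [{"doc": doc_id, "unit": i, "text": u} for i, u in enumerate(units)]

# Hypothetical input document.
raw = "<h1>Title</h1>\n\n<p>First   paragraph.</p>\n\n<p>Second paragraph.</p>"
units = tag_units("d1", split_into_units(normalize(raw)))
for unit in units:
    print(unit)
```

Whatever the original format, every document leaves this stage as a list of uniformly structured units, which is what allows the later indexing steps to treat all sources alike.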
The nature and quality of search results are largely determined by the index
identification stage. Quality is also affected by the elimination of stop
words. These are words that contribute little to the meaning of the
content, such as "and", "but", and "of". Deleting these words saves
search time and storage space. Closely related is term
stemming, in which suffixes are removed from words. This reduces
the number of unique words in an index, and again saves storage space. A
disadvantage is that the precision and accuracy of search results may be
negatively affected. A stronger or weaker stemming algorithm can, however,
be chosen in order to regulate precision. Finally, the document
processor extracts i...
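
The stop-word elimination and stemming steps described above can be sketched as follows. This is a deliberately naive illustration, not an established stemming algorithm: the stop-word list, the two suffix lists, and the names `stem` and `index_terms` are assumptions made for the example.

```python
# Assumed, very small stop-word list for illustration only.
STOP_WORDS = {"and", "but", "of", "the", "a", "to", "in"}

# Assumed suffix lists: the weak stemmer strips plurals only,
# the strong stemmer also strips verb endings.
WEAK_SUFFIXES = ("es", "s")
STRONG_SUFFIXES = ("ing", "ed", "es", "s")

def stem(word, strong=False):
    """Strip the first matching suffix, keeping at least three letters."""
    suffixes = STRONG_SUFFIXES if strong else WEAK_SUFFIXES
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text, strong=False):
    """Lowercase the text, drop stop words, and stem what remains."""
    words = text.lower().split()
    return [stem(w, strong) for w in words if w not in STOP_WORDS]

sentence = "the indexing of indexed documents and indexes"
weak = index_terms(sentence)
strong = index_terms(sentence, strong=True)
print(len(set(weak)), weak)      # 4 unique terms
print(len(set(strong)), strong)  # 2 unique terms
```

The strong variant halves the number of unique terms in this sentence, which is the storage saving described above. The cost shows up in cases such as `stem("news")`, which returns `"new"` and so conflates two unrelated words, illustrating how stemming can reduce precision.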