Searching, Sleuthing and Sifting
next? (Search Engines and Web Indexes)
What is a search engine? | What's a 'bot? | What's the index for? | How does a search engine list web sites? | What's the best search engine? | What do I use? | What are ways to make my search more effective? | What are the most popular search utilities? | Specialized Search Engines | For more information
In this lesson we will discuss how search
engines work in general terms, not all possible scenarios (or search algorithms!).
What is a search engine really and how does it work?
What we think of as a search engine is really
a team effort. There are 3 "members" of the team -- a mechanism that
identifies web pages to be included in the database, a mechanism
that indexes the sites, and a searching mechanism with an interface, which
scans for keywords within the index. Users search the index (and hence
the database or web documents) through a query box or a template.
Documents in which the search terms occur are presented as "hits."
Although some facilities incorporate "natural
language" searching (searching by asking a question such as "Where are the doughnuts?"),
most search tools retrieve "hits" or "matches" by seeking occurrences of
your search terms within their databases, attempting to match the terms
(converted to a "string" of data bits) against the index. Because the terms
are converted to a digital string, the search engine must somehow be instructed
to include plurals and alternate forms of a term. Although some search
tools automatically include plurals, many do not. If you are interested
in "dogs," search for "dog or dogs" or use a wildcard such
as * (a wildcard is a typed symbol which simply
means "put any character here"). Some search engines allow "stemming."
This involves using a special character symbol which simply means
"put any ending here after this point." An example: the term comput&
(where & is the stemming symbol) would bring up hits for the following words:
computers, computing, computation, etc.
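To make the wildcard and stemming ideas concrete, here is a minimal Python sketch. It is not how any particular search engine works. It assumes that * stands for at most one character and that & stands for any ending, and the names term_to_regex and expand_term, along with the small word list, are made up for illustration.

```python
import re

def term_to_regex(term):
    """Translate a search term into a regular expression.

    Assumptions for this sketch only:
      *  = wildcard, "put any character here" (zero or one character)
      &  = stemming symbol, "put any ending here after this point"
    """
    pieces = []
    for ch in term:
        if ch == "*":
            pieces.append(".?")
        elif ch == "&":
            pieces.append(".*")
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces), re.IGNORECASE)

def expand_term(term, vocabulary):
    """Return every indexed word that the term would match, in sorted order."""
    pattern = term_to_regex(term)
    return sorted(word for word in vocabulary if pattern.fullmatch(word))

# A made-up list of words standing in for the search engine's index.
vocabulary = ["compass", "computation", "computers", "computing",
              "dog", "dogma", "dogs"]

print(expand_term("comput&", vocabulary))  # stemming: any ending allowed
print(expand_term("dog*", vocabulary))     # wildcard: at most one extra character
```

With these assumptions, comput& picks up computation, computers and computing, while dog* picks up dog and dogs but not dogma. A real search tool may define its symbols differently, so check its help page.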
What's a 'bot?
A 'bot, otherwise known as an intelligent agent,
spider, crawler, robot, or worm, is an automated device (software) which
may be programmed to search for terms (data "strings") matching certain
criteria. In terms of web search engines, a 'bot identifies and notes the
URLs of web pages to be included in the database. Later, another 'bot
comes along and works on the interiors of the web documents, recording
occurrences of words and their position within the text. This information
is used to create a huge index. 'Bots travel along the links of a web site,
that is, they crawl or traverse from one hypertext link to another.
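The link-following behaviour can be sketched in a few lines of Python. This is only an illustration: the "web" here is a made-up dictionary (TOY_WEB) rather than real pages, the example.com addresses are placeholders, and the breadth-first order is one assumed strategy among several a real 'bot might use.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def toy_links(url):
    """Stand-in for downloading a page and extracting its hypertext links."""
    return TOY_WEB.get(url, [])

def crawl(start_url, get_links, max_pages=100):
    """Follow links breadth-first and return the URLs in the order they were noted."""
    noted = []
    seen = {start_url}          # never queue the same URL twice
    queue = deque([start_url])
    while queue and len(noted) < max_pages:
        url = queue.popleft()
        noted.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return noted

print(crawl("http://example.com/", toy_links))
```

The seen set is what keeps the 'bot from going round in circles when pages link back to each other.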
What's the index for?
The index is how the search engine locates the
URLs which match your request. The web documents containing the query
keywords are presented as a listing which may include a brief summary of
the site. A simple way to understand the index is to think of it as a computerized
book index. To discover where a topic occurs in a book, we would look up
the word in the index which would indicate the page number(s) where the
term occurs. Now imagine that every single word is included in the book
index. A computerized version might be represented like this:
Keyword | Number of times keyword occurs in book | Location in book of keyword | In summary
apple | | title page; page 1: first paragraph word #5; page 2: first paragraph word #20, second paragraph word #15; page 5: 2nd paragraph word #21; etc. | title page, table of contents, pages 1, 2, 5, etc., 12, 25
orange | | table of contents; page 3: first paragraph word #3; page 17: first paragraph word #30; page 21, etc. | table of contents, pages 3, 17, 21, etc.
grape | | page 50: 2nd paragraph word #18; page 52: 1st paragraph word #41; page 53: 1st paragraph word | pages 50, 52, 53
Some immediate observations:
a) the word apple occurs a lot in the database
b) the word apple occurs in the title
c) the words apple and orange occur in the table of contents
d) the word grape does not occur in the title
or table of contents.
A search engine uses its index to retrieve web
documents in which your search terms occur. The index lists the term and
where it occurs (the URL or address of the web page) much like a book index.
A search engine returns hits
only from its own database, that is,
web pages which it has indexed. So if the site you are looking for has
not yet been indexed, it won't be in the results listing no matter how
magnificent your search strategy or statement.
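The book-index analogy can be sketched as a small "inverted index" in Python: for every word, record which pages contain it and at which word positions. The two pages and their example.com addresses are invented for illustration, and build_index and lookup are hypothetical names, not part of any real search engine.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to {url: [positions of the word within that page]}."""
    index = defaultdict(dict)
    for url, text in documents.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].setdefault(url, []).append(position)
    return index

def lookup(index, word):
    """Return the pages (and word positions) where the word occurs."""
    return index.get(word.lower(), {})

# Invented sample pages standing in for the search engine's database.
documents = {
    "http://example.com/apples": "Apple pie needs one apple and one orange",
    "http://example.com/grapes": "A grape is not an apple",
}

index = build_index(documents)
print(lookup(index, "apple"))
print(lookup(index, "banana"))  # never indexed, so there is nothing to return
```

The lookup for banana comes back empty for the same reason an unindexed site never appears in your results: the engine can only answer from what is already in its index.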
How does a search engine decide how to list web sites matching my search terms?
Each search engine uses a different algorithm
or method to calculate the "relevance" of each hit, which it then ranks.
Have you ever noticed the numbers which sometimes appear next to the URLs
in a listing of search results? This is the "relevance ranking." Relevance
means the probability that the "hit" or "match" is on-target with your
query. The creators of search engines change the way they calculate
relevance and do not tell us mere users their methodology; being high in
the major search engines' rankings on a topic means big business.
Sometimes Web site owners try to skew the
odds of appearing on the first page of "results" for folks searching specific
keywords. Being on the first page or in the top results increases the likelihood
that the site will be seen and hence selected by the user. Unscrupulous
folks "spam" the search engine to try to improve their rankings (and hence,
their Web-based business) by a variety of methods, including using "invisible
text" (where text is colored the same as the background) or repeatedly using
keywords in "meta-tags" (descriptive information not usually seen by the
user except when viewing the "page source" -- see below).
Perhaps most unsettling is the rising trend
of some search engines which, in effect, sell higher ratings to companies
willing to pay for the privilege. Most users will be unaware that the
set of search results has in effect been manipulated to boost these companies' rankings.
Exactly how relevance
is calculated is protected, proprietary information, but it is important
to be aware that search engine providers may have alliances or agreements
with other businesses (reciprocal and/or financial) which may affect
the results.
In general however, relevance is calculated
by noting where the term occurs within the text and assigning this position
a "weight" or level of importance. Some search utilities also include a
popularity element in calculating relevance; that is, the
more a site is linked to or used, the higher the rating. Search terms
occurring in the title, summary, in key positions within a paragraph or
appearing several times within a paragraph usually carry more "weight"
because there is a higher probability that terms in these positions indicate
significant material on the topic.
This is very similar to our book index example
above; because the term apple occurs many times and in key
positions (title, table of contents, beginning of paragraphs) there is
a high probability that the document contains significant information about
apples. Note that orange also occurs in the table of contents, an
indication of the term's relative importance (it is a significant topic,
but not as important as apple). The algorithm of the search
engine and the methodology it uses to calculate relevance emulate the observations
and judgments we make based on our experience. A search engine will
return the book in our index example as a hit when grape is requested
together with one of the other terms, whereas a human might judge that
although the two terms occur within the document, there is no significant
connection between them and the document is hence irrelevant.
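Since real relevance formulas are proprietary, the following Python sketch only illustrates the general idea of "weighting by position." The weights in FIELD_WEIGHTS are invented for this example, the two sample pages are made up, and relevance and rank are hypothetical names.

```python
import re

# Assumed weights, chosen only for illustration: a term in the title counts
# five times as much as the same term in the body text.
FIELD_WEIGHTS = {"title": 5.0, "summary": 3.0, "body": 1.0}

def words(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def relevance(document, query_terms):
    """Add up weight x (number of occurrences) for every field and query term."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = words(document.get(field, ""))
        for term in query_terms:
            score += weight * tokens.count(term.lower())
    return score

def rank(documents, query_terms):
    """Return the URLs of matching pages, highest relevance first."""
    scored = [(relevance(doc, query_terms), url) for url, doc in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for score, url in scored if score > 0]

# Invented sample pages.
documents = {
    "http://example.com/orchard": {
        "title": "Apple growing guide",
        "summary": "How to grow an apple tree",
        "body": "Plant the apple tree in spring. Orange trees need more heat.",
    },
    "http://example.com/fruit": {
        "title": "Fruit salad",
        "summary": "A quick recipe",
        "body": "Mix one apple, one orange and a grape.",
    },
}

print(rank(documents, ["apple"]))
print(rank(documents, ["grape"]))
```

Both pages mention apple, but the orchard page mentions it in the title and summary, so it is listed first. Change the weights and the order can change, which is one reason different search engines list the same pages differently.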
Some search engines look only in certain fields
to index documents, such as the title field, the first paragraph and something
called "meta-tags." Meta-tags allow the creator of a web site to
add descriptive keywords which are not displayed in the actual web documents;
they are specifically to enhance retrieval of the document. As people "spam"
the search engine (for example, by repeating terms over and over again)
meta-tags are decreasing in importance because the folks that program the
'bots train them to overlook repetitions and other clues to "spamming."
Because each search engine assigns relevancy rankings differently, if you execute
exactly the same search in several search engines you will have different
results in terms of how and where the URLs are listed (even if the database
contents are identical).
What's the best search engine?
I'm sure I'm going to disappoint a lot of folks
by giving the answer "the best search engine is the one that fits
the task" instead of recommending a particular utility. Until
you have some experience with knowledge seeking tools and, importantly,
with identifying your real information need, it may be difficult to
ascertain which tool is best for your purpose. For example, a query
on "Leonardo da Vinci's Mona Lisa" is likely to be more successful than
"that lady with the smile by a Renaissance artist" (or simply "da Vinci"),
and "dosage and usage guidelines for St. John's Wort" is likely to be more
successful than "St. John's Wort." But the good news is, you will make
better choices with experience.
What do I use? Well, that depends....
Remember I am a librarian in an academic (college)
library, so I never know what the next information request will be (that's
the fun part!). But this means in practical terms that I am looking for
information in a variety of places, which precludes having a standard game
plan. Here are a few of my search tactics/favorite tools:
tips in Lesson 4!
for general use, I use Altavista
(http://www.altavista.com). It's fast, returns good hits and is accurate.
Plus its database is huge (alternates with Hotbot as the largest web database).
Altavista also has a nice refine feature for weeding out irrelevant hits.
for quick queries where I want precision (accuracy)
in results I'll use Google.com (http://www.google.com).
It's fast and uncannily accurate. The "cached" page feature is useful when
the actual Web site is busy and unreachable.
increasingly I find myself using "Ask
Jeeves" (http://ask.com) because the knowledge base gives me some good
ideas. If I need to find out the "literature" (i.e. what's available on
the Web) of a subject or find quick definition, I'll go here first.
for searching by domain (.edu, .com, .gov) I
use Hotbot (http://www.hotbot.com).
I also use Hotbot for field searching since it has a nice template in its
for specific subjects (rather than a specific
query), I might use a specialized directory or search engine particularly
in the Arts, Education and Health.
I tend to stay away from meta-search engines
(which search multiple search engines at once) because they strip away
my Boolean or field commands. I would, however, recommend them for general
searches where advanced searching techniques will not be used. I also use
them for comparing results of a search across several search engines.
if I want to group my hits by related topics
or more "academic" type of research, I use NorthernLight
if I want to use concept searching (find a good
web site and then look for others using the same criteria), I use Excite
or Google (http://www.google.com)
frequently I will change tactics in mid search
-- if I get too many hits, I'll weed a few out. If I do not find anything
relevant, I'll switch to a different source and/or modify my "search statement"
What are simple ways to make my search more effective?
A very effective way to increase the relevance
or precision of "hits" is to search as a phrase. In most cases this simply
means putting quotation marks around the search terms. "Red socks"
is a different search than red socks in most search engines. What
you are actually doing by searching as a phrase is using the concept of
proximity, which concerns the terms' physical closeness to one another.
A document with red and socks occurring close or next to each other
is more likely to be on target than a document with red in the
title and socks buried in the text.
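The difference between a phrase search and a plain keyword search can be sketched in Python as follows. This is only an illustration with two invented sentences; contains_phrase and contains_all_terms are hypothetical names, and real search tools may treat quotation marks differently.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def contains_phrase(text, phrase):
    """True only if the phrase's words appear next to each other, in order."""
    tokens = tokenize(text)
    wanted = tokenize(phrase)
    n = len(wanted)
    return n > 0 and any(tokens[i:i + n] == wanted
                         for i in range(len(tokens) - n + 1))

def contains_all_terms(text, terms):
    """True if every word occurs somewhere in the text, in any position."""
    tokens = set(tokenize(text))
    return all(term in tokens for term in tokenize(terms))

# Invented sample texts.
page_a = "Red socks are on sale today"
page_b = "Red is a fine colour for a hat; socks are sold separately"

print(contains_phrase(page_a, "red socks"))     # words are adjacent
print(contains_phrase(page_b, "red socks"))     # words are far apart
print(contains_all_terms(page_b, "red socks"))  # keyword search still matches
```

The second page matches the keyword search but not the phrase search, which is exactly the kind of off-target hit that quotation marks weed out.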
Another way to increase your search effectiveness
is to be as specific as possible; that is, include as many terms and synonyms
as you can think of to fully describe your topic. Instead of
women and computers
try
(woman or women) and (technology or computer)
and (training or professional development) and (barriers or problems)
Note: search utilities may not support
the use of parentheses (called nesting) in basic searches, although
many support them in their "advanced" searches.
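To see how nesting works, here is a minimal Python sketch that evaluates the nested query above against a page of text. The query is written as nested tuples rather than typed into a query box, purely to keep the example short; matches and normalize are hypothetical names, the two sample pages are invented, and matching is on whole words with no automatic plurals (an assumption, since search tools differ on this).

```python
import re

def normalize(text):
    """Lower-case the text and pad it with spaces so whole words can be matched."""
    return " " + " ".join(re.findall(r"[a-z0-9']+", text.lower())) + " "

def matches(query, page):
    """Evaluate a nested and/or query against normalized page text.

    A query is either a term (a word or a phrase) or a tuple such as
    ("and", subquery, subquery, ...) or ("or", subquery, subquery, ...).
    """
    if isinstance(query, str):
        return normalize(query) in page
    operator, *operands = query
    results = (matches(operand, page) for operand in operands)
    if operator == "and":
        return all(results)
    if operator == "or":
        return any(results)
    raise ValueError(f"unknown operator: {operator!r}")

# (woman or women) and (technology or computer)
# and (training or professional development) and (barriers or problems)
QUERY = ("and",
         ("or", "woman", "women"),
         ("or", "technology", "computer"),
         ("or", "training", "professional development"),
         ("or", "barriers", "problems"))

# Invented sample pages.
page_1 = normalize("Barriers facing women in computer training programs")
page_2 = normalize("Women and computers: a history")

print(matches(QUERY, page_1))
print(matches(QUERY, page_2))
```

The second page fails because it contains computers but neither technology nor computer, a reminder that plurals have to be requested explicitly when the tool does not add them for you.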
So to recap, phrase searching and
specificity are two simple ways to
increase precision in searching.
What are the most popular and useful search utilities? (the "major" search engines)
Ok folks. We are looking
at a sampling of search engines and describing generalities; we are not
attempting to create a definitive listing. For example, we'll be discussing
search engines in Lesson 6, so you won't find them listed here.
There are more "major search engines" for you to evaluate in Assignments
Alta Vista (http://av.com)
Originally developed by Digital Equipment
Corporation, Alta Vista searches the Web and Usenet.
Its very large database supports both simple
and advanced searching, with the ability to limit searches
to select portions of web documents. For example, it is possible to limit
searches to title, domains, images and links within Web documents and by
particular newsgroups or subjects in Usenet. It also offers the ability to browse by
subject (although this is rather slow).
Excite is a search site featuring a very large database
and a lot of "extras" such as: Excite Channels (guide to sites by subject),
stock quotes, news, tv and searching of Newsgroups. Offers concept
searching.
Voted no. 1 among search engines by PC Magazine,
Hotbot offers a sophisticated interface with a vast array of options such
as searching by dates, by certain domains in the U.S. (e.g. .com, .org,
.edu, .gov), and by media type (e.g. image, audio, video). It also has a huge
database, powerful advanced searching options, access to other search tools
by type and a subject guide.
Specialized Search Engines and Collections:
Specialized search engines are most often programmed
to "collect" web documents along a topical theme, for example the Arts,
Science, Health-related topics or even more specialized subjects such as
Ancient History of the Mediterranean.
Also fitting in this category are "search
tools" that really calculate rather than retrieve information (such as
those fitting in the "distance between two points" or "salary differential"
categories). Since it is impossible to list every specific tool here, the following
are sites which group or list subject specific search engines or tools:
Beaucoup is a collection of approximately
1000 search engines, directories and indices from all over the world, organized
into categories such as: General Searchers, Reviewed Sites/What's New,
Software, Reference, Education, Art/Graphics, Social/Environmental/Political
Concerns, and Consumer Medicine. Good starting point for popular subjects.
Search Engine Collections
from the University of Albany by Laura
For more information visit these excellent sites:
This week we are going on an Infoquest! Please
find answers to the following questions using either a subject directory
that we discussed in Lesson 2, or a search engine.
Remember -- there are
many routes to the same information....
Where can I find information about the wreck
of the Titanic? (hint: maritime history)
Where can I find a site telling me about the
Chinese New Year? (hint: seasonal)
Where can I find directions to my house? (hint:
use a map)
What is icq? (hint: Internet term)
Where can I find statistics on new home owners
in Kentucky? (hint: .gov)
Evaluate the following "major" search engines:
Consider the following criteria:
how large is the database?
how frequently is the database updated?
how accurate were your search results?
how easy is the interface to use?
what advanced search facilities are available?
Find a search facility that will help you
find the following types of information. Please include a sample question/reason
a picture of a school bus
someone's email address
the zip code of your best friend
a recipe you saw posted on a newsgroup (not listserv!)
the latest world news
a computerized calendar (software)
Last updated: November 6, 2002; Links checked:
November 6, 2002
Copyright © 1998-2002,
Leroy, Virtuallibrarian, Trainer and Corporate Librarian :)