Web Searching, Sleuthing and Sifting
Lesson Three: What's next? (Search Engines and Web Indexes)

What is a search engine? | What's a 'bot? | What's the index for? | How does a search engine list web sites? | What's the best search engine? | What do I use? | What are ways to make my search more effective? | What are the most popular search utilities? | Specialized Search Engines | For more information | Assignments

In this lesson we will discuss how search engines work in general terms, not all possible scenarios (or search algorithms!).

What is a search engine really and how does it work?

What we think of as a search engine is really a team effort. There are 3 "members" of the team -- a mechanism that identifies web pages to be included in the database, a mechanism that indexes the sites, and a searching mechanism with an interface, which scans for keywords within the index. Users search the index (and hence, the database or web documents) through a query box or a template. Documents in which the search terms occur are presented as "hits."

Although some facilities incorporate "natural language" searching (searching by asking a question such as "Where are the doughnuts?"), most search tools retrieve "hits" or "matches" by seeking occurrences of your search terms within their database and by attempting to match the terms (converted to a "string" of data bits) against their index. Because the terms are converted to a digital string and matched literally, the search engine must somehow be instructed to include plurals and alternate forms of a term.
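To see why plurals and alternate forms need special handling, here is a minimal Python sketch. It is only an illustration: the function names and the "add or strip a trailing s" rule are my own assumptions, and real search tools use far more careful techniques.

```python
def exact_match(term, text):
    """Literal matching: the term must appear exactly as typed."""
    return term.lower() in text.lower().split()


def crude_variants(term):
    """Return the term plus a naive singular/plural form (illustrative only)."""
    term = term.lower()
    variants = {term}
    if term.endswith("s"):
        variants.add(term[:-1])   # doughnuts -> doughnut
    else:
        variants.add(term + "s")  # doughnut -> doughnuts
    return variants


def variant_match(term, text):
    """Match if any crude variant of the term appears in the text."""
    words = set(text.lower().split())
    return bool(crude_variants(term) & words)


page_text = "Fresh doughnuts are sold here every morning"
print(exact_match("doughnut", page_text))    # False: the string "doughnut" is not a word on the page
print(variant_match("doughnut", page_text))  # True: the plural form is also tried
```

A literal match on doughnut misses a page that only says doughnuts, which is why a search tool has to be told about alternate forms.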
What's a 'bot?

A 'bot, otherwise known as an intelligent agent, spider, crawler, robot, or worm, is an automated device (software) which may be programmed to search for terms (data "strings") matching certain criteria. In terms of web search engines, a 'bot identifies and notes the URLs of web pages to be included in the database. Later, another 'bot comes along and works on the interiors of the web documents, recording occurrences of words and their positions within the text. This information is used to create a huge index. 'Bots travel along the links of a web site; that is, they crawl or traverse from one hypertext link to another.

What's the index for?

The index is how the search engine locates the URLs which match your request. The web documents containing the query keywords are presented as a listing which may include a brief summary of each site. A simple way to understand the index is to think of it as a computerized book index. To discover where a topic occurs in a book, we would look up the word in the index, which would indicate the page number(s) where the term occurs. Now imagine that every single word is included in the book index. A computerized version might be represented like this:
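As a rough sketch of the "every word is in the index" idea, the Python below builds a tiny index that records, for each word, which documents contain it and at which word positions. The two pages, their URLs and their text are invented for illustration; a real search engine's index is vastly larger and more elaborate.

```python
from collections import defaultdict

# Hypothetical "pages" standing in for web documents (made-up URLs and text).
pages = {
    "http://example.com/fruit": "apple pie and apple juice with orange zest",
    "http://example.com/vine": "grape harvest and apple picking",
}


def build_index(documents):
    """Map every word to the documents and word positions where it occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word][url].append(position)
    return index


def lookup(index, word):
    """Return {url: [positions]} for a word, or an empty dict if it never occurs."""
    return {url: list(pos) for url, pos in index.get(word.lower(), {}).items()}


index = build_index(pages)
print(lookup(index, "apple"))
# {'http://example.com/fruit': [0, 3], 'http://example.com/vine': [3]}
print(lookup(index, "banana"))
# {}
```

Looking up a word is now a single step, just like flipping to the back of a book, instead of rereading every page for each query.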
Some immediate observations might include:
How does a search engine decide how to list web sites matching my search terms?

Each search engine uses a different algorithm or method to calculate something called "relevance," which it then "ranks." Have you ever noticed the numbers which sometimes appear next to the URLs in a listing of search results? This is the "relevance ranking." Relevance means the probability that the "hit" or "match" is on-target with your query. The creators of search engines change the way they calculate relevance and do not tell us mere users their methodology; being high in the major search engines' rankings on a topic means big business.

Sometimes Web site owners try to skew the odds of appearing on the first page of "results" for folks searching specific keywords. Being on the first page or in the top results increases the likelihood that the site will be seen and hence selected by the user. Unscrupulous folks "spam" the search engine to try to improve their rankings (and hence, their Web-based business) by a variety of methods, including using "invisible text" (where text is colored the same as the background) or repeatedly using keywords in "meta-tags" (descriptive information not usually seen by the user except when viewing the "page source" -- see below). Perhaps most unsettling is the rising trend of some search engines which, in effect, sell higher ratings to companies willing to pay for the privilege. Most users will be unaware that the set of search results has in effect been manipulated to boost these companies' ratings artificially.
In general, however, relevance is calculated by noting where the term occurs within the text and assigning this position a "weight" or level of importance. Some search utilities also include a popularity element in the relevance algorithm; that is, the more a site is linked to or used, the higher the rating. Search terms occurring in the title, in the summary, in key positions within a paragraph, or several times within a paragraph usually carry more "weight" because there is a higher probability that terms in these positions indicate significant material on the topic. This is very similar to our book index example above; because the term apple occurs many times and in key positions (title, table of contents, beginning of paragraphs) there is a high probability that the document contains significant information about apple. Note that orange also occurs in the table of contents, an indication of the term's relative importance (it is a significant topic, but not as important as apple).

The algorithm of the search engine and the methodology it uses to calculate relevance emulate the observations and judgments we make based on our experience, but only up to a point. A search engine will return the book in our index example as a hit when the search terms apple and grape are requested, whereas a human might judge that although both terms occur within the document, there is no significant relationship between them and the document is hence irrelevant.

Some search engines look only in certain fields to index documents, such as the title field, the first paragraph and something called "meta-tags." Meta-tags allow the creator of a web site to add descriptive keywords which are not displayed in the actual web documents; they exist specifically to enhance retrieval of the document. As people "spam" the search engine (for example, by repeating terms over and over again), meta-tags are decreasing in importance because the folks that program the 'bots train them to overlook repetitions and other clues to "spamming."
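To make the idea of position "weights" a little more concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the field names, the weight values, the cap on repeated terms and the two sample documents are invented, and it does not model the popularity element at all. As noted above, real search engines do not publish their formulas.

```python
# Hypothetical weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 5.0, "summary": 3.0, "body": 1.0}
# Count at most this many repetitions per field, a crude guard against keyword stuffing.
MAX_COUNTED_REPEATS = 3


def relevance(document, query_terms):
    """Score a document (a dict of field name -> text) against a list of query terms."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        for term in query_terms:
            occurrences = words.count(term.lower())
            score += weight * min(occurrences, MAX_COUNTED_REPEATS)
    return score


# Made-up documents for illustration.
documents = {
    "apple-guide": {
        "title": "Growing apple trees",
        "summary": "How to plant and prune an apple orchard",
        "body": "An apple tree needs sun. Orange trees need more warmth.",
    },
    "fruit-spam": {
        "title": "Fruit",
        "summary": "",
        "body": "apple apple apple apple apple apple apple apple",
    },
}

ranked = sorted(documents, key=lambda name: relevance(documents[name], ["apple"]), reverse=True)
for name in ranked:
    print(name, relevance(documents[name], ["apple"]))
# apple-guide 9.0
# fruit-spam 3.0
```

The page that mentions apple once each in its title, summary and body outranks the page that simply repeats the word eight times, because position counts for more than raw repetition in this toy scheme.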
What's the best search engine?

I'm sure I'm going to disappoint a lot of folks by giving the answer "the best search engine is the one that fits the task" instead of recommending a particular utility. Until you have some experience with knowledge seeking tools and, importantly, with identifying your real information need, it may be difficult to ascertain which tool is best for your purpose. (For example, a query on "Leonardo da Vinci's Mona Lisa" is likely to be more successful than "that lady with the smile by a Renaissance artist" or simply "da Vinci," and "dosage and usage guidelines for St. John's Wort" is likely to be more successful than "St. John's Wort.") But the good news is, you will make better choices with experience.

What do I use? Well, that depends....

Remember I am a librarian in an academic (college) library, so I never know what the next information request will be (that's the fun part!). But this means in practical terms that I am looking for information in a variety of places, which precludes having a standard game plan..... here are a few of my search tactics/favorite tools:
What are simple ways to make my search more effective?

A very effective way to increase the relevance or precision of "hits" is to search as a phrase. In most cases this simply means putting quotation marks around the search terms. "Red socks" is a different search than red socks in most search engines. What you are actually doing by searching as a phrase is using the concept of proximity, which concerns the terms' physical closeness to one another. A document with red and socks occurring close or next to each other is more likely to be on target than a document with red in the title and socks buried in the text.

Another way to increase your search effectiveness is to be as specific as possible; that is, to include as many terms and synonyms as you can think of to fully describe your topic. Instead of women and computers, add synonyms and related terms for each concept.

Note: search utilities may not support the use of parentheses (called nesting) in basic searches, although many support them in their "advanced" searches.

So to recap, phrase searching and specificity are two simple ways to increase precision in searching.

What are the most popular and useful search utilities? (the "major" search engines)

Ok folks. We are looking at a sampling of search engines and describing generalities; we are not attempting to create a definitive listing. For example, we'll be discussing meta search engines in Lesson 6, so you won't find them listed here.
Originally developed by Digital Equipment Corporation, Alta Vista searches the Web and Usenet. In its very large database, both simple and advanced searching are supported with the ability to limit searches to select portions of web documents. For example, it is possible to limit searches to title, domains, images and links within Web documents and by particular newsgroups or subjects in Usenet. Also, ability to browse by subject (although this is rather slow).

Search site featuring a very large database and a lot of "extras" such as: Excite Channels (guide to sites by subject), stock quotes, news, tv and searching of Newsgroups. Offers concept searching.

Voted no. 1 among search engines by PC Magazine, Hot Bot offers a sophisticated interface with a vast array of options such as: searching by dates, by certain domains in the U.S. (e.g. .com, .org, .edu, .gov), by media type (e.g. image, audio, video). Also, a huge database, powerful advanced searching options, access to other search tools by type and a subject guide.
Specialized Search Engines and Collections:

Specialized search engines are most often programmed to "collect" web documents along a topical theme, for example the Arts, Science, Health-related topics or even more specialized subjects such as Ancient History of the Mediterranean.

Also fitting in this category are "search tools" that really calculate rather than retrieve information (such as those fitting in the "distance between two points" or "salary differential" categories). Since it is impossible to list specific tools here, the following are sites which group or list subject specific search engines or tools:
(http://www.beaucoup.com) Beaucoup is a collection of approximately 1000 search engines, directories and indices from all over the world, organized into categories such as: General Searchers, Reviewed Sites/What's New, Software, Reference, Education, Art/Graphics, Social/Environmental/Political Concerns, and Consumer Medicine. Good starting point for popular subjects.

(http://library.albany.edu/internet/engines.html#collections) from the University of Albany by Laura Cohen.

For more information visit these excellent sites:
(http://www.monash.com/spidap.html)
(http://www.searchenginewatch.com)
(http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html) Teaching Library Internet Workshops, Teaching Libraries of University of California, Berkeley

Assignments

This week we are going on an Infoquest! Please find answers to the following questions using either a subject directory that we discussed in Lesson 2, or a search engine. Remember -- there are many routes to the same information....
Evaluate the following "major" search engines:
Find a search facility that will help you find the following types of information. Please include a sample question/reason for inquiry.
Last updated: November 6, 2002; Links checked: November 6, 2002
Copyright © 1998-2002, Angela Elkordy, Virtuallibrarian, Trainer and Corporate Librarian :) virtuallibrarian@hotmail.com
http://www.thelearningsite.net/cyberlibrarian/searching/lesson3.html