More Search Engine Hype and Reality.
How well do Google's indexing and PageRank features measure up?
Don't look for the word "Google" in your dictionary. It will not be there. Nor is it one of the words that I make up to the dismay of my editors (and my students). It is a derivative of "googol." And what is a googol? It is 10^100, i.e., 1 with a hundred zeros after it. It was coined, according to the Merriam-Webster Dictionary, by Milton Sirotta nearly 50 years ago. So what is Google? It is a new search engine (http://www.google.com) developed by talented Stanford University graduates. And anything that comes out of Stanford and has to do with the Internet will make investors and venture capitalists see a googol of money. But can you take it to the bank? I will tell you in this sequel to my previous column about another much-hyped Web search engine, Direct Hit.
What Is Google?
Google is a search engine with a twist. It crawls Web pages, as other search engines do, and indexes them. At this stage it has a relatively small collection of about 30 million Web pages compared to the biggest collections: Northern Light, AltaVista, and HotBot each have at least three times as many pages.
It is the indexing that is different and unique (for those who have never heard of citation indexing, that is). Each page is automatically assigned a PageRank that is calculated from a) how many other pages refer, or link, to the page, and b) how important the linking pages are, where importance is itself measured by the PageRank of those pages. If this idea sounds familiar, that's because it is. The concept is analogous to the one that Eugene Garfield developed 30 years ago, and that the Institute for Scientific Information implemented in its citation database and in the Journal Citation Reports (JCR). PageRank is somewhat similar to (although far less sophisticated or scholarly than) the Impact Factor of journals. It sounds good, but it has some flaws.
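The two ingredients described above (how many pages cite a page, and how highly ranked those citing pages are) can be illustrated with a small sketch. This is a toy model under stated assumptions, not Google's actual implementation: the damping value of 0.85, the even redistribution of score from pages with no outgoing links, the fixed number of passes, and all page names are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by repeated redistribution; `links` maps page -> pages it cites."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Score held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each citing page splits its own score among the pages it cites.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical mini-Web: "hub" is cited by three pages, "loner" by none.
toy_web = {
    "fan1": ["hub"],
    "fan2": ["hub"],
    "fan3": ["hub"],
    "hub": ["cited_by_hub"],
    "loner": ["cited_by_loner"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:15s} {score:.3f}")
```

In this toy Web, "cited_by_hub" and "cited_by_loner" each receive exactly one link, yet the first ends up with the higher score because the page citing it is itself well cited. That is point b) at work.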
There is a simple input cell where you enter your search request. It may be a single word (publishing), a combination of words with an implied "and" relationship (Web database publishing), an exact phrase ("Web database publishing"), or two or more words with other words excluded (Web database publishing -Java). Google does not allow an "or" operator, nor does it allow truncation symbols. This means that one needs to formulate two or more queries to retrieve pages that include both the singular and the plural form (database or databases), and the different spellings of both (data base, data bases). This is a surprising and inconvenient limitation in 1999. A simple checkbox next to the query cell could be used to allow users to enable or disable stemming.
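To make the inconvenience concrete, here is a minimal sketch of the bookkeeping a searcher has to do by hand when there is no "or" operator and no truncation: one separate query per combination of variants. The function name and its parameters are hypothetical, and the query syntax (quotation marks for phrases, a leading minus sign for excluded words) is assumed from the examples in this column.

```python
from itertools import product


def expand_queries(term_variants, excluded=()):
    """Build one implied-"and" query per combination of spelling variants.

    `term_variants` is a list of lists; each inner list holds the
    alternative forms of one search term.
    """
    queries = []
    for combo in product(*term_variants):
        # Multi-word variants must be quoted to be searched as exact phrases.
        parts = [f'"{term}"' if " " in term else term for term in combo]
        parts += [f"-{word}" for word in excluded]
        queries.append(" ".join(parts))
    return queries


variants = [
    ["Web"],
    ["database", "databases", "data base", "data bases"],
    ["publishing"],
]
for query in expand_queries(variants, excluded=["Java"]):
    print(query)
# Web database publishing -Java
# Web databases publishing -Java
# Web "data base" publishing -Java
# Web "data bases" publishing -Java
```

Four variants of a single term already mean four separate searches; with variants on a second term the number multiplies.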
Entering "Information Today" yields 6,078 hits (Figure 1). It sounds flattering, but the overwhelming majority of hits were pages that included the query term as "News Information; Today's Weather," or "Call for information today" (Figure 2). The entries in the result list show the PageRank of the cited page, its URL, an excerpt from the page with the matching term, a hotlink to the page when it was cached (for indexing), the size of the page, and the number of times the query term occurs on the page.
Clicking on the URL following the PageRank score takes you to the current page, and clicking on the cached hotlink takes you to the page as it was when it was indexed. Clicking on the relevance bar will list the citing pages (Figure 3). I tried the first five but none of them seemed relevant (a science summer camp, a poem by Heinrich Heine in German, a wine gourmand page, etc.), and the term "Information Today" did not appear on any of the citing pages.
You might say that this particular query is difficult to handle because of the excessive noise even when doing an exact phrase search. I tried a couple of others, too. For example, I searched for the term Google (fair enough, no?) on several search engines and looked at the first 10 hits. The results were very similar. Every search engine picked up the Stanford sites first, then every one of them brought up among the top 10 hits one or more sites about the "Barney Google" comic strip.
Then I tried the link search (the manual alternative to what Google automates) in three search engines that offer this option (AltaVista, HotBot, and Infoseek), and the top 10 sites that cited (linked to) Google were more relevant than the citing sites identified by Google itself. Some of the differences are of course due to the fact that Google has by far the smallest collection of these search engines.
An Interesting Concept, but
Although the concept of finding the most important sites about a topic through links is interesting, the implementation as of March 1999 is not convincing. More importantly, I found some naive ideas surfacing in the heavy-breathing press coverage of Google. The most troublesome one to me is the idea that the PageRank is an objective measure that is not influenced by spamming.
Spamming has several meanings, of course. One of them is a Web site design practice in which designers repeat relevant words in the title or first paragraph to get high ranking. The smarter ones hide these words from the naked eye by using the same color for the background and the font, still leaving the words visible for the crawler programs. (Some search engines now watch out for this trick, and ignore sites that do this or deduct from their relevance scores.) Another variety of spamming is to use catchy words and their synonyms to fool the crawler programs and the users. For example, someone selling bathtubs flooded his page with terms that are used most often in Web searches, such as "free" and "rich," to name a few that are fit to print. His page certainly got a lot of visits, and some visitors may have indeed ordered an indoor jacuzzi (in anticipation of the good times), but the guy threatened the search engine company with a lawsuit for loss of business when his site was excluded from the index.
Now, one of the claims about Google is that its PageRanks cannot be manipulated by spamming because a page must earn its rank by being cited by other pages with high PageRank. Well, citation indexing has been around for decades but still cannot (and does not claim to) be quite objective. (ISI deserves credit for warning users of the possible reasons for bias in the rankings in the help file of the CD-ROM version of the Journal Citation Reports.) I am a great fan of citation analysis (especially the computer-assisted genre), but I am not blind to its limitations, and I treat ranks with a grain of salt. Citing others in scholarly publications is a must, but this does not mean that the authors' works really are related, or that the citing author has even read the cited works. Citations are often just courtesy citations that pay homage to the most-often-cited works in the specific area. Tenure-track professors learn sooner or later that it does not hurt to pad the bibliography by citing top-ranked journals, and to cite the journals their papers are submitted to. Citing is cheap and safe as long as the citation is positive and does not challenge the cited work. Such citation practices certainly distort the ranking of journals, and there are other factors as well. Competent users probably would not take at face value that in the most current ranking of journals, JCR lists MIS Quarterly as the journal with the second highest Impact Factor in library and information science. I certainly have doubts about it.
Citing other sites on a Web page is not a must, but it is common practice as every home page owner wants to share with us what he or she thinks the best sites for organic food, grunge bands, or search engines are. On the other hand, Web pages don't go through any kind of peer review (not as if peer review would be a panacea even if it is a double-blind one by really competent persons). Nor do they pass editorial review or any kind of authentication. You may dislike my Internet Insights column, but it goes through editors of a well-circulating periodical. This process provides a level of authentication (and corrects my grammar and simplifies my convoluted sentences). I keep saying that the Web is the largest vanity press and anything goes (including many real gems among the rubbish). And here is the rub.
I Cite Your Site for a Dime a Day
Young scam artists who are late to the traditional rip-off games (phone scamming, work-at-home scams, timeshare condos, and credit card deals) have engaged in or are readying themselves to enter the lucrative Web fraud market. According to the Internet Fraud Watch consumer group, 1,780 complaints were filed in 1997, and 7,752 in 1998. You can sense the excitement of the fraud crowd. If lowlifes see a dime in it they will offer to cite anyone from Web pages that they can spawn cheaply and endlessly under a variety of pseudonyms. With so many free offers for hosting Web pages, there is not even a minimal investment required. They will first cross-cite each other from their simplest Web pages made up of just a few hotlinks. This will establish their own relatively high PageRank. Then they move in with their offer: "I cite your site for 10 cents a month, success guaranteed overnight or your money back."
Do you think it is unlikely? Just look at Google's result lists now. Among the 1,065 citing pages there were way too many mind-numbing home pages with a hotlink to Google but nothing else. All of the citing pages had zero PageRank, as they were not yet cited by others (or had not been crawled by Google since they added the link). Every Jane and Joe put hotlinks on their sites without bad intent. Look at the typical home page that came up for a search about Google (Figure 4). I blocked out the name of the guy as he is just boring, not fraudulent. I don't care much for this typical home page gabbing or for his confessions about Web pages and aesthetics. But Google ranked it high, close to the home page of Sergey Brin, one of the developers of Google. Why this ranking? Because indeed there is a link to the site whose URL included the search term. There were dozens of similar home pages. Can you imagine what will happen when someone can get paid for posting these or worse pages? There is no editorial process so anything goes, and such Web pages can be posted in assembly-line fashion, like the pseudo-personal thank-you cards sent by politicians.
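The cite-for-pay scenario can be sketched with a toy model. It carries its own simplified scoring function so that it runs on its own; the damping value of 0.85, the treatment of pages with no outgoing links, the size of the "farm," and every page name are assumptions made for illustration, and nothing here reflects how Google actually computes or protects its ranks.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by repeated redistribution; `links` maps page -> pages it cites."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Score held by pages with no outgoing links is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def add_citation_farm(links, customer, size=10):
    """Return a copy of `links` plus `size` throwaway pages that cite each
    other in a ring and all cite `customer` (hypothetical paid citations)."""
    farmed = {page: list(targets) for page, targets in links.items()}
    for i in range(size):
        farmed[f"farm{i}"] = [f"farm{(i + 1) % size}", customer]
    return farmed


# Hypothetical mini-Web: a guide with real readers recommends "honest_shop";
# nobody cites "customer_shop".
honest_web = {
    "reader1": ["guide"],
    "reader2": ["guide"],
    "guide": ["honest_shop"],
    "honest_shop": [],
    "customer_shop": [],
}

before = pagerank(honest_web)
after = pagerank(add_citation_farm(honest_web, "customer_shop"))

print("honest_shop ahead before the farm:", before["honest_shop"] > before["customer_shop"])
print("customer_shop ahead after the farm:", after["customer_shop"] > after["honest_shop"])
```

In this toy setup, ten free throwaway pages are enough to push the paying customer past a shop that earned its single link from a guide with real readers. A real index is vastly larger, but the mechanism is the one described above.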
I bet a googol that this automatic page ranking will be disappointing unless the concept is refined. One possibility is to use a collection of potentially citing sources that have been editorially selected, like some of the best Web Subject Guides, and to use these to check which sites about the topic they link to the most often. I will talk about the latest Web software developments in a full-day session at the National Online Meeting. Sight me there.
Date: Apr 1, 1999