Metasearch engine seeks to lessen Web's chaos
From Kansas City Star
By David Hayes
A lot of people got really bent out of shape last month when a think tank reported that there are 800 million pages out on the Web, but even the most thorough Web search engine lists only about 16 percent of them.
The sheer number of Web pages reported by the NEC Research Institute in Princeton, N.J., was amazing. Three years ago, the top Web search engine listed what seemed then an astounding 30 million pages.
But all this concern over what is and isn't listed by www.northernlight.com, www.altavista.com, www.snap.com and the dozens of other search programs? I don't get it.
The Web is a mess. Type "cat" into most search engines and you'll get back everything from virtual pet cemeteries to gourmet food shops to online porn emporiums.
I don't want more. I want better.
That conundrum has been burned into Susan Gauch's computer screen for about six years. Gauch, a computer scientist at the University of Kansas Information and Telecommunication Technology Center, is the mind behind www.profusion.com, a "metasearch" engine that uses other search engines and their past performance to come up with better results.
Gauch knew there had to be a better way to search the Web when she started working on a grant proposal for a research project after moving to KU in 1993. She was one of the first to come up with a working metasearch engine -- a single Internet program that searched four other search engines at the same time.
In general, search engines look for key words and determine how often a word appears on a Web page. That's why if you type in the word dog, "you're likely to get a page from a kennel club where the word appears three times and somebody's personal home page that mentions `my dog Buffy' three times," Gauch said.
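Gauch's dog example can be reproduced in a few lines of Python. This is only a minimal sketch of ranking by raw word count, not the code of any real search engine; the function names and the sample pages are made up for illustration.

```python
import re
from collections import Counter


def term_count(text, term):
    """Count how many times `term` appears as a whole word in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)[term.lower()]


def rank_by_term_count(pages, term):
    """Return (count, url) pairs for pages that mention `term`, highest count first.

    `pages` is a hypothetical mapping of URL -> page text.
    """
    scored = [(term_count(text, term), url) for url, text in pages.items()]
    matches = [pair for pair in scored if pair[0] > 0]
    return sorted(matches, key=lambda pair: (-pair[0], pair[1]))


# Made-up sample pages: a kennel club and a personal home page.
pages = {
    "kennel-club.example": "Dog shows, dog breeds and dog training classes.",
    "home-page.example": "My dog Buffy is great. My dog Buffy naps. I love my dog Buffy.",
    "recipes.example": "Gourmet food for cats.",
}

ranking = rank_by_term_count(pages, "dog")
```

Both pages mention the word three times, so a counter this simple cannot tell the kennel club from Buffy's owner.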
Profusion, however, searches other search engines and also looks at the way Web surfers use the results. It considers which pages on the list they click on and how long they stay on each page. That information, gathered anonymously, is used to rate the pages Profusion shows in future searches.
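The general idea of merging several engines' lists and nudging the order with visitor feedback can be sketched as follows. This is not Profusion's actual method, which the article does not spell out; the reciprocal-rank scoring, the feedback weights and every name here are assumptions chosen for illustration.

```python
from collections import defaultdict


def merge_results(engine_results, feedback=None):
    """Merge ranked URL lists from several engines into one ranked list.

    engine_results: hypothetical mapping of engine name -> URLs, best first.
    feedback: hypothetical mapping of URL -> (click count, average seconds on page).
    """
    feedback = feedback or {}
    scores = defaultdict(float)

    # Assumed scoring rule: a page earns 1/position from each engine that lists it.
    for urls in engine_results.values():
        for position, url in enumerate(urls, start=1):
            scores[url] += 1.0 / position

    # Assumed feedback rule: more clicks and longer visits raise the score.
    for url in scores:
        clicks, avg_seconds = feedback.get(url, (0, 0.0))
        scores[url] *= 1.0 + 0.1 * clicks + 0.01 * avg_seconds

    return sorted(scores, key=lambda url: (-scores[url], url))


engine_results = {
    "engine_a": ["x.example", "y.example"],
    "engine_b": ["y.example", "z.example"],
}

plain = merge_results(engine_results)
boosted = merge_results(engine_results, {"z.example": (10, 60.0)})
```

Without feedback, the page both engines list comes first. Once visitors have clicked on z.example often and lingered there, it climbs past x.example.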
The early version of Profusion has blossomed into a commercial venture based in Michigan. Gauch, at KU, continues as chief technical officer.
About 500,000 people use Profusion every month now, and the search program has been improved further. A new feature lets Web searchers limit a search to sites that specialize in a topic.
For instance, the metasearch engine can be told to look for sports-related information only on top sports news sites.
"This gives much more up-to-date information," Gauch said. "The assumption is that while AltaVista may list pages only every six months, www.espn.com probably indexes its own site every evening."
That eliminates another problem found by the NEC researchers. The study found that new Web pages aren't cataloged by the search engines for an average of six months.
Profusion currently has "on target" specialty searches for entertainment, health, children, MP3 music, sports and USENET postings. More are planned.
The NEC study found that the overlap between different search engines is fairly low and suggested that metasearch engines may be the most efficient way to search the Web.
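Why low overlap favors metasearch is easy to see with a little set arithmetic. The sketch below uses invented result lists, not figures from the NEC study, and the function names are hypothetical.

```python
def overlap(results_a, results_b):
    """Share of all distinct pages that both engines returned (Jaccard index)."""
    a, b = set(results_a), set(results_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_coverage(result_lists):
    """Number of distinct pages found when every engine's list is pooled."""
    seen = set()
    for urls in result_lists:
        seen.update(urls)
    return len(seen)


# Invented result lists for two hypothetical engines.
engine_a = ["p1", "p2", "p3"]
engine_b = ["p3", "p4"]

shared = overlap(engine_a, engine_b)
total = combined_coverage([engine_a, engine_b])
```

Here the two engines share only one page in four, so pooling them turns up four distinct pages where the better engine alone found three.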
But Gauch isn't stopping there.
"It's not just that there's 800 million pages, but also that they vary wildly in quality," Gauch said. "Some information is good, some information is poor."
She's now working on temporal Web searches. She wants to find a way to rate information on the Web based on when it was last updated.
"There's a lot of out-of-date information out there," Gauch said. "Once it's posted, it doesn't self-clean."
However, because faster is sometimes better, there's a new search engine worth checking out.
www.alltheweb.com is without question the quickest draw on the Web. Using a high-speed data line, I keyed in "kansas city" and received 210,060 listings in less than half a second.
In comparison, the Web's largest search engine, www.northernlight.com, took eight seconds but came back with more -- 568,940 listings. The new www.google.com search engine took only four seconds, but stopped at just under 15,000 pages.
Among other top searchers: www.snap.com, eight seconds; www.hotbot.com, six seconds; http://search.msn.com, 15 seconds; www.directhit.com, six seconds; and www.infoseek.com, seven seconds.
For more information, contact ITTC.