Corpus Linguistics for Information Retrieval
Project Award Date: 0000-00-00
Rapidly increasing storage media capabilities and spreading interconnectivity have heralded the arrival of the information age. Unfortunately, accessing on-line information remains an inexact science. While valuable information can be found, typically many irrelevant documents are also retrieved and many relevant ones are missed. Terminology mismatches between the user's query and document contents are one cause of retrieval failures. Expanding a user's query with related words can improve search performance, but the problem of identifying related words remains.
This research uses corpus linguistics techniques to automatically discover word similarities directly from the contents of an untagged textual database and to incorporate that information in an information retrieval system. These similarities are calculated based on the contexts in which the words appear. Using these similarities, user queries are automatically expanded, resulting in conceptual retrieval rather than requiring exact word matches between queries and documents. The effects of using different algorithms to calculate the similarities and the effects of expanding different sets of query words are evaluated.
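The idea of context-based similarity and query expansion can be illustrated with a minimal sketch. It is not the project's actual algorithm: the window size, the use of raw co-occurrence counts, the cosine measure, and all function names (context_vectors, cosine, most_similar, expand_query) are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, k=2):
    """Return up to k words whose context vectors are closest to `word`'s."""
    if word not in vectors:
        return []
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:k] if score > 0]


def expand_query(query_terms, vectors, k=2):
    """Append the k most similar words for each query term, without duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for candidate in most_similar(term, vectors, k):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Hypothetical toy corpus: untagged, already tokenized text.
corpus = [
    ["the", "doctor", "treated", "the", "patient"],
    ["the", "physician", "treated", "the", "patient"],
    ["the", "dog", "chased", "the", "ball"],
]
vectors = context_vectors(corpus)
print(expand_query(["doctor"], vectors, k=1))
```

In this toy corpus "doctor" and "physician" occur in identical contexts, so a query for one is expanded with the other even though the two words never appear together. A realistic system would likely need frequency weighting and stop-word handling, which this sketch omits.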
In addition, the search performance of the retrieval engine serves as a task-based method for comparing the quality of word-word similarities calculated using different corpus linguistics techniques.
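A task-based comparison of this kind can be sketched as follows: run the same queries with each similarity technique driving the expansion, then score the resulting rankings against relevance judgments. The choice of mean average precision as the metric is an assumption, as are the function names and the data layout; the source does not state which measure the project used.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, judgments):
    """Mean of per-query average precision.

    runs:      {query_id: ranked list of document ids} for one similarity technique
    judgments: {query_id: set of relevant document ids}
    """
    scores = [
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in judgments.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def rank_techniques(runs_by_technique, judgments):
    """Order similarity techniques from best to worst retrieval score."""
    scored = {
        name: mean_average_precision(runs, judgments)
        for name, runs in runs_by_technique.items()
    }
    return sorted(scored.items(), key=lambda item: -item[1])
```

Under this setup, the technique whose expanded queries yield the higher score is taken to produce the better word-word similarities for retrieval purposes.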
We have demonstrated improved search results on the TREC-5 database and dramatic improvements with the Cystic Fibrosis database. Work is currently being done to extend the results to distributed databases.
For More Information: http://www.ittc.ku.edu/~sgauch/corpus.html
Primary Sponsor(s): NSF (Infrastructure Grant)