Efficient And Effective Large-Scale Textual Search

Anagha Kulkarni, Computer Science Department, San Francisco State University

Stevenson Hall 1300
12:00 PM - 12:50 PM

Traditional search solutions for large datasets assume access to practically unlimited computational resources, and thus cannot be employed by small-scale organizations. Our work introduces Selective Search, a new retrieval approach that processes large volumes of data efficiently and effectively in computationally constrained environments. To achieve this, Selective Search partitions the dataset into subsets (shards) in such a way that only a few selected shards need to be searched at query execution time. The dataset is divided into shards based on document similarity, creating topically homogeneous partitions (e.g., politics, sports, technology, and finance). This topic-based organization concentrates the relevant documents for a query into a few shards. During query evaluation, the few shards that are likely to contain the relevant documents for the query are identified and searched. Empirical evaluation using some of the largest available datasets (e.g., half a billion web pages) demonstrates that Selective Search reduces search costs dramatically without degrading search effectiveness, and does so using very few computational resources.
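
The partition-then-select idea described above can be illustrated with a toy sketch. It is not the implementation presented in the talk: the abstract does not say which clustering or shard-ranking methods are used, so this sketch assumes bag-of-words vectors, a small k-means-style loop with cosine similarity, and shard ranking by similarity between the query and each shard's centroid. All function names, the seed documents, and the sample data are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_shards(docs, seed_ids, rounds=5):
    """Partition docs into topical shards (assumed k-means-style clustering).

    seed_ids picks the documents used as initial centroids; this is an
    illustrative choice, not something the abstract specifies.
    """
    vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    centroids = [Counter(vectors[s]) for s in seed_ids]
    shards = []
    for _ in range(rounds):
        shards = [[] for _ in centroids]
        for doc_id, vec in vectors.items():
            best = max(range(len(centroids)),
                       key=lambda i: cosine(vec, centroids[i]))
            shards[best].append(doc_id)
        # Recompute each centroid as the summed term counts of its members.
        for i, members in enumerate(shards):
            if members:
                total = Counter()
                for member in members:
                    total.update(vectors[member])
                centroids[i] = total
    return shards, centroids, vectors


def selective_search(query, shards, centroids, vectors, num_shards=1, top_k=3):
    """Rank shards for the query, then search only the top num_shards."""
    q = vectorize(query)
    ranked = sorted(range(len(shards)),
                    key=lambda i: cosine(q, centroids[i]), reverse=True)
    selected = ranked[:num_shards]
    candidates = [doc_id for i in selected for doc_id in shards[i]]
    results = sorted(candidates,
                     key=lambda d: cosine(q, vectors[d]), reverse=True)
    return selected, results[:top_k]


# Hypothetical six-document collection covering three topics.
docs = {
    "d1": "election vote senate policy vote",
    "d2": "senate election campaign policy",
    "d3": "football match goal team league",
    "d4": "team league goal season coach",
    "d5": "stock market shares bank interest",
    "d6": "bank interest rates market bonds",
}
shards, centroids, vectors = build_shards(docs, seed_ids=["d1", "d3", "d5"])
selected, results = selective_search("senate election", shards, centroids, vectors)
```

In this toy run the query touches only the shard holding the two politics documents, so four of the six documents are never scored. That is the cost saving the abstract describes, shown at a trivially small scale.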