Sunday, August 1, 2010

How good is a span of terms?: exploiting proximity to improve web retrieval

Proximity is probably the most important ranking signal there is. In my experience, it is sometimes more important than static ranking (such as PageRank) or dynamic ranking (based on some machine learning approach). The problem with proximity is how to choose the span window, that is, the maximum distance allowed among matching query words for them to count as close.
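To make the notion of a span concrete, here is a minimal Python sketch that finds the shortest window of a document containing every query term. It is my own illustration, not the paper's span definition: the function name `shortest_span` and the whitespace tokenization are assumptions, and a real system would work on positional postings rather than raw token lists.

```python
from collections import Counter


def shortest_span(doc_tokens, query_terms):
    """Return (start, end), inclusive token indices of the shortest window
    in doc_tokens that contains every distinct query term, or None if the
    document does not contain all of them.

    Hypothetical helper for illustration only.
    """
    needed = set(query_terms)
    if not needed:
        return None

    counts = Counter()   # occurrences of each query term inside the window
    covered = 0          # number of distinct query terms inside the window
    best = None
    left = 0

    for right, token in enumerate(doc_tokens):
        if token not in needed:
            continue
        counts[token] += 1
        if counts[token] == 1:
            covered += 1

        # While the window covers all terms, record it and shrink from the left.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            left_token = doc_tokens[left]
            if left_token in needed:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    covered -= 1
            left += 1

    return best


doc = "the quick brown fox jumps over the lazy brown dog fox".split()
span = shortest_span(doc, ["brown", "fox"])
```

A fixed maximum window would then accept or reject this span by its length alone; the question raised above is what that maximum should be for a given query.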

"How good is a span of terms?: exploiting proximity to improve web retrieval" is an interesting paper from Microsoft Research, which extends traditional BM25 ranking with the idea of a dynamically chosen span window. The optimal window is then selected using a machine learning approach based on LambdaRank.

Two interesting results shown in the paper are (a) the fact that head queries and tail queries need different span windows, and (b) the fact that sentences extracted from large collections such as Wikipedia produce limited benefit.

One assumption made by the paper is that the goodness of a document can be modeled as the sum of the feature vectors describing each of its spans. Machine learning is then used to learn the weights in the associated linear combination. I am not sure that a linear model is the most appropriate choice here.
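A rough sketch of what such a linear model looks like in Python is given below. The three features (coverage, density, span count) and the hand-picked weights are hypothetical placeholders of mine, not the features or learned weights from the paper; the point is only the shape of the computation, which sums per-span vectors and then takes a weighted combination.

```python
def span_features(span_length, hits, query_length):
    """Hypothetical feature vector for one span.

    span_length:  number of tokens in the span
    hits:         number of query terms matched inside the span
    query_length: number of terms in the query
    """
    return [
        hits / query_length,  # coverage: fraction of the query matched
        hits / span_length,   # density: matched terms per token
        1.0,                  # constant, so the sum counts spans
    ]


def document_score(spans, weights, query_length):
    """Sum the span feature vectors, then take a linear combination."""
    total = [0.0] * len(weights)
    for span_length, hits in spans:
        features = span_features(span_length, hits, query_length)
        for i, value in enumerate(features):
            total[i] += value
    return sum(w * t for w, t in zip(weights, total))


# Two spans for a two-term query: a tight full match and a looser partial one.
spans = [(2, 2), (4, 1)]
weights = [1.0, 2.0, 0.5]  # fixed by hand here; learned in the paper
score = document_score(spans, weights, query_length=2)
```

Because the spans are summed before weighting, many weak spans can add up to the same score as one strong span, which is one reason to wonder whether a linear model fits.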
