Probabilistic Information Retrieval and Language Model for Information Retrieval

WSM Assignment 2: Probabilistic
Information Retrieval and
Language Model for Information
Retrieval
1. [20 points] Probabilistic Information Retrieval.
? [10/20 points] What are the differences between standard vector space
tf-idf weighting and the RSJ probabilistic retrieval model (in the case
where no document relevance information is available)?
? [10/20 points] Describe the differences between vector space relevance
feedback and probabilistic relevance feedback.
2. [15 points] Language Model.
? Consider making a language model from the given following training text:
?
? the martian has landed on the latin pop sensation ricky martin
?
? [6/15 points] Under a MLE-estimated unigram probability model,
what are the probabilities of P(the) and P(martian)?
?
? [9/15 points] Under a MLE-estimated bigram model, what are the
probabilities of P(sensation | pop), P(pop | the), and P(pop | latin)?
?
3. [30 points] Language Model for Information Retrieval.
? Suppose we have a collection that consists of the 4 documents given in
the below table.
?
?
Try to build a query likelihood language model for this document collection.
Assume a mixture model between the documents and the collection, with both
weighted at 0.5, and maximum likelihood estimation (MLE) is used to estimate
both as unigram models. Work out the model probabilities of the queries “click”,
“shears”, and hence “click shears” for each document, and use those
probabilities to rank the documents returned by each query.
? [24/30 points] Fill in these probabilities in the below table:
?
?
? [6/30 points] What is the final ranking of the 4 documents for the query
“click shears”?
4. [35 points] Classic Probabilistic Retrieval Model.
? [15/35 points] In the derivation of the Robertson-Sparck-Jones (RSJ)
model (see the slides and the survey paper by Norbert Fuhr for detail
about this derivation), a multi-variate Bernoulli model was used to model
term presence/absence in a relevant document and a non-relevant
document. Suppose, we change the model to a multinomial model (see
the slide that covers both models for computing query likelihood). Using a
similar independence assumption as we used in deriving RSJ, show that
ranking based on probability that a document is relevant to a query Q,
i.e., p(R=1 | D,Q), is equivalent to ranking based on the following
formula:
?
? where the sum is taken over all the word in our vocabulary (denoted by
V). How many parameters are there in such a retrieval model?
?
? [5/35 points] The retrieval function above won’t work unless we can
estimate all the parameters. Suppose we use the entire collection
C={D1,…,Dn} as an approximation of the examples of non-relevant
documents. Propose an estimate of p(w|Q,R=0). (Hint: study the slide
about how to do this for the RSJ model.)
?
? [5/35 points] Suppose we use the query as the only example of a relevant
document. Propose an estimate of p(w|Q,R=1).
?
? [5/35 points] With the two estimates you proposed, we should now have a
retrieval function that can be used to compute a score for any document D
and any query Q. Write down your retrieval function by plugging in the
two estimates. Can your retrieval function capture the three major
retrieval heuristics (i.e., TF, IDF, and document length normalization)?
How?
?
? [5/35 points] Do you believe your formula would work well as compared
with a state of the art formula such as BM25? Can you propose a way to
further improve your formula? (While it’s the best if you could improve
your formula through using an improved estimate of p(w|Q,R=1), it would
also be fine to propose any reasonable heuristic modification of the
formula.)
?
Submission Details
? Due: 18:10 in class, Tuesday, 18 April 2017
? What to turn in:
?
? Please turn in a hardcopy of your written answers

find the cost of your paper