In an application we built, we use a simple statistical model for word prediction (similar to Google Autocomplete) to guide search.
It uses n-gram counts gathered from a large corpus of relevant text documents. Given the previous N-1 words, it suggests the 5 most likely next words in descending order of probability, using Katz back-off.
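
For concreteness, the single-word predictor is roughly shaped like the sketch below. To keep it short it substitutes a fixed-factor ("stupid") back-off for full Katz discounting, and the corpus and data structures are hypothetical:

```python
from collections import Counter, defaultdict

def build_ngram_counts(tokens, max_n):
    """Count all n-grams up to length max_n in a token stream."""
    counts = defaultdict(Counter)  # context tuple -> Counter of next words
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(counts, context, k=5, backoff=0.4):
    """Score candidate next words, backing off to shorter contexts.

    Uses a fixed back-off factor of 0.4 instead of Katz discounting,
    purely to keep the sketch short.
    """
    scores = {}
    weight = 1.0
    ctx = tuple(context)
    while True:
        nexts = counts.get(ctx)
        if nexts:
            total = sum(nexts.values())
            for word, count in nexts.items():
                # A longer context, seen first, takes precedence.
                scores.setdefault(word, weight * count / total)
        if not ctx:
            break
        ctx = ctx[1:]       # drop the oldest word and back off
        weight *= backoff
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]
```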
We would like to extend this to predict phrases (multiple words) instead of a single word. However, when we predict a phrase, we would prefer not to display its prefixes.
For example, given the input "the cat", we would like to make predictions like "the cat in the hat", but not the prefixes "the cat in" or "the cat in the".
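
Independent of how candidate phrases are scored, the prefix constraint itself can be enforced as a post-filter. A minimal sketch (the candidate list here is just for illustration):

```python
def drop_prefixes(phrases):
    """Remove any candidate that is a word-level prefix of another candidate."""
    tokenized = [p.split() for p in phrases]
    keep = []
    for i, toks in enumerate(tokenized):
        is_prefix = any(
            j != i and len(toks) < len(other) and other[:len(toks)] == toks
            for j, other in enumerate(tokenized)
        )
        if not is_prefix:
            keep.append(phrases[i])
    return keep

print(drop_prefixes(["the cat in", "the cat in the", "the cat in the hat"]))
# -> ['the cat in the hat']
```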
Assumptions:
We do not have access to past search statistics
We do not have tagged text data (for instance, we do not know the parts of speech)
What is a typical way to make this kind of multi-word prediction? We've tried multiplicative and additive weighting of longer phrases (a sketch of the multiplicative version is below), but our weights are arbitrary and overfit to our tests.
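For reference, the multiplicative weighting we tried amounts to a beam search where each extension multiplies in the next-word probability and an arbitrary per-word bonus `alpha`; `alpha`, `beam`, and `max_len` are exactly the knobs we have been hand-tuning. This reuses the hypothetical `predict_next` and `drop_prefixes` sketches above and assumes a trigram model (N=3):

```python
def predict_phrases(counts, context, k=5, beam=10, max_len=6, alpha=1.2):
    """Beam-search phrase completions; alpha > 1 rewards longer phrases.

    alpha is the arbitrary multiplicative weight referred to above; this
    scoring rule is a sketch of what we tried, not a recommendation.
    """
    beams = [(1.0, list(context))]        # (score, token sequence)
    finished = []
    for _ in range(max_len):
        expanded = []
        for score, seq in beams:
            # seq[-2:] assumes a trigram model (two words of context).
            for word, p in predict_next(counts, seq[-2:], k=beam):
                expanded.append((score * p * alpha, seq + [word]))
        if not expanded:
            break
        expanded.sort(key=lambda sp: -sp[0])
        beams = expanded[:beam]
        finished.extend(beams)            # every partial phrase is a candidate
    finished.sort(key=lambda sp: -sp[0])
    phrases = [" ".join(seq) for _, seq in finished]
    return drop_prefixes(phrases)[:k]
```

The problem is that nothing here pins down `alpha`: too small and one-word completions always win, too large and the model rambles to `max_len`, which is why our tuned values don't transfer beyond our test set.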
