Searching a normal query in an inverted index

Question

I have a full inverted index in form of nested python dictionary. Its structure is :

{word : { doc_name : [location_list] } }

For example let the dictionary be called index, then for a word " spam ", entry would look like :

{ spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } }

so that, the documents containing any word can be given by index[word].keys() , and frequency in that document by len(index[word][document])

Now my question is, how do I implement a normal query search in this index. i.e. given a query containing lets say 4 words, find documents containing all four matches (ranked by total frequency of occurrence ), then docs containing 3 matches and so on ....

**

Added this code, using S. Lott's answer. This is the code I have written. Its working exactly as I want, ( just some formatting of output is needed ) but I know it could be improved.

**

from collections import defaultdict
from operator import itemgetter

# Take input

query = input(" Enter the query : ")

# Some preprocessing

query = query.lower()
query = query.strip()

# now real work

wordlist = query.split()
search_words = [ x for x in wordlist if x in index ]    # list of words that are present in index.

print "\nsearching for words ... : ", search_words, "\n"

doc_has_word = [ (index[word].keys(),word) for word in search_words ]
doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)

# create a dictionary identifying matches for each document    

result_set = {}

for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])     # number of matches
    for w in doc_words[i]:
        count += len(index[w][i])   # count total occurances
    result_set[i] = (matches,count)

# Now print in sorted order

print "   Document \t\t Words matched \t\t Total Frequency "
print '-'*40
for doc, (matches, count)) in sorted(result_set.items(), key = itemgetter(1), reverse = True):
    print doc, "\t",doc_words[doc],"\t",count

Pls comment .... Thanx.

"normal query search"? You might want to drop "normal" from that. It's just a query. What code have you written so far? What problemsa are you having with the code you've written. This is not `do_my_homework_for_me.com`. Please post your code and the detailed questions you have about **your** code. — S.Lott, Oct 15 '10 at 18:47
@ S. Lott : Sorry if the question seemed like a do_my_homework_for_me.com type question. I've realized my mistake and have added some details of where I had the problem. — easysid, Oct 16 '10 at 10:05
I have setteled for my code, given in the question. Marking S. Lott's answer as accepted. — easysid, Oct 18 '10 at 14:26

S.Lott · Accepted Answer · 2010-10-18T15:28:04.593

3

Here's a start:

doc_has_word = [ (index[word].keys(),word) for word in wordlist ]

This will build an list of (word,document) pairs. You can't easily make a dictionary out of that, since each document occurs many times.

But

from collections import defaultdict
doc_words = defaultdict(list)
for d, w in doc_has_word:
    doc_words[tuple(d.items())].append(w)

Might be helpful.

edited Oct 18 '10 at 15:28

answered Oct 15 '10 at 19:01

S.Lott

384,516
81
508
779

doc_words[d].append(w) ...... d is a list hence unhashable. Thanx for the code. it was of great help. I have edited the question and posted my code that I wrote using yours. – easysid Oct 16 '10 at 11:09
For the final code, see the question. The code there is written using this code. Marking this as accepted – easysid Oct 18 '10 at 14:27

score 0 · Answer 2 · answered Oct 15 '10 at 19:54

import itertools

index = {...}

def query(*args):
    result = []

    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index[word]]
    doc_group = itertools.groupby(doc_count, key=lambda doc: doc[0])

    for doc, group in doc_group:
        result.append((doc, sum([elem[1] for elem in group])))

    return sorted(result, key=lambda x:x[1])[::-1]

score 0 · Answer 3 · edited May 23 '17 at 12:24

Here is a solution for finding the similar documents (the hardest part):

wordList = ['spam','eggs','toast'] # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])

wordMatches gets a list where each element is a dictionary of the document matches for one of the words being matched.

similarDocs is a set of the documents that contain all of the words being queried for. This is found by taking just the document names out of each of the dictionaries in the wordMatches list, representing these lists of document names as sets, and then intersecting the sets to find the common document names.

Once you have found the documents that are similar, you should be able to use a defaultdict (as shown in S. Lott's answer) to append all of the lists of matches together for each word and each document.

Searching a normal query in an inverted index

3 Answers3