NLTK: Find contexts of size 2k for a word

Question

I have a corpus and I have a word. For each occurrence of the word in the corpus I want to get a list containing the k words before and the k words after the word. My own implementation works (see below), but I wondered whether NLTK already provides this functionality and I missed it.

def sized_context(word_index, window_radius, corpus):
    """ Returns a list containing up to window_radius words to the left
    and to the right of word_index, not including the word at word_index.
    The window is clipped at the start and end of the corpus.
    """

    # Clip both borders so the window never runs past the corpus edges.
    left_border = max(0, word_index - window_radius)
    right_border = min(len(corpus), word_index + 1 + window_radius)

    return corpus[left_border:word_index] + corpus[word_index+1: right_border]
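
A minimal sketch of how such a helper might be applied to every occurrence of a word. The name contexts_for_word and the sample token list are hypothetical and only serve as illustration; sized_context is repeated in condensed form so the snippet runs on its own.

```python
def sized_context(word_index, window_radius, corpus):
    """Return up to window_radius tokens on each side of word_index."""
    left_border = max(0, word_index - window_radius)
    right_border = min(len(corpus), word_index + 1 + window_radius)
    return corpus[left_border:word_index] + corpus[word_index + 1:right_border]


def contexts_for_word(word, window_radius, corpus):
    # Hypothetical helper: one context list per occurrence of `word`.
    return [
        sized_context(i, window_radius, corpus)
        for i, token in enumerate(corpus)
        if token == word
    ]


# Illustrative token list, not taken from a real corpus.
tokens = "the cat sat on the mat near the door".split()
contexts = contexts_for_word("the", 2, tokens)
```

Occurrences near either end of the corpus simply get a shorter context, because the borders are clipped rather than padded.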

Justin O Barber · Answer 1 · 2014-03-01T23:52:42.983

If you want to use NLTK's own functionality, you can use its ConcordanceIndex. To base the width of the display on the number of words instead of the number of characters (the latter being the default for ConcordanceIndex.print_concordance), you can create a subclass of ConcordanceIndex with something like this:

from nltk import ConcordanceIndex

class ConcordanceIndex2(ConcordanceIndex):
    def create_concordance(self, word, token_width=13):
        "Returns a list of contexts for @word with a context <= @token_width"
        half_width = token_width // 2
        contexts = []
        for i, token in enumerate(self._tokens):
            if token == word:
                start = i - half_width if i >= half_width else 0
                context = self._tokens[start:i + half_width + 1]
                contexts.append(context)
        return contexts

Then you can obtain your results like this:

>>> from nltk.tokenize import wordpunct_tokenize
>>> my_corpus = 'The gerenuk fled frantically across the vast valley, whereas the giraffe merely turned indignantly and clumsily loped away from the valley into the nearby ravine.'  # my corpus
>>> tokens = wordpunct_tokenize(my_corpus)
>>> c = ConcordanceIndex2(tokens)
>>> c.create_concordance('valley')  # returns a list of lists, since words may occur more than once in a corpus
[['gerenuk', 'fled', 'frantically', 'across', 'the', 'vast', 'valley', ',', 'whereas', 'the', 'giraffe', 'merely', 'turned'], ['and', 'clumsily', 'loped', 'away', 'from', 'the', 'valley', 'into', 'the', 'nearby', 'ravine', '.']]

The create_concordance method I created above is based on NLTK's ConcordanceIndex.print_concordance method, which works like this:

>>> c = ConcordanceIndex(tokens)
>>> c.print_concordance('valley')
Displaying 2 of 2 matches:
                                  valley , whereas the giraffe merely turn
 and clumsily loped away from the valley into the nearby ravine .
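
The subclass above scans every token on each call. If the same corpus is queried repeatedly, the positions of each token can be computed once and looked up afterwards; ConcordanceIndex exposes a comparable lookup through its offsets(word) method. The following is a dependency-free sketch of that idea, not NLTK's implementation. The names build_offsets and indexed_contexts and the sample sentence are hypothetical.

```python
from collections import defaultdict


def build_offsets(tokens):
    # Map each token to the list of positions where it occurs.
    offsets = defaultdict(list)
    for position, token in enumerate(tokens):
        offsets[token].append(position)
    return offsets


def indexed_contexts(word, k, tokens, offsets):
    # Hypothetical helper: up to k tokens on each side of every
    # occurrence of `word`, clipped at the corpus edges.
    return [
        tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        for i in offsets.get(word, [])
    ]


# Illustrative token list, not taken from a real corpus.
tokens = "the valley was vast and the valley was green".split()
offsets = build_offsets(tokens)
valley_contexts = indexed_contexts("valley", 2, tokens, offsets)
```

Building the table costs one pass over the corpus; each later query then only touches the positions where the word actually occurs.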

Thanks, that looks more NLTK-esque, but the logic is still handcrafted. I was hoping for something implemented, tested and, most importantly, optimized within the scope of the framework. - Zakum, Mar 04 '14 at 15:40

score 3 · Accepted Answer · answered Jun 08 '15 at 16:42

The simplest, nltk-ish way to do this is with nltk.ngrams().

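import nltk

words = nltk.corpus.brown.words()
k = 5
# Recent NLTK releases take left_pad_symbol/right_pad_symbol;
# older releases used a single pad_symbol argument instead.
for ngram in nltk.ngrams(words, 2*k+1, pad_left=True, pad_right=True,
                         left_pad_symbol=" ", right_pad_symbol=" "):
    if ngram[k].lower() == "settle":  # index k is the center of a 2k+1 window
        print(" ".join(ngram))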

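pad_left and pad_right ensure that all words get looked at: without padding, a word within k positions of either end of the sequence would never land in the center of a window. This is important if you don't let your concordances span across sentences (hence: lots of boundary cases).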
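
To make the effect of padding concrete, here is a dependency-free sketch of the same sliding-window idea. The function padded_windows and the three-token input are hypothetical; the sketch pads exactly k symbols per side, which is enough to center every token, whereas NLTK's own padding may add more symbols than that.

```python
from collections import deque


def padded_windows(tokens, k, pad=" "):
    # Yield (center, window) pairs, where each window has 2*k + 1 items
    # and the center token sits at index k.
    padded = [pad] * k + list(tokens) + [pad] * k
    window = deque(maxlen=2 * k + 1)
    for token in padded:
        window.append(token)
        if len(window) == 2 * k + 1:
            yield window[k], tuple(window)


# Illustrative input: shorter than a full window, so every token is a boundary case.
tokens = ["valleys", "are", "wide"]
windows = list(padded_windows(tokens, 2))
```

Every token shows up exactly once as a center, including the first and last ones, whose windows are filled out with the pad symbol.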

If you want to ignore punctuation in the window size, you can strip it before scanning:

import re

# Keep only tokens that contain at least one word character.
words = (w for w in nltk.corpus.brown.words() if re.search(r"\w", w))
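
A small sketch of what that filter keeps and drops, using a hand-made token list (hypothetical, not Brown corpus data):

```python
import re

# Illustrative tokens mixing words and punctuation.
tokens = ["the", "vast", "valley", ",", "whereas", "U.S.", "--", "ravine", "."]

# A token survives if it contains at least one word character,
# so "U.S." stays while ",", "--" and "." are dropped.
words_only = [w for w in tokens if re.search(r"\w", w)]
```

Tokens that mix letters and punctuation are kept whole; only pure punctuation tokens stop counting toward the window size.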
