Find dictionary items whose key matches a substring

Question

I have a large dictionary constructed like so:

programs['New York'] = 'some values...' 
programs['Port Authority of New York'] = 'some values...' 
programs['New York City'] = 'some values...'
...

How can I return all elements of programs whose key mentions "new york" (case insensitive)? In the example above, I would want to get all three items.

EDIT: The dictionary is quite large and expected to get larger over time.

score 110 · Accepted Answer · answered May 07 '12 at 14:58

[value for key, value in programs.items() if 'new york' in key.lower()]
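
For reference, a self-contained run of that comprehension on the question's sample keys could look like the sketch below. The values are placeholders and the name `needle` is only illustrative; the second comprehension is a small variation that keeps the matching keys as well as the values.

```python
# Sample data modelled on the question; the values are placeholders.
programs = {
    'New York': 'value A',
    'Port Authority of New York': 'value B',
    'New York City': 'value C',
}

needle = 'new york'  # search term, already lowercased

# Values whose key contains the needle, ignoring case.
values = [value for key, value in programs.items() if needle in key.lower()]

# Same linear scan, but keeping the keys alongside the values.
matches = {key: value for key, value in programs.items() if needle in key.lower()}

print(values)
print(sorted(matches))
```

Both comprehensions visit every key once, so the cost grows linearly with the size of the dictionary.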

mensi


Exactly. Just don't expect it to be fast if your dictionary is large. – Mark Ransom May 07 '12 at 15:07
@MarkRansom I was just about to add that my dictionary is quite large and expected to get larger. It has been doing `programs.get('new york')` up to now which has been very fast. – Abid A May 07 '12 at 15:08
If going through all keys in the dictionary is too slow for your application, you need to build a datastructure targeted at this kind of query. That would probably be either some sort of word-based inverted index or a suffix tree. – mensi May 07 '12 at 15:42
@mensi. Thanks. I'm making the change now to see how it performs. I'll look into other data structures as well. – Abid A May 07 '12 at 15:45
This answer can also be wrapped with an any condition to get a boolean for whether any key in the dict contains a provided substring/condition: `any([x for x in programs if 'new york' in x.lower()])` – bsplosion Nov 30 '18 at 14:20

score 11 · Answer 2 · answered May 19 '14 at 11:15

This is usually called a relaxed dictionary and it can be implemented efficiently using a suffix tree.

The memory used by this approach is linear in the total length of the keys, which is optimal, and the search time is linear in the length of the substring you are searching for (plus the number of matches reported), which is also optimal.

I have found this Python library that implements it:

https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/
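
As a standard-library stand-in for the idea, a sorted list of suffixes (a naive suffix array, not a true suffix tree) supports the same kind of substring lookup with a binary search. This is only a sketch and not the linked library's API: the names `build_suffix_index` and `find_keys` are hypothetical, and storing every suffix as its own string costs memory quadratic in the key length, so it does not have the linear memory bound described above.

```python
import bisect


def build_suffix_index(keys):
    """Return a sorted list of (lowercased suffix, original key) pairs."""
    entries = []
    for key in keys:
        lowered = key.lower()
        for start in range(len(lowered)):
            entries.append((lowered[start:], key))
    entries.sort()
    return entries


def find_keys(entries, substring):
    """Return the set of keys that contain substring, ignoring case."""
    needle = substring.lower()
    if not needle:
        return {key for _, key in entries}
    # The 1-tuple (needle,) sorts before any (needle, key) pair, so this
    # finds the first entry whose suffix is >= needle.
    pos = bisect.bisect_left(entries, (needle,))
    found = set()
    # Suffixes that start with needle are contiguous in sorted order.
    while pos < len(entries) and entries[pos][0].startswith(needle):
        found.add(entries[pos][1])
        pos += 1
    return found


keys = ['New York', 'Port Authority of New York', 'New York City']
entries = build_suffix_index(keys)
print(find_keys(entries, 'new york'))
```

A key contains the search term exactly when one of its suffixes starts with it, which is why a prefix scan over the sorted suffixes is enough.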

jordi


It says, page not found. – amk May 17 '17 at 17:53
I guess the linked page is now here https://www.hashcollision.org/hkn/python/suffix_trees/ but the code has not been maintained. There's a link to a fork but that is abandoned as well. – Janne Karila Mar 01 '18 at 06:51

score 6 · Answer 3 · edited May 23 '17 at 12:17

You should use the brute force method given by mensi until it proves to be too slow.

Here's something that duplicates the data to give a speedier lookup. It only works if your search is for whole words only - i.e. you'll never need to match on "New Yorks Best Bagels" because "york" and "yorks" are different words.

words = {}
for key in programs.keys():
    for w in key.split():
        w = w.lower()
        if w not in words:
            words[w] = set()
        words[w].add(key)


def lookup(search_string, words, programs):
    result_keys = None
    for w in search_string.split():
        w = w.lower()
        if w not in words:
            return []
        result_keys = words[w] if result_keys is None else result_keys.intersection(words[w])
    # An empty or whitespace-only search string matches nothing.
    if result_keys is None:
        return []
    return [programs[k] for k in result_keys]

If the words have to be in sequence (i.e. "York New" shouldn't match) you can apply the brute-force method to the short list of result_keys.
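
One way that two-stage idea might look is sketched below: the word index narrows the candidates, then a plain substring test on the survivors enforces word order. The function name `lookup_phrase` and the sample values are hypothetical, and whitespace is normalized on both sides as an extra assumption.

```python
from collections import defaultdict

# Sample data modelled on the question; the values are placeholders.
programs = {
    'New York': 'value A',
    'Port Authority of New York': 'value B',
    'New York City': 'value C',
}

# Inverted index: lowercased word -> set of keys containing that word.
words = defaultdict(set)
for key in programs:
    for w in key.lower().split():
        words[w].add(key)


def lookup_phrase(phrase, words, programs):
    """Return values whose key contains the words of phrase, in order."""
    terms = phrase.lower().split()
    candidates = None
    for w in terms:
        if w not in words:
            return []
        candidates = words[w] if candidates is None else candidates & words[w]
    if candidates is None:  # empty search string
        return []
    normalized = ' '.join(terms)
    # Brute-force check, but only over the short candidate list.
    return [programs[k] for k in candidates
            if normalized in ' '.join(k.lower().split())]


print(sorted(lookup_phrase('New York', words, programs)))
print(lookup_phrase('York New', words, programs))
```

The final substring test still scans text, but only over keys that already contain every search word.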

score 5 · Answer 4 · answered May 07 '12 at 15:03

In Python 2, iteritems and a generator expression will do this (in Python 3, use d.items() instead and call print as a function):

d={'New York':'some values',
    'Port Authority of New York':'some more values',
    'New York City':'lots more values'}

print list(v for k,v in d.iteritems() if 'new york' in k.lower())

Output (Python 2 dicts are unordered, so the order may differ):

['lots more values', 'some more values', 'some values']

the wolf


`iteritems()` is no more https://stackoverflow.com/a/10458567/4539999 – flywire Jun 27 '23 at 12:04

score 5 · Answer 5 · answered May 07 '12 at 15:50

You could generate all substrings ahead of time, and map them to their respective keys.

#generates all substrings of s.
def genSubstrings(s):
    #yield all substrings that contain the first character of the string
    for i in range(1, len(s)+1):
        yield s[:i]
    #yield all substrings that don't contain the first character
    if len(s) > 1:
        for j in genSubstrings(s[1:]):
            yield j

keys = ["New York", "Port Authority of New York", "New York City"]
substrings = {}
for key in keys:
    for substring in genSubstrings(key):
        if substring not in substrings:
            substrings[substring] = []
        substrings[substring].append(key)

Then you can query substrings to get the keys that contain that substring:

>>> substrings["New York"]
['New York', 'Port Authority of New York', 'New York City']
>>> substrings["of New York"]
['Port Authority of New York']

Pros:

getting keys by substring is as fast as accessing a dictionary.

Cons:

Generating substrings incurs a one-time cost at the beginning of your program, taking time proportional to the number of keys in programs.
substrings will grow approximately linearly with the number of keys in programs, increasing the memory usage of your script.
genSubstrings has O(n^2) performance in relation to the size of your key. For example, "Port Authority of New York" generates 351 substrings.
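
Since the question asks for a case-insensitive match, a variant of the same idea could lowercase the keys before indexing and store sets, so a key is recorded only once per substring even when the substring occurs in it several times. This is a sketch under those assumptions; `gen_substrings` and `build_substring_index` are hypothetical names, not part of the code above.

```python
from collections import defaultdict


def gen_substrings(s):
    """Yield every non-empty contiguous substring of s."""
    for start in range(len(s)):
        for end in range(start + 1, len(s) + 1):
            yield s[start:end]


def build_substring_index(keys):
    """Map each lowercased substring to the set of original keys containing it."""
    index = defaultdict(set)
    for key in keys:
        for sub in gen_substrings(key.lower()):
            index[sub].add(key)
    return index


keys = ['New York', 'Port Authority of New York', 'New York City']
index = build_substring_index(keys)

# Lowercase the query too; .get avoids creating empty entries on a miss.
matches = index.get('New York'.lower(), set())
print(sorted(matches))

# A 26-character key yields 26 * 27 / 2 substrings.
print(sum(1 for _ in gen_substrings('Port Authority of New York')))
```

The lookup itself stays a single dictionary access; the memory cost described in the cons still applies.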

Thanks for the suggestion. I was thinking of this when mensi above mentioned an inverted index. At this point in the project, I will have to choose performance over memory usage. So I'll test this out as well. — Abid A, May 07 '12 at 16:06
