In vs regular expressions for checking against a word blacklist -- estimating expected production performance

Question

I have many HTML pages in which I need to check the existence of blacklisted words. I know that the built-in in is much faster then regular expressions but here I'm trying to compare many in against a single regular expression.

Since

re.match() checks for a match only at the beginning of the string

I used a regex similar to .*(word|word...) and I replaced newline characters with a space.

Code

from timeit import timeit
import re
from urllib2 import urlopen

html = urlopen('http://en.wikipedia.org/wiki/Main_Page').read()

# Random reversed strings to avoid unwanted match + one secure match
words = [
    "zihw","elbadartnu", "retlob", "ssenenif", "nnub", "detartsehcro",
    "elbappirnu", "banehc", "rebmunbus", "gnizilodi", "noituac", "deludehcsnu",
    "/body", "latnanosnocerp", "cihportomeh"
]


def in_test(html, blacklist):
    html_lower = html.lower()
    return any(k in html_lower for k in blacklist):


def search_test(html, pattern):
    if re.search(pattern, html):
        return True
    return False


def match_test(html, pattern):
    html_line = html.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
    if re.match(pattern, html_line):
        return True
    return False


# patternX is word|word|word... patternX_exc is .*(word|word|...)
pattern5 = re.compile("|".join(words[:5]), re.I)
pattern5_exc = re.compile(".*(" + "|".join(words[:5]) + ")", re.I)

pattern10 = re.compile("|".join(words[:10]), re.I)
pattern10_exc = re.compile(".*(" + "|".join(words[:10]) + ")", re.I)

pattern15a = re.compile("|".join(words[:15]), re.I)
pattern15a_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)

words[12] = "doctype"  # A secure match at the beginning of the page
pattern15b = re.compile("|".join(words[:15]), re.I)
pattern15b_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)

words[12] = "featured list"  # A secure match at ~half page
pattern15c = re.compile("|".join(words[:15]), re.I)
pattern15c_exc = re.compile(".*(" + "|".join(words[:15]) + ")", re.I)

`in` vs `re.match` vs `re.search` with no match

print timeit("in_test(html, words[:5])", "from __main__ import *")
print timeit("search_test(html, pattern5)", "from __main__ import *")
print timeit("match_test(html, pattern5_exc)", "from __main__ import *")

0.127397060394
2.05020999908
2.17416286469


print timeit("in_test(html, words[:10])", "from __main__ import *")
print timeit("search_test(html, pattern10)", "from __main__ import *")
print timeit("match_test(html, pattern10_exc)", "from __main__ import *")

0.210324048996
3.73544692993
3.8765540123

These tests don't match any words. in is clearly the winner and the speed seems to increase linearly with the number of words.

`in` vs `re.match` vs `re.search` with match in different position

print timeit("in_test(html, words[:15])", "from __main__ import *")

# Match at the end
print timeit("search_test(html, pattern15a)", "from __main__ import *")
print timeit("match_test(html, pattern15a_exc)", "from __main__ import *")

# Match at the beginning
print timeit("search_test(html, pattern15b)", "from __main__ import *")
print timeit("match_test(html, pattern15b_exc)", "from __main__ import *")

# Match at ~half page
print timeit("search_test(html, pattern15c)", "from __main__ import *")
print timeit("match_test(html, pattern15c_exc)", "from __main__ import *")

The output is

0.258332967758

5.9074420929
0.0433299541473

0.000770807266235
6.0548210144

2.47815990448
3.25421690941

When a match occurs, the regex can be much faster than in but it depends on the position of the match. At the beginning, re.search will be better, at the end re.match is the better choice, at ~half page both will be significantly slower than in.

Regular expressions will help me not to duplicate words (eg. è, è, ...), and let me forget about upper/lowercase (especially with non ascii characters). But the speed seems to be too much variable and, on average, slower than in.

Are these tests correct? If so, are there any other built-in method that I can test or other procedures that will help me in this scenario? The blacklist is going to grow, so I need to take this into account.

Suggestion for an "in" solution: put the blacklist in a dict and check for each word of the text if the word is in the dict. — user189, Jul 07 '14 at 13:35
@user189 since a HTML page can have many words, reversing the loop wouldn't be slower? or the dict is the key here? — , Jul 07 '14 at 13:45
Not sure, you can try both. Not everybody wants a really big dict…um I guess… — user189, Jul 07 '14 at 13:50
@ivan_pozdeev scroll down the first box (the code one) for the patterns value — , Jul 07 '14 at 13:52
@Luca You ought to enclose the lists in patterns in `\b`'s , that'd improve the performance greatly. Unless you want to match parts of words too, that is. — ivan_pozdeev, Jul 07 '14 at 14:05

score 3 · Answer 1 · edited May 23 '17 at 12:14

The problem in general

It has a Time-space tradeoff:

The fastest possible (and most memory-demanding) solution is an N-nary tree (where N is the number of letters in alphabet). Each node has N pointers, each of those is non-null if there are words in the list with that letter as the next, and a flag which is set is there's a word that ends here.
Another fast implementation with much smaller footprint is the one behind T9 lookup.
a hash table (set in this case since you're only interested if a key is present) has larger overhead (hash calculations, conflict-related operations) but scales extremely well since it has almost constant lookup time in a typical case. Python's mapping types implementation automatically adjusts hash table's size to keep the potentially unlimited conflict-related overhead in check.
A regex (preferably optimized by minimizing the amount of backtracking to make) has negligible footprint but is slow since python uses a regex-directed engine that walks the text many times: it's text-directed one like the one of egrep that's more suited for the task. Others factor is its working time highly depends on the input (sometimes, catastrophically) and it doesn't scale well as the wordlist grows.
comparing against a list of words is essentially a primitive kind of text-directed regex engine. It doesn't do backtracking but has larger comparison and list-walking overhead. It may be faster or slower than a regex depending of how these overheads compare.

The specific question about comparing the two methods:

The tests are interpreted correctly - for the material they're performed on. But, as I said, both these methods' performance (hence, relative performance) depends dramatically on the wordlist size, input size and input itself and regex optimality.

Suggested course of action

So you ought to make tests on some realistic examples that model the typical use case. I.e.

optimize the regex if and in the same way you intend to in production
take the average on several documents, where
- the percentage of those with matches
- the distribution of match locations
- the relative occurence rate of wordlist words

is the same as expected to be in production.

I'd suggest to test the hash-table as well: it has larger initial overhead but on a large input and/or wordlist it should start outperforming the other two.

To avoid duplicating words, you may want to try out the methods with sanitizing input (lowercasing, &-seq replacements) prior to searching. Again, this is additional overhead that starts to pay off after a certain scale.

Minimizing the test data by using math

If match locations are presumed to be uniformly distributed and wordlist words to have equal occurence rate, the test data can be simplified to:

a text without match and without many words that start like wordlist words (the "best typical" input)
a text without match but entirely of words that start the same way as wordlist words, "heads" distributed roughly uniformly across the wordlist (the "worst" input for both methods - to see 1) if there can be catastrophic failures in production; 2) how much will such cases skew the end result)
a text with a match half-way, where locating the word in the wordlist requires roughly half the work from both text and regex matcher and without many other words that start like wordlist words

Then, the final "expected average" time:

Txa = (Tn+(Ts-Tn)*Ps)*Pn + (Tm+((Ts-Tm)*Ps)/2)*Pm

where T - times, P - expected probabilities; n - of input without match, s - (slow) of words that start like wordlist ones, m - of input with match.

In vs regular expressions for checking against a word blacklist -- estimating expected production performance

Code

`in` vs `re.match` vs `re.search` with no match

`in` vs `re.match` vs `re.search` with match in different position

1 Answers1

The problem in general

The specific question about comparing the two methods:

Suggested course of action

Minimizing the test data by using math

Linked

In vs regular expressions for checking against a word blacklist -- estimating expected production performance

Code

in vs re.match vs re.search with no match

in vs re.match vs re.search with match in different position

1 Answers1

The problem in general

The specific question about comparing the two methods:

Suggested course of action

Minimizing the test data by using math

Linked

`in` vs `re.match` vs `re.search` with no match

`in` vs `re.match` vs `re.search` with match in different position