I need to make this function run much faster (~20x faster) to meet a required benchmark. I have made quite a few improvements from my initial implementation, but am hitting a wall.
The basic problem is this: count case-insensitive occurrences of word in text.
Complicating criteria include:
- Must be a whole word (word"George" is not found intext"Georges")
- Single quotes are to be considered part of words, unless there are more than one in a row
- wordmay actually be a phrase (meaning it could have spaces, punctuation, etc.)
- Cannot use regex
My basic implementation has been to loop through each character in text, maintaining my position in word and if the character matches the corresponding character of word, I add it to a local string, advance my position in word and text, and go again. Once I have a match candidate (i.e. my local string is equal to word), I check the surrounding characters to ensure the match candidate is a full word, per rules 1 and 2 above. Note this checking does not occur frequently enough to materially impact the total time the algorithm takes.
Most successful optimizations I have made so far:
- Do string lowercasing and length measuring outside the loop
- Check that wordis at least a substring oftextotherwise return 0 immediately
- Don't bother checking for full word potential until we have a full match
- Count the number of occurrences up front (with no rules), and break out of the loop immediately if we meet that number
I've profiled the code line-by-line using pprofile, and the majority of my code's runtime is simple lines like incrementing the counter var, resetting the match_candidate string to "", indexing into strings, and if statements. I have not included the code for validate_full_match as it is not a significant time user.
Are there any low-hanging fruit I'm ignoring? A different approach entirely I should consider?
Thanks for any suggestions!
def count_occurences_in_text(word, text):
    """Number of occurences of word (case insensitive) in text
    Note that word can actually be any length of text, from a single
    character to a complete phrase; however, partial words do not
    count. For example:
    count_occurences_in_text("george", "I am Georges") returns 0
    while
    count_occurences_in_text("i am", "I am Georges") returns 1
    """
    # We perform some measurements and manipulation at the start to
    # avoid performing them repeatedly in the loop below
    text = text.lower()
    word = word.lower()
    max_matches = text.count(word)
    if max_matches == 0:
        return 0
    word_len = len(word)
    # Counter vars
    match_count = 0
    text_cursor = 0
    word_cursor = 0
    # We will build up match_candidate and check it against word
    match_candidate = ""
    for text_char in text:
        if text_char == word[word_cursor]:
            match_candidate += text_char
            if word == match_candidate:
                if validate_full_match(text, text_cursor, word_len):
                    match_count += 1
                    if match_count == max_matches:
                        break
                    word_cursor = 0
                    match_candidate = ""
            else:
                word_cursor += 1
        else:
            match_candidate = ""
            word_cursor = 0
        text_cursor += 1
    return match_count
 
    