Java indexOf function more efficient than Rabin-Karp? Search Efficiency of Text

Question

I posed a question to Stackoverflow a few weeks ago about a creating an efficient algorithm to search for a pattern in a large chunk of text. Right now I am using the String function indexOf to do the search. One suggestion was to use Rabin-Karp as an alternative. I wrote a little test program as follows to test an implementation of Rabin-Karp as follows.

public static void main(String[] args) {
    String test = "Mary had a little lamb whose fleece was white as snow";

    String p = "was";
     long start  = Calendar.getInstance().getTimeInMillis();
     for (int x = 0; x < 200000; x++)
         test.indexOf(p);
     long end = Calendar.getInstance().getTimeInMillis();
     end = end -start;
     System.out.println("Standard Java Time->"+end);

    RabinKarp searcher = new RabinKarp("was");
    start  = Calendar.getInstance().getTimeInMillis();
    for (int x = 0; x < 200000; x++)
    searcher.search(test);
    end = Calendar.getInstance().getTimeInMillis();
    end = end -start;
    System.out.println("Rabin Karp time->"+end);

}

And here is the implementation of Rabin-Karp that I am using:

import java.math.BigInteger;
import java.util.Random;

public class RabinKarp {
private String pat; // the pattern // needed only for Las Vegas
private long patHash; // pattern hash value
private int M; // pattern length
private long Q; // a large prime, small enough to avoid long overflow
private int R; // radix
private long RM; // R^(M-1) % Q
static private long dochash = -1L;

public RabinKarp(int R, char[] pattern) {
    throw new RuntimeException("Operation not supported yet");
}

public RabinKarp(String pat) {
    this.pat = pat; // save pattern (needed only for Las Vegas)
    R = 256;
    M = pat.length();
    Q = longRandomPrime();

    // precompute R^(M-1) % Q for use in removing leading digit
    RM = 1;
    for (int i = 1; i <= M - 1; i++)
        RM = (R * RM) % Q;
    patHash = hash(pat, M);
}

// Compute hash for key[0..M-1].
private long hash(String key, int M) {
    long h = 0;
    for (int j = 0; j < M; j++)
        h = (R * h + key.charAt(j)) % Q;
    return h;
}

// Las Vegas version: does pat[] match txt[i..i-M+1] ?
private boolean check(String txt, int i) {
    for (int j = 0; j < M; j++)
        if (pat.charAt(j) != txt.charAt(i + j))
            return false;
    return true;
}

// check for exact match
public int search(String txt) {
    int N = txt.length();
    if (N < M)
        return -1;
    long txtHash;
    if (dochash == -1L) {
        txtHash = hash(txt, M);
        dochash = txtHash;
    } else
        txtHash = dochash;

    // check for match at offset 0
    if ((patHash == txtHash) && check(txt, 0))
        return 0;

    // check for hash match; if hash match, check for exact match
    for (int i = M; i < N; i++) {
        // Remove leading digit, add trailing digit, check for match.
        txtHash = (txtHash + Q - RM * txt.charAt(i - M) % Q) % Q;
        txtHash = (txtHash * R + txt.charAt(i)) % Q;

        // match
        int offset = i - M + 1;
        if ((patHash == txtHash) && check(txt, offset))
            return offset;
    }

    // no match
    return -1; // was N
}

// a random 31-bit prime
private static long longRandomPrime() {
    BigInteger prime = new BigInteger(31, new Random());
    return prime.longValue();
}

// test client

}

The implementation of Rabin-Karp works in that it returns the correct offset of the string I am looking for. What is surprising to me though is the timing statistics that occurred when I ran the test program. Here they are:

Standard Java Time->39
Rabin Karp time->409

This was really surprising. Not only is Rabin-Karp (at least as it is implemented here) not faster than the standard java indexOf String function, it is slower by an order of magnitude. I don't know what is wrong (if anything). Any one have thoughts on this?

Thanks,

Elliott

btw, `BigInteger(31, new Random())` does not return prime (at least sometimes) — Victor Sorokin, Mar 16 '12 at 16:46
I didn't thoroughly look through the code nor do I know the Rabin-Karb algorithm but `in a large chunk of text` implies the algorithm is optimized for large texts and might be slower for smaller ones. Did you try with a substantially larger text as well? — Thomas, Mar 16 '12 at 16:46
have you tried it using a very large haystack? it's possible that its performance benefits only show up at larger scale. — Alexander Corwin, Mar 16 '12 at 16:46
It's also possible that `indexOf`, has an advantage because of being a native method (probably implemented directly into the VM) while your Rabin Karp methos doesn't run natively. And I'm just speculating. (sorry about that) — lmcanavals, Mar 16 '12 at 16:50
@MartínCanaval As far as I could see (looking at the Java 6 source) `indexOf` is not implemented natively. It's just plain old Java. — Thomas, Mar 16 '12 at 16:53
Another thing you should be careful of is smart JVMs. You could certainly make a VM that would know that String methods have no side effects, and thus optimize out the calls entirely (after all, it's not like you're even reading the result so why should the JVM waste time computing it?). Similarly you use the same haystack every single time, and a smart VM could also optimize that. Now I doubt either of these optimizations would be allowed by the VM spec but this just isn't an effective test for many reasons. — Mark Peters, Mar 16 '12 at 16:53
See also: http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java — Mark Peters, Mar 16 '12 at 16:56

score 21 · Accepted Answer · answered Mar 17 '12 at 00:24

I answered this question earlier and Elliot pointed out I was just plain wrong. I apologise to the community.

There is nothing magical about the String.indexOf code. It is not natively optimised or anything like that. You can copy the indexOf method from the String source code and it runs just as quickly.

What we have here is the difference between O() efficiency and actual efficiency. Rabin-Karp for a String of length N and a pattern of length M, Rabin-Karp is O(N+M) and a worst case of O(NM). When you look into it, String.indexOf() also has a best case of O(N+M) and a worst case of O(NM).

If the text contains many partial matches to the start of the pattern Rabin-Karp will stay close to its best-case performance, whilst String.indexOf will not. For example I tested the above code (properly this time :-)) on a million '0's followed by a single '1', and the searched for 1000 '0's followed by a single '1'. This forced the String.indexOf to its worst case performance. For this highly degenerate test, the Rabin-Karp algorithm was about 15 times faster than indexOf.

For natural language text, Rabin-Karp will remain close to best-case and indexOf will only deteriorate slightly. The deciding factor is therefore the complexity of operations performed on each step.

In it's innermost loop, indexOf scans for a matching first character. At each iteration is has to:

increment the loop counter
perform two logical tests
do one array access

In Rabin-Karp each iteration has to:

increment the loop counter
perform two logical tests
do two array accesses (actually two method invocations)
update a hash, which above requires 9 numerical operations

Therefore at each iteration Rabin-Karp will fall further and further behind. I tried simplifying the hash algorithm to just XOR characters, but I still had an extra array access and two extra numerical operations so it was still slower.

Furthermore, when a match is find, Rabin-Karp only knows the hashes match and must therefore test every character, whereas indexOf already knows the first character matches and therefore has one less test to do.

Having read on Wikipedia that Rabin-Karp is used to detect plagiarism, I took the Bible's Book of Ruth, removed all punctuation and made everything lower case which left just under 10000 characters. I then searched for "andthewomenherneighboursgaveitaname" which occurs near the very end of the text. String.indexOf was still faster, even with just the XOR hash. However, if I removed String.indexOfs advantage of being able to access String's private internal character array and forced it to copy the character array, then, finally, Rabin-Karp was genuinely faster.

However, I deliberately chose that text as there are 213 "and"s in the Book of Ruth and 28 "andthe"s. If instead I searched just for the last characters "ursgaveitaname", well there are only 3 "urs"s in the text so indexOf returns closer to its best-case and wins the race again.

As a fairer test I chose random 20 character strings from the 2nd half of the text and timed them. Rabin-Karp was about 20% slower than the indexOf algorithm run outside of the String class, and 70% slower than the actual indexOf algorithm. Thus even in the use case it is supposedly appropriate for, it was still not the best choice.

So what good is Rabin-Karp? No matter the length or nature of the text to be searched, at every character compared it will be slower. No matter what hash function we choose we are surely required to make an additional array access and at least two numerical operations. A more complex hash function will give us less false matches, but require more numerical operators. There is simply no way Rabin-Karp can ever keep up.

As demonstrated above, if we need to find a match prefixed by an often repeated block of text, indexOf can be slower, but if we know we are doing that it would look like we would still be better off using indexOf to search for the text without the prefix and then check to see if the prefix was present.

Based on my investigations today, I cannot see any time when the additional complexity of Rabin Karp will pay off.

Wow. Thanks for validating my results. When I saw your first post, I was thinking that I was doing something wrong. Thank you for your very thorough analysis of the problem. Elliott — Elliott, Mar 17 '12 at 01:03
Rabin-Karp will shine if you have multiple - or a lot - of patterns. Specifically, when the cost of the hash computation becomes less than the number of patterns. This was mentioned by Owen's answer, but I wanted to leave a comment on the accepted answer for other people arriving here from search results. — John Foley, Mar 14 '16 at 21:14

mindvirus · Answer 2 · 2012-03-16T17:50:27.497

6

Here is the source to java.lang.String. indexOf is line 1770.

My suspicion is since you are using it on such a short input string, the extra overhead of the Rabin-Karp algorithm over the seemly naive implementation of java.lang.String's indexOf, you aren't seeing the true performance of the algorithm. I would suggest trying it on a much longer input string to compare performance.

edited Mar 16 '12 at 17:50

answered Mar 16 '12 at 16:51

mindvirus

5,016
4
29
46

score 5 · Answer 3 · answered May 17 '12 at 08:55

From my understanding, Rabin Karp is best used when searching a block of text for mutiple words/phrases.

Think about a bad word search, for flagging abusive language.

If you have a list of 2000 words, including derivations, then you would need to call indexOf 2000 times, one for each word you are trying to find.

RabinKarp helps with this by doing the search the other way around. Make a 4 character hash of each of the 2000 words, and put that into a dictionary with a fast lookup.

Now, for each 4 characters of the search text, hash and check against the dictionary.

As you can see, the search is now the other way around - we're searching the 2000 words for a possible match instead. Then we get the string from the dictionary and do an equals to check to be sure.

It's also a faster search this way, because we're searching a dictionary instead of string matching.

Now, imagine the WORST case scenario of doing all those indexOf searches - the very LAST word we check is a match ...

The wikipedia article for RabinKarp even mentions is inferiority in the situation you describe. ;-) http://en.wikipedia.org/wiki/Rabin-Karp_algorithm

score 1 · Answer 4 · answered Mar 16 '12 at 16:51

But this is only natural to happen!
Your test input first of all is too trivial.

indexOf returns the index of was searching a small buffer (String's internal char array`) while the Rabin-Karp has to do preprocessing to setup its data to work which takes extra time.

To see a difference you would have to test in a really large text to find expressions.

Also please note that when using more sofisticated string search algorithm they can have "expensive" setup/preprocessing to provide the really fast search.
In your case you just search a was in a sentence. I any case you should always take the input into account

With larger strings of random numbers, my results show RK does even worse. See my comment above. — Elliott, Mar 16 '12 at 20:36

score 1 · Answer 5 · edited May 23 '17 at 12:32

1

Without looking into details, two reasons come to my mind:

you are very likely to outperform standard API implementations only for very special cases. I do not consider "Mary had a little lamb whose fleece was white as snow" to be such.
microbenchmarking is very difficult and can give quite misleading results. Here is an explanation, here a list of tools you could use

edited May 23 '17 at 12:32

Community

1
1

answered Mar 16 '12 at 16:55

DaveFar

7,078
4
50
90

score 0 · Answer 6 · answered Dec 09 '14 at 08:23

0

For this kind of search, Knuth-Morris-Pratt may perform better. In particular if the sub-string doesn't just repeat characters, then KMP should outperform indexOf(). Worst case (string of all the same characters) it will be the same.

answered Dec 09 '14 at 08:23

sofend

647
8
17

Kevin Welker · Answer 7 · 2012-03-16T17:16:43.947

0

Not only simply try a longer static string, but try generating random long strings and inserting the search target into a random location each time. Without randomizing it, you will see a fixed result for indexOf.

EDITED: Random is the wrong concept. Most text is not truly random. But you would need a lot of different long strings to be effective, and not just testing the same String multiple times. I am sure there are ways to extract "random" large Strings from an even larger text source, or something like that.

edited Mar 16 '12 at 17:16

answered Mar 16 '12 at 16:56

Kevin Welker

7,719
1
40
56

That's an even more meaningless metric - how often do *you* do substring-searches on completely random strings, in the real world? – BlueRaja - Danny Pflughoeft Mar 16 '12 at 17:06

Java indexOf function more efficient than Rabin-Karp? Search Efficiency of Text

7 Answers7

Linked