I want to search a large string for all the locations of a string.
4 Answers
The two other answers are correct but they are very slow and have O(N^2) complexity. But there is the Knuth-Morris-Pratt algorithm, which finds all substrings in O(N) complexity.
Edit:
Also, there is another algorithm: the so-called "Z-function" with O(N) complexity, but I couldn't find an English source for this algorithm (maybe because there is also another more famous one with same name - the Z-function of Rieman), so I will just put its code here and explain what it does.
void calc_z (string &s, vector<int> & z)
{
    int len = s.size();
    z.resize (len);
    int l = 0, r = 0;
    for (int i=1; i<len; ++i)
        if (z[i-l]+i <= r)
            z[i] = z[i-l];
        else
        {
            l = i;
            if (i > r) r = i;
            for (z[i] = r-i; r<len; ++r, ++z[i])
                if (s[r] != s[z[i]])
                    break;
            --r;
        }
}
int main()
{
    string main_string = "some string where we want to find substring or sub of string or just sub";
    string substring = "sub";
    string working_string = substring + main_string;
    vector<int> z;
    calc_z(working_string, z);
    //after this z[i] is maximal length of prefix of working_string
    //which is equal to string which starting from i-th position of
    //working_string. So the positions where z[i] >= substring.size()
    //are positions of substrings.
    for(int i = substring.size(); i < working_string.size(); ++i)
        if(z[i] >=substring.size())
            cout << i - substring.size() << endl; //to get position in main_string
}
 
    
    - 2,488
- 1
- 14
- 29
 
    
    - 10,810
- 14
- 61
- 111
- 
                    hmm, do you not think that the writers of `std::string::find` may have implemented KMP underneath? In fact, with the advent of the new SSE4.2 instructions, this type of optimization may already be available in the low-level functions (such as `strstr` in glibc) - and unless you use the instructions in your own code, highly unlikely that you'll beat the performance of those lower-level functions. End result, unless you can prove empirically it's really slow, don't re-invent... m2c – Nim Apr 28 '11 at 10:47
- 
                    4You can test yourself. Let's consider main_string = "a"(1000000 times)+"b"+"a"(1000000 times). and substring = "a"(999999 times). Using std::string::find and the code of @Kiril Kirov your program will work 2-3 days, but my one will return imediately. – Mihran Hovsepyan Apr 28 '11 at 11:10
- 
                    You may also want to consider using the [Boyer-Moore](http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) algorithm, it 's O(N) and performs better with a longer needle (substring). – Hasturkun Apr 28 '11 at 11:31
- 
                    4It's a bit nonsensical to describe string-finding algorithms as O(N) or O(N*N) when there are really two parameters that determine the runtime, N_haystack and N_needle. Pretty much any algorithm is O(N_haystack) when N_needle = 1. Most are O(N_haystack * N_needle) or better, and you may be able to assume N_needle <= C. – MSalters Apr 28 '11 at 11:56
Using std::string::find. You can do something like:
std::string::size_type start_pos = 0;
while( std::string::npos != 
          ( start_pos = mystring.find( my_sub_string, start_pos ) ) )
{
    // do something with start_pos or store it in a container
    ++start_pos;
}
EDIT: Doh! Thanks for the remark, Nawaz! Better?
 
    
    - 37,467
- 22
- 115
- 187
- 
                    
- 
                    ha, nice one :) Here's what happens when I'm still half asleep :D Nope, I don't need it. Just at the first time, I thought that without it, I could miss the last char, if the `my_sub_string` is just a letter, that is placed at the end of the `mystring`. Thanks again! – Kiril Kirov Apr 28 '11 at 08:55
- 
                    
I'll add for completeness, there is another approach that is possible with std::search, works like std::string::find, difference is that you work with iterators, something like:
std::string::iterator it(str.begin()), end(str.end());
std::string::iterator s_it(search_str.begin()), s_end(search_str.end());
it = std::search(it, end, s_it, s_end);
while(it != end)
{
  // do something with this position..
  // a tiny optimisation could be to buffer the result of the std::distance - heyho..
  it = std::search(std::advance(it, std::distance(s_it, s_end)), end, s_it, s_end);
}
I find that this sometimes outperforms std::string::find, esp. if you represent your string as a vector<char>.
 
    
    - 33,299
- 2
- 62
- 101
Simply use std::string::find() which returns the position at which the substring was found, or std::string::npos if none was found.
Here is the documentation.
An here is the example taken from this documentation:
// string::find
#include <iostream>
#include <string>
using namespace std;
int main ()
{
  string str ("There are two needles in this haystack with needles.");
  string str2 ("needle");
  size_t found;
  // different member versions of find in the same order as above:
  found=str.find(str2);
  if (found!=string::npos)
    cout << "first 'needle' found at: " << int(found) << endl;
  found=str.find("needles are small",found+1,6);
  if (found!=string::npos)
    cout << "second 'needle' found at: " << int(found) << endl;
  found=str.find("haystack");
  if (found!=string::npos)
    cout << "'haystack' also found at: " << int(found) << endl;
  found=str.find('.');
  if (found!=string::npos)
    cout << "Period found at: " << int(found) << endl;
  // let's replace the first needle:
  str.replace(str.find(str2),str2.length(),"preposition");
  cout << str << endl;
  return 0;
}
 
    
    - 53,676
- 39
- 161
- 238
- 
                    @sehe: There was one in the linked documentation (which I now have recopied). A bit too late apparently ;) – ereOn Apr 28 '11 at 08:40
 
    