C++ substr, optimizing speed

Question

I started learning C++ a few days ago. I'm writing a simple XHTML parser that does not handle nested tags. For testing I have been using the following data: http://pastebin.com/bbhJHBdQ (around 10k chars). I only need to parse the data between p, h2 and h3 tags. My goal is to parse each tag and its content into the following structure:

struct Node {
    short tag; // p = 1, h2 = 2, h3 = 3
    std::string data;
};

For example, <p> asdasd </p> will be parsed to tag = 1, data = "asdasd". I don't want to use third-party libraries, and I'm trying to optimize for speed.

Here is my code:

short tagDetect(char * ptr){
    if (*ptr == '/') {
        return 0;
    }

    if (*ptr == 'p') {
        return 1;
    }

    if (*(ptr + 1) == '2')
        return 2;

    if (*(ptr + 1) == '3')
        return 3;

    return -1;
}


struct Node {
    short tag;
    std::string data;

    Node(std::string input, short tagId) {
        tag = tagId;
        data = input;
    }
};

int _tmain(int argc, _TCHAR* argv[])
{
    std::string input = GetData(); // returns the pastebin content above
    std::vector<Node> elems;

    std::string::size_type pos = 0;
    char pattern = '<';

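    int openPos = 0;           // initialized so a leading closing tag cannot read garbage
    short tagID, lastTag = -1; // -1 matches no tag until an opening tag is seen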

    double  duration;
    clock_t start = clock();

    for (int i = 0; i < 20000; i++) {
        elems.clear();

        pos = 0;
        while ((pos = input.find(pattern, pos)) != std::string::npos) {
            pos++;
            tagID = tagDetect(&input[pos]);
            switch (tagID) {
            case 0:
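                // Compare the closing tag with the last opening tag; the original also assigned the comparison result to tagID, which was never read.
                if (tagDetect(&input[pos + 1]) == lastTag && pos - openPos > 10) {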
                    elems.push_back(Node(input.substr(openPos + (lastTag > 1 ? 3 : 2), pos - openPos - (lastTag > 1 ? 3 : 2) - 1), lastTag));
                }

                break;
            case 1:
            case 2:
            case 3:
                openPos = pos;
                lastTag = tagID;
                break;
            }
        }

    }

    duration = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%2.1f seconds\n", duration);
}

The parsing runs inside a loop of 20,000 iterations so that I can measure its performance. My data contains about 10k chars.

I have noticed that the biggest bottleneck in my code is the substr call. As presented above, the code finishes executing in 5.8 sec. If I reduce the substr length to 10, the execution time drops to 0.4 sec. If I replace the whole substr call with "", my code finishes in 0.1 sec.

My questions are:

How can I optimize the substr call, since it is the main bottleneck in my code?
Are there any other optimizations I can make to my code?

I'm not sure if this question is appropriate for SO, but I'm pretty new to C++ and I have no idea who else to ask whether my code is complete crap.

Full source code can be found here: http://pastebin.com/dhR5afuE

Just [*random pause*](http://stackoverflow.com/a/378024/23771) it, and you will see why `substr` is taking so much time. My guess is it is constructing a new string to return, and that means calling `new`, which is costly. Then it's constructing a `Node` instance, and then it's doing `push_back`, which can also do more memory reallocating, unless you reserve it big enough to start with. If that's what you see, figure out how to do less memory-bonking. — Mike Dunlavey, Apr 17 '14 at 01:06
Try changing the `Node` constructor to: `Node(const std::string& input, short tagId) : tag(tagId), data(input) {}` — rici, Apr 17 '14 at 03:30
@rici, changing the constructor improved my code speed with about 1/3 ! Thanks! Do you have any other suggestions? I will really appreciate it. — Deepsy, Apr 17 '14 at 14:18
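
The two comments above (fewer allocations, and initializing members directly) can be sketched roughly as follows. This is only an illustration: `makeNodeBuffer` and `appendNode` are hypothetical helper names that do not appear in the question, and the constructor takes its string by value and moves it, which is a variation on the `const std::string&` version rici proposed rather than the same thing.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Node {
    short tag;
    std::string data;

    // Members are initialized directly in the initializer list; the
    // by-value parameter is moved into place, so a temporary returned by
    // substr() is not copied a second time.
    Node(std::string input, short tagId) : tag(tagId), data(std::move(input)) {}
};

// Hypothetical helper: reserve room up front so push_back/emplace_back
// does not have to grow the vector while parsing.
std::vector<Node> makeNodeBuffer(std::size_t expectedCount) {
    std::vector<Node> elems;
    elems.reserve(expectedCount);
    return elems;
}

// Hypothetical helper: construct the Node in place from a slice of source.
void appendNode(std::vector<Node>& elems, const std::string& source,
                std::size_t offset, std::size_t length, short tag) {
    assert(offset <= source.size()); // substr() would throw otherwise
    elems.emplace_back(source.substr(offset, length), tag);
}
```

Each stored node still costs one string allocation for anything longer than the small-string buffer, so this reduces the overhead around `substr` rather than removing it.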

score 3 · Answer 1 · answered Apr 17 '14 at 00:19

Instead of storing substrings, you could store data which refers to sections in the original string (either via pointers, iterators or integer indexes). You just have to be careful that the original string stays intact for as long as the reference data is used. Take a hint from boost::string_ref even if you're unwilling to use it directly.
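
A minimal sketch of that idea using integer indexes, assuming the input string outlives the references. `NodeRef` and `materialize` are hypothetical names chosen for illustration. The parser would record only offsets and lengths, and the actual copies would happen in one pass at the end, for example just before the data is handed off.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical reference type: points into the original string by index,
// so recording a match allocates nothing.
struct NodeRef {
    short tag;           // p = 1, h2 = 2, h3 = 3
    std::size_t offset;  // start of the content in the source string
    std::size_t length;  // number of characters of content
};

struct Node {
    short tag;
    std::string data;
};

// Hypothetical helper: copy the referenced slices out in a single pass.
// The source string must be unchanged since the refs were recorded.
std::vector<Node> materialize(const std::string& source,
                              const std::vector<NodeRef>& refs) {
    std::vector<Node> out;
    out.reserve(refs.size());
    for (const NodeRef& ref : refs) {
        assert(ref.offset + ref.length <= source.size());
        out.push_back(Node{ref.tag, source.substr(ref.offset, ref.length)});
    }
    return out;
}
```

If the consumer can accept offsets and lengths directly, the `materialize` step could be skipped altogether.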

answered Apr 17 '14 at 00:19

Benjamin Lindley

The thing is that I need to export the data to `NodeJS` so as long as my parser is done I need to send the data. – Deepsy Apr 17 '14 at 00:37
Deepsy, store the references as he says, and then do a single pass over the elements you find at the end to perform the actual string copy of the segments. – Catalyst May 09 '14 at 20:24

user207421 · Answer 2 · 2014-04-17T00:23:02.820

2

There are better substring search algorithms than a plain linear search, which is O(MxN). Look up the Boyer-Moore and Knuth-Morris-Pratt algorithms. I tested these years ago and B-M won.

You could also consider using a regular expression, which is more expensive to set up but could be more efficient in the actual search than your linear search.
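
For reference, the regular-expression variant might look something like the sketch below, using `std::regex` from the standard library (C++11). `parseWithRegex` is a hypothetical name, the pattern assumes tags without attributes and without nested markup, and whether this is actually faster than the hand-written loop would have to be measured on the real data.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

struct Node {
    short tag;
    std::string data;
};

// Hypothetical helper: collect the contents of <p>, <h2> and <h3> elements.
// The pattern is compiled once (static) because construction is the
// expensive part; \1 requires the closing tag to match the opening one.
std::vector<Node> parseWithRegex(const std::string& input) {
    static const std::regex pattern("<(p|h2|h3)>([^<]*)</\\1>");
    std::vector<Node> out;
    for (std::sregex_iterator it(input.begin(), input.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        const std::string name = m[1].str();
        const short tag = (name == "p") ? 1 : (name == "h2") ? 2 : 3;
        out.push_back(Node{tag, m[2].str()});
    }
    return out;
}
```

Elements other than p, h2 and h3 are skipped because the pattern never matches them.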

edited Apr 17 '14 at 00:23

answered Apr 17 '14 at 00:07


considering the string he's looking for is 3 known characters, it's probably faster to do a naive search. – Mooing Duck Apr 17 '14 at 00:19
@MooingDuck It probably isn't. I look forward to your benchmark. – user207421 Apr 17 '14 at 00:21
The more I think about it, the more I think you're probably right – Mooing Duck Apr 17 '14 at 00:45
http://coliru.stacked-crooked.com/a/517c60cb99553875 (bad test with many variables) – Mooing Duck Apr 17 '14 at 00:50
