I am trying to study regex from multiple sources, but came across a confusion about backtracking, because one defines backtracking, that it means the state when a regex engine fail to match a pattern, so it backtracks to the position where the first atom was matched, so for example,to match cat in He captured a catfish for his cat the engine will step like this:
- It will search for
cuntil it match atcincaptured - then the same for
a - but fail to match
twithp - the engine reset back to position after
cincaptured, because it knew that no match will happen before this point.
So in all cases with no quantifiers, the engine will be reset the whole pattern to try match it again from different position.
Another defines backtracking as the state of using a quantifier like .* so the regex engine will match the complete text which makes it fail so it backtracks one by by one until match occurs. i think these are not the same.
As stated here:
A fundamental feature of regular expression matching involves the notion called backtracking. which is used (when needed) by all regular expression quantifiers, namely *, *?, +, +?, {n,m}, and {n,m}?.
For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part-that's why it's called backtracking.
Which means a pattern like ([A-Z][A-Z0-9]*)\b[^>]*>.*<\/\1> to match Testing <B><I>bold italic</I></B> text. will work like this:
- It will first match the
<B> - Then the pattern need to match the
.*which will match up to the end of the string. - then it will try to match the
<but it has already reached the end so it will backtrack one character by one until it matches.
In contrast to the first cat example which completely resets the engine to the very first atom and start again from the position matched the first atom.
But in another situation if i add ? after .*, the regex skips this atom .*? trying to match the remaining characters, but if no match it falls back to the . to match until there is a < and then start matching atoms after it.
I think there are multiple definitions here, would any one explain which on of these cases is backtracking.