Regular expression scan results don't register further regex hits?

Question

I am working on parsing pseudo-S-expressions with recursive regexes in Ruby.

After some searching, I started using the regular expression from the answer to "Matching balanced parenthesis in Ruby using recursive regular expressions like perl". The regex matches correctly, but the results behave strangely. If I call match on any of the results, the further results appear to match the entire tested string, no matter which regex is used. If I explicitly override one of the initial results with a string literal, then match works as expected for that result. However, the class of each result entry is reported as a plain String. What on earth is going on here?

src = "(def foo 10) (+ foo 4 12)"

def parse(exp)
  expression = %r{
    (?<re>
      \(
        (?:
          (?> [^()]+ )
          |
          \g<re>
        )*
      \)
    )
  }x
  trans = ""
  exp.scan(expression) { |m|
    m[0].match(/\d/) { |m|
      trans += m.string
    }
  }
  return trans
end

Of course, this isn't even close to complete parsing code. I also know it's not a great idea to try to parse code robustly with regexes, but I'm not trying to make a robust solution, just a POC.

Does anyone know what's causing these regexes to misbehave?

This looks like an interesting question, but can you update it with an example of the specific output you are seeing vs. what you are expecting? — Peter Alfvin, Nov 26 '13 at 00:08
Have you considered a Parsing Expression Grammar like [TreeTop](http://treetop.rubyforge.org/)? — Mark Thomas, Nov 26 '13 at 02:44
@PeterAlfvin I'll update this question later today and verify your answer. — C. Warren Dale, Nov 26 '13 at 18:07
@MarkThomas The final version won't parse anything, it'll piggy-back on LISP macros. Right now I'm just focusing on target language structure rather than source language capabilities. — C. Warren Dale, Nov 26 '13 at 18:11

score 0 · Accepted Answer · answered Nov 26 '13 at 00:41

The string method of MatchData returns a "frozen copy of the string passed in to match", not the part that was matched, per http://www.ruby-doc.org/core-2.0.0/MatchData.html#method-i-string

That's why you're getting the entire string back: m.string is the whole string that match was called on, so you're appending each of the initial scan results to trans instead of the matched digit.

You can confirm this by printing the value of m inside the innermost block. match is correctly matching 1, then 4.
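To make the difference concrete, here is a minimal sketch that runs the question's regex twice, once appending MatchData#string and once appending the matched text via md[0]. The constant BALANCED and the helper names digits_via_string and digits_via_match are hypothetical, introduced only for this illustration; using md[0] is one possible fix, assuming the goal was to collect the matched digit.

```ruby
# Hypothetical names for illustration; the regex itself is the one from the question.
BALANCED = %r{
  (?<re>
    \(
      (?:
        (?> [^()]+ )
        |
        \g<re>
      )*
    \)
  )
}x

# Mirrors the question's code: appends MatchData#string, which is the
# whole string that match was called on, not the matched part.
def digits_via_string(src)
  out = ""
  src.scan(BALANCED) do |captures|
    captures[0].match(/\d/) { |md| out += md.string }
  end
  out
end

# Appends md[0], the text that /\d/ actually matched.
def digits_via_match(src)
  out = ""
  src.scan(BALANCED) do |captures|
    captures[0].match(/\d/) { |md| out += md[0] }
  end
  out
end

src = "(def foo 10) (+ foo 4 12)"
p digits_via_string(src)  # prints "(def foo 10)(+ foo 4 12)"
p digits_via_match(src)   # prints "14"
```

The first result looks like the whole input because each top-level expression found by scan is appended in full; the second contains only the first digit of each expression, which is what /\d/ matched.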

Peter Alfvin


You are correct that m.string does not return only the matched substring...however, that doesn't explain the other behavior, namely that string matches against regexes that it shouldn't match against at all. I'll put more comprehensive examples in the question. – C. Warren Dale Nov 27 '13 at 01:54
Actually, I won't do that. Your answer explains everything. I've been totally misinterpreting this behavior. Thank you! – C. Warren Dale Nov 27 '13 at 01:56
