Switching lexer state in antlr3 grammar

Question

I'm trying to construct an antlr grammar to parse a templating language. that language can be embedded in any text and the boundaries are marked with opening/closing tags: {{ / }}. So a valid template looks like this:

    foo {{ someVariable }} bar

Where foo and bar should be ignored, and the part inside the {{ and }} tags should be parsed. I've found this question which basically has an answer for the problem, except that the tags are only one { and }. I've tried to modify the grammar to match 2 opening/closing characters, but as soon as i do this, the BUFFER rule consumes ALL characters, also the opening and closing brackets. The LD rule is never being invoked.

Has anyone an idea why the antlr lexer is consuming all tokens in the Buffer rule when the delimiters have 2 characters, but does not consume the delimiters when they have only one character?

    grammar Test;

    options { 
      output=AST;
      ASTLabelType=CommonTree; 
    }

    @lexer::members {
      private boolean insideTag = false;
    }

    start   
      :  (tag | BUFFER )*
      ;

    tag
      : LD IDENT^ RD
      ;

    LD @after {
      // flip lexer the state
      insideTag=true;
      System.err.println("FLIPPING TAG");
    } : '{{';

    RD @after {
      // flip the state back
      insideTag=false;
    } : '}}';

    SPACE    : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
    IDENT    : (LETTER)*;
    BUFFER   : { !insideTag }?=> ~(LD | RD)+;

    fragment LETTER : ('a'..'z' | 'A'..'Z');

Beware that `IDENT : (LETTER)*;` (might) cause the lexer to go in an infinite loop. Lexer rule _must_ always match at least 1 character. — Bart Kiers, Jan 01 '12 at 16:12

Bart Kiers · Accepted Answer · 2012-01-01T18:02:06.387

You can match any character once or more until you see {{ ahead by including a predicate inside the parenthesis ( ... )+ (see the BUFFER rule in the demo).

A demo:

grammar Test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

@lexer::members {
  private boolean insideTag = false;
}

start   
  :  tag EOF
  ;

tag
  : LD IDENT^ RD
  ;

LD 
@after {insideTag=true;} 
 : '{{'
 ;

RD 
@after {insideTag=false;} 
 : '}}'
 ;

BUFFER
 : ({!insideTag && !(input.LA(1)=='{' && input.LA(2)=='{')}?=> .)+ {$channel=HIDDEN;}
 ;

SPACE 
 : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
 ;

IDENT
 : ('a'..'z' | 'A'..'Z')+
 ;

Note that it's best to keep the BUFFER rule as the first lexer rule in your grammar: that way, it will be the first token that is tried.

If you now parse "foo {{ someVariable }} bar", the following AST is created:

enter image description here

score 0 · Answer 2 · answered Jan 08 '12 at 11:43

Wouldn't a grammar like this fit your needs? I don't see why the BUFFER needs to be that complicated.

grammar test;

options { 
  output=AST;
  ASTLabelType=CommonTree; 
}

@lexer::members { 
    private boolean inTag=false;
}

start   
  :  tag* EOF
  ;

tag
  : LD IDENT RD -> IDENT
  ;

LD 
@after { inTag=true; }
 : '{{'
 ;

RD 
@after { inTag=false; }
 : '}}'
 ;

IDENT   :   {inTag}?=> ('a'..'z'|'A'..'Z'|'_') 'a'..'z'|'A'..'Z'|'0'..'9'|'_')*
    ;

BUFFER
 : . {$channel=HIDDEN;}
 ;

Switching lexer state in antlr3 grammar

2 Answers2