Ambiguous Lexer rules in Antlr

Question

I have an ANTLR grammar with multiple lexer rules that match the same word. The ambiguity can't be resolved during lexing, but it disappears once the parser rules are taken into account.

Example:

conversion: NUMBER UNIT CONVERT UNIT;
NUMBER: [0-9]+;
UNIT: 'in' | 'meters' | ......;
CONVERT: 'in';

Input: 1 in in meters

The word "in" matches the lexer rules UNIT and CONVERT.

How can this be solved while keeping the grammar file readable?

BernardK · Answer 1 · 2017-11-16T01:25:20.127

When an input matches two lexer rules, ANTLR chooses the longest match and, if both matches have the same length, the rule that is defined first (see the ANTLR documentation on disambiguation). With your grammar, in will always be interpreted as UNIT, never CONVERT, and the rule

conversion: NUMBER UNIT CONVERT UNIT;

can't work because the lexer produces three UNIT tokens and no CONVERT token:

$ grun Question question -tokens -diagnostics input.txt 
[@0,0:0='1',<NUMBER>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:3='in',<UNIT>,1:2]
[@3,4:4=' ',<WS>,channel=1,1:4]
[@4,5:6='in',<UNIT>,1:5]
[@5,7:7=' ',<WS>,channel=1,1:7]
[@6,8:13='meters',<UNIT>,1:8]
[@7,14:14='\n',<NL>,1:14]
[@8,15:14='<EOF>',<EOF>,2:0]
Question last update 0159
line 1:5 missing 'in' at 'in'
line 1:8 mismatched input 'meters' expecting <EOF>
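
The tie-breaking behaviour seen in this token dump can be imitated outside ANTLR. The following is only a rough sketch in plain Java using java.util.regex, not ANTLR's actual implementation; the class name TieBreakDemo and the method winner are made up for illustration, and the UNIT pattern lists only the two units spelled out in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TieBreakDemo {
    // Hypothetical stand-ins for the question's lexer rules, kept in declaration order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("NUMBER", Pattern.compile("[0-9]+"));
        RULES.put("UNIT", Pattern.compile("in|meters"));
        RULES.put("CONVERT", Pattern.compile("in"));
    }

    // Returns the rule that wins at position pos: longest match first,
    // and on equal length the rule declared earlier.
    static String winner(String input, int pos) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input).region(pos, input.length());
            // Strict '>' means a later rule never replaces an earlier one on a tie.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                best = rule.getKey();
                bestLen = m.end() - pos;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String input = "1 in in meters";
        System.out.println(winner(input, 0)); // prints NUMBER
        System.out.println(winner(input, 2)); // prints UNIT
        System.out.println(winner(input, 5)); // prints UNIT, CONVERT never wins
    }
}
```

Because UNIT and CONVERT both match exactly two characters of "in", the earlier rule always wins, which is why CONVERT never shows up in the token stream.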

What you can do is use a single token type (ID or TEXT) and distinguish the occurrences with labels, like this:

grammar Question;

question
@init {System.out.println("Question last update 0132");}
    :   conversion NL EOF
    ;

conversion
    :   NUMBER unit1=ID convert=ID unit2=ID
        {System.out.println("Quantity " + $NUMBER.text + " " + $unit1.text +
         " to convert " + $convert.text + " " + $unit2.text);}
    ;

ID      : LETTER ( LETTER | DIGIT | '_' )* ; // or TEXT : LETTER+ ;     
NUMBER  : DIGIT+ ;

NL      : [\r\n] ;
WS      : [ \t] -> channel(HIDDEN) ; // -> skip ;

fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;

Execution:

$ grun Question question -tokens -diagnostics input.txt 
[@0,0:0='1',<NUMBER>,1:0]
[@1,1:1=' ',<WS>,channel=1,1:1]
[@2,2:3='in',<ID>,1:2]
[@3,4:4=' ',<WS>,channel=1,1:4]
[@4,5:6='in',<ID>,1:5]
[@5,7:7=' ',<WS>,channel=1,1:7]
[@6,8:13='meters',<ID>,1:8]
[@7,14:14='\n',<NL>,1:14]
[@8,15:14='<EOF>',<EOF>,2:0]
Question last update 0132
Quantity 1 in to convert in meters

Labels are available from the rule's context in the visitor, so it is easy to distinguish tokens of the same type.

Ben · Answer 2 · 2017-11-16T22:15:05.883

Based on the info in your question, it's hard to say what the best solution would be - I don't know what your lexer rules are, for example - nor can I tell why you have lexer rules that are ambiguous at all.

In my experience with antlr, lexer rules don't generally carry any semantic meaning; they are just text that matches some kind of regular expression. So, instead of having VARIABLE, METHOD_NAME, etc, I'd just have IDENTIFIER, and then figure it out at a higher level.

In other words, it seems (from the little I can glean from your question) that you might benefit either from replacing UNIT and CONVERT with grammar rules, or just having a single rule:

conversion: NUMBER TEXT TEXT TEXT;

and validating the text values in your ANTLR listener/tree-walker/etc.
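
As a rough illustration of that validation step, here is a minimal sketch in plain Java. It assumes the three TEXT values have already been pulled out of the parse tree as strings; the class ConversionCheck, the UNITS set, and the isValid method are hypothetical names, and the unit list contains only the two units that the question spells out.

```java
import java.util.Set;

public class ConversionCheck {
    // Hypothetical unit list; the question elides the full set.
    static final Set<String> UNITS = Set.of("in", "meters");
    // The keyword expected between the two units.
    static final String CONVERT_KEYWORD = "in";

    // Checks the three TEXT values of a conversion by position:
    // unit, then the keyword, then unit.
    static boolean isValid(String unit1, String convert, String unit2) {
        return UNITS.contains(unit1)
                && CONVERT_KEYWORD.equals(convert)
                && UNITS.contains(unit2);
    }

    public static void main(String[] args) {
        System.out.println(isValid("in", "in", "meters"));     // prints true
        System.out.println(isValid("meters", "to", "in"));     // prints false
        System.out.println(isValid("furlongs", "in", "in"));   // prints false
    }
}
```

The position of each value, not its token type, decides whether it is treated as a unit or as the keyword, so the word "in" can play both roles.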

EDIT

Thanks for updating your question with lexer rules. It's clear now why it's failing: as BernardK points out, when two lexer rules match the same text, ANTLR always chooses the one defined first. This means the second of two ambiguous lexer rules can never match, which makes your proposed design infeasible.

My opinion is that lexer rules are not the correct layer to do things like unit validation; they excel at structure, not content. Evaluating the parse tree will be much more practical than trying to contort an antlr grammar.

Finally, you might also do something with embedded actions on parse rules, like validating the value of an ID token against a known set of units. It could work, but would destroy the reusability of your grammar.

I added the lexer rules to the example to make it more clear. Catching all text in a lexer rule and parsing it manually in the visitor defeats the purpose of using Antlr. Moving the lexer rules to grammar rules would work, but would make the grammar hard to read: `unit: 'm' 'e' 't' 'e' 'r' | ... ` — Toast, Nov 16 '17 at 00:27
