parsing - Ordering lexer rules in a grammar using ANTLR4
I'm using ANTLR4 to generate a parser, and I'm new to parser grammars. I've read the very helpful ANTLR Mega Tutorial, but I'm still stuck on how to order (and/or write) my lexer and parser rules.
I want the parser to be able to handle something like this:
hello << name >>, how are you?
At runtime I will replace "<< name >>" with the user's name.
So mostly I'm parsing text words (and punctuation, symbols, etc.), except for the occasional "<< something >>" tag, which I'm calling a "func" in my grammar.
Here is my grammar:
doc: item* EOF ;
item: (func | WORD) PUNCT? ;
func: '<<' ID '>>' ;
WS : [ \t\n\r] -> skip ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment CHAR : (LETTER | DIGIT | SYMB ) ;
WORD : CHAR+ ;
ID: LETTER ( LETTER | DIGIT)* ;
PUNCT : [.,?!] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}<>] ;
Side note: I added "PUNCT?" at the end of the "item" rule because it is possible, as in the example sentence above, for a comma to appear right after a "func". Since a comma can also follow a "WORD", I decided to put the punctuation in "item" instead of in both "func" and "WORD".
If I run this parser on the above sentence, I get a parse tree that looks like this:
Anything highlighted in red is a parse error.
So it is not recognizing the "ID" inside the double angle brackets as an "ID". Presumably this is because "WORD" comes first in my list of lexer rules. However, I have no rule that says "<< WORD >>", only a rule that says "<< ID >>", so I'm not clear on why that's happening.
If I swap the order of "ID" and "WORD" in my grammar, so that they are in this order:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
and run the parser, I get a parse tree like this:
So now the "func" and "ID" rules are being handled appropriately, but none of the "WORD"s are being recognized.
How do I get past this conundrum?
I suppose one option might be to change the "func" rule to "<< WORD >>" and treat everything as words, doing away with "ID". But I wanted to differentiate a text word from a variable identifier (for instance, no special characters are allowed in a variable identifier).
Thanks for any help!
From The Definitive ANTLR 4 Reference:
ANTLR resolves lexical ambiguities by matching the input string to the rule specified first in the grammar.
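To make that rule concrete, here is a rough Python simulation of how a lexer picks a token: the longest match wins, and rule order only breaks ties between matches of equal length. It is a sketch, not ANTLR itself. The regular expressions are hand translations of the lexer rules in the question, the `tokenize` helper is hypothetical, and the sketch assumes the `'<<'` and `'>>'` literals are tried before the named rules. The point it illustrates is that the lexer runs without any knowledge of what the parser expects, so `name` gets the same token type whether or not it sits between `<<` and `>>`.

```python
import re

def tokenize(text, rules, skip=("WS",)):
    """Longest match wins; when lengths tie, the rule listed first wins."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name not in skip:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Hand-translated approximations of the grammar's lexer rules (assumptions).
LITERALS = [("'<<'", r"<<"), ("'>>'", r">>")]
WS = ("WS", r"[ \t\n\r]")
WORD = ("WORD", r"[^.,?! |{}<>]+")        # CHAR+ with the fragments expanded
ID = ("ID", r"[a-zA-Z][a-zA-Z0-9]*")
PUNCT = ("PUNCT", r"[.,?!]")

word_first = LITERALS + [WS, WORD, ID, PUNCT]   # order used in the question
id_first = LITERALS + [WS, ID, WORD, PUNCT]     # order after swapping ID and WORD

text = "hello << name >>, how are you at nine o'clock?"

for label, rules in (("WORD first", word_first), ("ID first", id_first)):
    kinds = [kind for kind, lexeme in tokenize(text, rules) if lexeme == "name"]
    print(label, "->", kinds)
```

With WORD listed first, every plain word, including `name`, comes out as WORD, so the parser never sees the ID that `func` requires. With ID listed first, plain words become ID, and only lexemes that ID cannot fully match (such as `o'clock`) fall through to WORD because the WORD match is longer.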
With your grammar (in Question.g4) and a t.text file containing
hello << name >>, how are you at nine o'clock?
the execution of
$ grun Question doc -tokens -diagnostics t.text
gives
[@0,0:4='hello',<WORD>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<WORD>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<WORD>,1:18]
[@6,22:24='are',<WORD>,1:22]
[@7,26:28='you',<WORD>,1:26]
[@8,30:31='at',<WORD>,1:30]
[@9,33:36='nine',<WORD>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]
line 1:9 mismatched input 'name' expecting ID
line 1:14 extraneous input '>>' expecting {<EOF>, '<<', WORD, PUNCT}
Now change WORD to word in the item rule, and add a word parser rule:
item: (func | word) PUNCT? ;
word: WORD | ID ;
and put ID before WORD:
ID: LETTER ( LETTER | DIGIT)* ;
WORD : CHAR+ ;
The tokens are now
[@0,0:4='hello',<ID>,1:0]
[@1,6:7='<<',<'<<'>,1:6]
[@2,9:12='name',<ID>,1:9]
[@3,14:15='>>',<'>>'>,1:14]
[@4,16:16=',',<PUNCT>,1:16]
[@5,18:20='how',<ID>,1:18]
[@6,22:24='are',<ID>,1:22]
[@7,26:28='you',<ID>,1:26]
[@8,30:31='at',<ID>,1:30]
[@9,33:36='nine',<ID>,1:33]
[@10,38:44='o'clock',<WORD>,1:38]
[@11,45:45='?',<PUNCT>,1:45]
[@12,47:46='<EOF>',<EOF>,2:0]
and there is no more error. As the -gui graphic shows, you now have branches identified as word or func.
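As a rough illustration of why the revised parser rules accept this token stream, the Python sketch below hand-codes the rules `item: (func | word) PUNCT?`, `func: '<<' ID '>>'` and `word: WORD | ID` over a list of (type, text) pairs. The `parse_doc` function and the token lists are hypothetical stand-ins written for this example; they are not what ANTLR generates.

```python
def parse_doc(tokens):
    """Parse (type, text) pairs into ('func', name) / ('word', text) items."""
    items, i = [], 0
    while i < len(tokens):
        kind, lexeme = tokens[i]
        if kind == "'<<'":
            # func: '<<' ID '>>'  (only an ID token is accepted here)
            if (i + 2 >= len(tokens)
                    or tokens[i + 1][0] != "ID"
                    or tokens[i + 2][0] != "'>>'"):
                raise SyntaxError(f"expected ID then '>>' after '<<' at token {i}")
            items.append(("func", tokens[i + 1][1]))
            i += 3
        elif kind in ("WORD", "ID"):
            # word: WORD | ID  (either token type counts as a text word)
            items.append(("word", lexeme))
            i += 1
        else:
            raise SyntaxError(f"unexpected {kind} {lexeme!r} at token {i}")
        # item: (func | word) PUNCT?  (optional trailing punctuation)
        if i < len(tokens) and tokens[i][0] == "PUNCT":
            i += 1
    return items

# Token types as reported after putting ID before WORD.
id_first_tokens = [
    ("ID", "hello"), ("'<<'", "<<"), ("ID", "name"), ("'>>'", ">>"),
    ("PUNCT", ","), ("ID", "how"), ("ID", "are"), ("ID", "you"),
    ("ID", "at"), ("ID", "nine"), ("WORD", "o'clock"), ("PUNCT", "?"),
]

# Token types with the original ordering: every plain word is a WORD.
word_first_tokens = [
    ("WORD", text) if kind == "ID" else (kind, text)
    for kind, text in id_first_tokens
]

print(parse_doc(id_first_tokens))
try:
    parse_doc(word_first_tokens)
except SyntaxError as err:
    print("WORD-first tokens fail:", err)
```

The design point is that the ID/WORD distinction stays in the lexer, while the parser decides where each is allowed: `func` still insists on an ID, and `word` tolerates both.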