
introduction_to_antlr 3.ppt

25. Mar 2023

  1. A (Long) Introduction to AntLR
     Hande Çelikkanat, 25 March 2023
     Slides adapted from:
     - AntLR Reference Manual by Terence Parr: antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
     - AntLR Tutorial by Ashley J.S Mills: http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlrhome.html
     - An Introduction to AntLR by Terence Parr: http://www.cs.usfca.edu/~parrt/course/652/lectures/antlr.html
     - An AntLR Tutorial by Scott Stanchfield: javadude.com/articles/antlrtut/
  2. AntLR
     ANother Tool for Language Recognition (or anti-LR?)
     An LL(k) parser and translator generator that can create
     - lexers
     - parsers
     - abstract syntax trees (ASTs)
     You describe the language with a grammar and in return receive a program that can recognize and translate that language.
  3. Tasks Divided
     - Lexical analysis (scanning)
     - Syntactic analysis (parsing)
     - Tree generation
     - Code generation
  4. Lexer
     Some kind of input interface streams the source file to the lexer character by character.
     The lexer groups characters into tokens that are meaningful to the parser. A "token" may be a
     - keyword
     - identifier
     - symbol
     - operator
     The lexer also removes comments and whitespace, which are meaningless to the parser.
     The result is a stream of tokens, which the parser receives one by one.
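The lexer's job described above can be sketched by hand in plain Java. This is only an illustration of the idea, not code that AntLR generates: the class name `LexerSketch`, the `ID:`/`INT:`/`SYM:` labels, and the `//` comment syntax are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; this is NOT code generated by AntLR.
class LexerSketch {
    // Groups characters into tokens and drops whitespace and "//" comments.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace is never passed on to the parser
            } else if (c == '/' && i + 1 < input.length() && input.charAt(i + 1) == '/') {
                // skip a line comment up to (not including) the newline
                while (i < input.length() && input.charAt(i) != '\n') i++;
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add("ID:" + input.substring(start, i));
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add("INT:" + input.substring(start, i));
            } else {
                tokens.add("SYM:" + c); // any other character is a one-character symbol
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [ID:x, SYM:=, INT:42, SYM:+, ID:y1]
        System.out.println(tokenize("x = 42 + y1 // set x"));
    }
}
```

The parser would then consume this list one token at a time and never sees the whitespace or the comment.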
  5. Parser
     The parser organizes the tokens into the sequences allowed by the grammar of the language.
     If it encounters a sequence of tokens that matches none of the allowed sequences, it issues an error. Whether to try to recover from the error by making assumptions is a design choice.
     Parsers may either do syntax-directed translation on the fly or convert the token sequences into an Abstract Syntax Tree (AST).
     An AST is a structure that
     - keeps information in an easily traversable form (such as an operator at a node and its operands at the children of that node)
     - ignores form-dependent superficial details
     More on ASTs later...
     The parser also generates one or more symbol tables, which contain information about the tokens it encounters.
  6. What does a grammar file look like?
     It is composed of rules.
     ANTLR accepts three types of grammar specifications:
     - parsers
     - lexers
     - tree-parsers (also called tree-walkers)
     It uses LL(k) analysis for all three, so the grammar specifications are similar and the generated lexers and parsers behave similarly.
  7. Sample File
     [Figure: sample grammar file, taken from the AntLR tutorial by Ashley J.S Mills]
  8. Sample File Divided (1/3)
     - A grammar file may contain an arbitrary number of parsers, lexers, and tree-parsers
       - a separate class file is generated for each
       - i.e., YourLexerClass.class, YourParserClass.class, YourTreeParserClass.class
     - Header:
       - holds the preamble that is put on top of each of these classes
       - an import, maybe?
  9. Sample File Divided (2/3)
     - Options (file-wide):
         charVocabulary = '\0'..'\377'; // defines the alphabet (used by the complement and wildcard operators)
         k=2;                           // two characters of lookahead
     - Class specific:
         { ... header for parser class only ... }
         class MyParser extends Parser;
         options { ...parser options... }
         { parser class members }
         parser rules
  10. Sample File Divided (3/3)
      - Rules in EBNF notation
        [Figure: taken from the AntLR tutorial by Ashley J.S Mills]
      You simply list a set of lexical rules that match tokens. The tool automatically generates code that maps the next input character(s) to the rule likely to match: a big "switch" that routes recognition flow to the appropriate rule.
  11. Symbols in AntLR
      [Figure: symbols used in AntLR grammars, taken from the AntLR reference manual]
  12. Lexer
      [Figure: taken from the AntLR tutorial by Ashley J.S Mills]
      With one restriction:
      - Rules defined within a lexer grammar must have a name beginning with an uppercase letter
  13. Lexer Rules
      You can define operators like:
          BECOMES  : ":=" ;
          COLON    : ':' ;
          SEMI     : ';' ;
          EQUALS   : '=' ;
          LBRACKET : '[' ;
          RBRACKET : ']' ;
          LPAREN   : '(' ;
          RPAREN   : ')' ;
          LT       : '<' ;
          LTE      : "<=" ;
          PLUS     : '+' ;
          MINUS    : '-' ;
          TIMES    : '*' ;
          DIV      : '/' ;
      And then you can define a token class such as:
          OPS : (PLUS | MINUS | TIMES | DIV) ;
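Rules such as COLON and BECOMES, or LT and LTE, share their first character, so one character of lookahead cannot tell them apart. The following hand-written Java sketch shows how a "switch" on the first lookahead character, plus a peek at the second one (k=2), could route to the right rule. It is an illustration under assumptions, not AntLR output: the class and method names are invented, and only a subset of the operators is covered.

```java
// Hand-written illustration only; this is not AntLR-generated code.
class OperatorLexerSketch {
    // Returns the name of the operator token that starts at index pos of input.
    static String match(String input, int pos) {
        char la1 = input.charAt(pos);                                        // first lookahead character
        char la2 = pos + 1 < input.length() ? input.charAt(pos + 1) : '\0';  // second one (k=2)
        switch (la1) {
            case ':': return la2 == '=' ? "BECOMES" : "COLON";
            case '<': return la2 == '=' ? "LTE" : "LT";
            case ';': return "SEMI";
            case '=': return "EQUALS";
            case '+': return "PLUS";
            case '-': return "MINUS";
            case '*': return "TIMES";
            case '/': return "DIV";
            default:
                throw new IllegalArgumentException("unexpected character: " + la1);
        }
    }

    public static void main(String[] args) {
        System.out.println(match("x := 1", 2)); // prints BECOMES
        System.out.println(match("a : b", 2));  // prints COLON
    }
}
```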
  14. Actions
      Blocks of source code (expressed in the target language) enclosed in curly braces.
      An action is executed after the preceding production element has been recognized and before the following element is recognized.
      Actions are typically used to generate output, construct trees, or modify a symbol table.
      The position of an action dictates when it is executed relative to the surrounding grammar elements.
      If an action is the first element of a production, it is executed before any other element in that production, but only if that production is predicted by the lookahead.
          rule_name
              : ( {init-action}:
                  {action of 1st production} production_1
                | {action of 2nd production} production_2
                )?
              ;
      The init-action is executed regardless of what (if anything) matches in the optional subrule.
      Init-actions are placed within the loops generated for the subrules (...)+ and (...)*.
  15. Tip: Skipping Tokens
      White space has no role in the grammar:
          WS : (' ' | '\n' | '\t')
               { $setType(Token.SKIP); }   // action
             ;
      The action tells the lexer not to pass this token to the parser: recognize it and then throw it away.
      Same for comments ;)
  16. Tip: Newline Stuff
      The line number of the input is used for error reporting.
      It must be incremented by hand when the lexer encounters a newline:
          WS : ( ' '
               | '\t'
               | '\f'
               // handle newlines
               | ( "\r\n"   // DOS/Windows
                 | '\r'     // Macintosh
                 | '\n'     // Unix
                 )
                 // increment the line count; this action is executed only in the newline case
                 { newline(); }
               )
               { $setType(Token.SKIP); }
             ;
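The reason the DOS/Windows alternative "\r\n" has to be tried before '\r' and '\n' is that otherwise one line break would be counted twice. A minimal plain-Java sketch of the same bookkeeping, with the invented name `LineCounterSketch` standing in for what `newline()` does inside the lexer:

```java
// Hand-written illustration only; in a real AntLR lexer the count is kept by newline().
class LineCounterSketch {
    // Returns the 1-based line number reached at the end of input.
    static int lineAtEnd(String input) {
        int line = 1;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\r') {
                // "\r\n" (DOS/Windows) is a single line break: consume the '\n' as well
                if (i + 1 < input.length() && input.charAt(i + 1) == '\n') i++;
                line++;          // lone '\r' is the old Macintosh line break
            } else if (c == '\n') {
                line++;          // Unix line break
            }
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(lineAtEnd("a\r\nb\nc\rd")); // prints 4
    }
}
```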
  17. Parser
          class ExprParser extends Parser;
          expr  : mexpr ((PLUS|MINUS) mexpr)* ;
          mexpr : atom (STAR atom)* ;
          atom  : INT | LPAREN expr RPAREN ;
      - Rules defined within a parser grammar must have a name beginning with a lowercase letter
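An LL parser for this grammar is essentially one method per rule, each deciding what to do by looking at the next token. The sketch below is a hand-written recognizer in plain Java that mirrors the three rules; it is an approximation of the idea, not the code AntLR emits. Token types are passed in as plain strings, and the names `ExprRecognizerSketch`, `la`, and `accepts` are assumptions for the example.

```java
// Hand-written illustration only; AntLR's generated parser differs in detail.
class ExprRecognizerSketch {
    private final String[] tokens; // token type names, e.g. "INT", "PLUS"
    private int pos = 0;

    ExprRecognizerSketch(String... tokens) { this.tokens = tokens; }

    // One token of lookahead; "EOF" once the input is exhausted.
    private String la() { return pos < tokens.length ? tokens[pos] : "EOF"; }

    private void match(String type) {
        if (!la().equals(type)) {
            throw new IllegalStateException("expected " + type + " but found " + la());
        }
        pos++;
    }

    // expr : mexpr ((PLUS|MINUS) mexpr)* ;
    void expr() {
        mexpr();
        while (la().equals("PLUS") || la().equals("MINUS")) {
            match(la());
            mexpr();
        }
    }

    // mexpr : atom (STAR atom)* ;
    void mexpr() {
        atom();
        while (la().equals("STAR")) {
            match("STAR");
            atom();
        }
    }

    // atom : INT | LPAREN expr RPAREN ;
    void atom() {
        if (la().equals("LPAREN")) {
            match("LPAREN");
            expr();
            match("RPAREN");
        } else {
            match("INT");
        }
    }

    // True if the whole token sequence is a valid expr.
    static boolean accepts(String... tokens) {
        ExprRecognizerSketch p = new ExprRecognizerSketch(tokens);
        try {
            p.expr();
            p.match("EOF");
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("INT", "PLUS", "INT", "STAR", "INT")); // prints true
        System.out.println(accepts("INT", "PLUS"));                       // prints false
    }
}
```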
  18. Tip: Keywords and Literals (1/2)
      Many languages have a general "identifier" lexical rule, and keywords that are special cases of the identifier pattern.
      A typical identifier token may be defined as:
          ID : LETTER (LETTER | DIGIT)* ;
      So how can AntLR understand that "if" is not an identifier?
      You put fixed keywords into a literals table, which is checked after each token is matched.
      Any double-quoted string used in a parser is automatically entered into the literals table of the associated lexer:
          subprogramBody
              : (basicDecl)* (procedureDecl)* "begin" (statement)* "end" IDENT
              ;
  19. Tip: Keywords and Literals (2/2)
      The testLiterals option
      By default, ANTLR generates code in all lexer rules to test each token against the literals table.
      However, you may suppress this code generation in the lexer with a grammar option:
          class L extends Lexer;
          options { testLiterals=false; }
          ...
      If you turn this option off for a lexer, you may re-enable it for specific rules:
          ID options { testLiterals=true; }
             : LETTER (LETTER | DIGIT)*
             ;
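The literals-table check from the two slides above amounts to a lookup after the identifier rule has matched. A minimal sketch in plain Java, assuming a table with three keywords; the class name, the `tokenType` helper, and the `LITERAL_...` type names are illustrative assumptions, not taken from the slides.

```java
import java.util.HashMap;
import java.util.Map;

// Hand-written illustration only; not AntLR-generated code.
class LiteralsTableSketch {
    // Assumed contents: keywords that appeared as double-quoted strings in the parser.
    private static final Map<String, String> LITERALS = new HashMap<>();
    static {
        LITERALS.put("begin", "LITERAL_begin");
        LITERALS.put("end", "LITERAL_end");
        LITERALS.put("if", "LITERAL_if");
    }

    // Called after the ID rule has matched text; testLiterals mirrors the grammar option.
    static String tokenType(String text, boolean testLiterals) {
        if (testLiterals) {
            String literalType = LITERALS.get(text);
            if (literalType != null) return literalType; // a keyword, not an identifier
        }
        return "ID";
    }

    public static void main(String[] args) {
        System.out.println(tokenType("if", true));   // prints LITERAL_if
        System.out.println(tokenType("iffy", true)); // prints ID
        System.out.println(tokenType("if", false));  // prints ID
    }
}
```

With the test switched off, "if" comes out as a plain ID, which is why turning testLiterals off for the whole lexer and back on for the ID rule avoids needless lookups in rules that can never match a keyword.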
  20. Tip: Token Object Creation
      You will sometimes want to access information about the token being matched.
      Label a rule reference to obtain a Token object representing the text, token type, line number, etc. matched for that rule reference.
      Lexer rule:
          INT : ('0'..'9')+ ;
      Lexer rule that labels a reference to INT:
          INDEX : '[' i:INT ']' {System.out.println(i.getText());} ;
  21. Tip: Syntactic / Semantic Predicates
      There are other situations where you have to turn certain rules on and off depending on prior context or semantic information.
      Use "predicates" to decide.
  22. Syntactic Predicates
      ANTLR (tree) parsers usually use only a single symbol of lookahead, which is normally not a problem because intermediate forms are explicitly designed to be easy to walk.
      However, there is occasionally a need to distinguish between similar tree structures.
      Syntactic predicates can be used to overcome the limitations of fixed lookahead.
      For example, distinguishing between the unary and binary minus operator:
          expr : ( #(MINUS expr expr) )=> #( MINUS expr expr )
               | #( MINUS expr )
               ...
               ;
      The order of evaluation is very important, as the second alternative is a "subset" of the first alternative.
      Syntactic predicates are a form of selective backtracking; therefore, actions are turned off while a syntactic predicate is evaluated so that actions do not have to be undone.
  23. Semantic Predicates
      Semantic predicates
      - at the start of an alternative: decide whether or not to match that alternative
      - in the middle of a production: throw an exception when they evaluate to false
          stat : {isTypeName(LT(1))}? ID ID ";"   // declaration "type varName;"
               | ID "=" expr ";"                  // assignment
               ;
          decl : "var" ID ":" t:ID
                 { isTypeName(t.getText()) }?     // used to throw an exception
               ;
  24. Eg: Keeping State Information
      Context-sensitive recognition example:
      If you are matching tokens that separate rows of data, such as "----", you probably only want to match them if the "begin table" sequence has been found.
          BEGIN_TABLE : '[' {this.inTable=true;}    // enter table context
                      ;
          ROW_SEP     : {this.inTable}? "----"      // semantic predicate
                      ;
          END_TABLE   : ']' {this.inTable=false;}   // exit table context
                      ;
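The effect of the `inTable` predicate can be imitated in a few lines of plain Java: the row separator is only recognized while the flag is set. This is a hand-written sketch, not AntLR output; the class name and the `CHAR:` fallback token for everything else are assumptions added so the example is complete.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written illustration only; not AntLR-generated code.
class TableLexerSketch {
    private boolean inTable = false; // state shared by the three "rules"

    List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '[') {
                inTable = true;                       // enter table context
                tokens.add("BEGIN_TABLE");
                i++;
            } else if (c == ']') {
                inTable = false;                      // exit table context
                tokens.add("END_TABLE");
                i++;
            } else if (inTable && input.startsWith("----", i)) {
                tokens.add("ROW_SEP");                // predicate true: rule is enabled
                i += 4;
            } else if (Character.isWhitespace(c)) {
                i++;
            } else {
                tokens.add("CHAR:" + c);              // assumed fallback for other input
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [CHAR:-, CHAR:-, BEGIN_TABLE, ROW_SEP, END_TABLE]
        System.out.println(new TableLexerSketch().tokenize("--[----]"));
    }
}
```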
  25. The Java Code
      The code to invoke the parser:
          import java.io.*;

          class Main {
              public static void main(String[] args) {
                  try {
                      // use DataInputStream to grab bytes
                      MyLexer lexer = new MyLexer(new DataInputStream(System.in));
                      MyParser parser = new MyParser(lexer);
                      int x = parser.expr();
                      System.out.println(x);
                  } catch(Exception e) {
                      System.err.println("exception: "+e);
                  }
              }
          }
  26. Running AntLR
      In Linux:
          runantlr <antlr_file>.g
          javac *.java
          java Main
      In Windows:
      Eclipse has a very easy-to-use plugin for AntLR; see http://antlreclipse.sourceforge.net/ for very detailed instructions.
      The plugin will run AntLR on the grammar file.
  27. Expression Evaluation 1: Syntax-Directed Translation
      To evaluate the expressions on the fly as the tokens come in, add actions to the parser:
          class ExprParser extends Parser;
          expr returns [int value=0]
          {int x;}
              : value=mexpr
                ( PLUS x=mexpr {value += x;}
                | MINUS x=mexpr {value -= x;}
                )*
              ;
          mexpr returns [int value=0]
          {int x;}
              : value=atom ( STAR x=atom {value *= x;} )*
              ;
          atom returns [int value=0]
              : i:INT {value=Integer.parseInt(i.getText());}
              | LPAREN value=expr RPAREN
              ;
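To see what the actions in this grammar compute, the same translation can be written by hand as a small recursive-descent evaluator in plain Java. This is a sketch under assumptions, not the parser AntLR generates: it reads characters directly instead of tokens from a separate lexer, and the names `ExprEvalSketch` and `eval` are invented for the example.

```java
// Hand-written illustration only; mirrors the actions of the grammar, not AntLR's output.
class ExprEvalSketch {
    private final String in;
    private int pos = 0;

    ExprEvalSketch(String input) { this.in = input.replaceAll("\\s+", ""); }

    // One character of lookahead; '\0' at end of input.
    private char la() { return pos < in.length() ? in.charAt(pos) : '\0'; }

    // expr : value=mexpr ( PLUS x=mexpr {value += x;} | MINUS x=mexpr {value -= x;} )* ;
    int expr() {
        int value = mexpr();
        while (la() == '+' || la() == '-') {
            char op = in.charAt(pos++);
            int x = mexpr();
            if (op == '+') value += x; else value -= x;
        }
        return value;
    }

    // mexpr : value=atom ( STAR x=atom {value *= x;} )* ;
    int mexpr() {
        int value = atom();
        while (la() == '*') {
            pos++;
            int x = atom();
            value *= x;
        }
        return value;
    }

    // atom : INT | LPAREN value=expr RPAREN ;
    int atom() {
        if (la() == '(') {
            pos++;
            int value = expr();
            if (la() != ')') throw new IllegalStateException("expected ) at position " + pos);
            pos++;
            return value;
        }
        int start = pos;
        while (Character.isDigit(la())) pos++;
        if (start == pos) throw new IllegalStateException("expected INT at position " + pos);
        return Integer.parseInt(in.substring(start, pos));
    }

    static int eval(String input) {
        ExprEvalSketch p = new ExprEvalSketch(input);
        int value = p.expr();
        if (p.pos != p.in.length()) throw new IllegalStateException("unexpected input at position " + p.pos);
        return value;
    }

    public static void main(String[] args) {
        System.out.println(eval("3 + 4 * (2 - 1)")); // prints 7
        System.out.println(eval("10 - 3 - 2"));      // prints 5
    }
}
```

The loop in `expr()` applies each operator as soon as its right operand is known, which is why `10 - 3 - 2` is evaluated left to right.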
  28. Expression Evaluation 2: via AST Intermediate Form
      A more powerful strategy than syntax-directed translation is to build an AST: an intermediate representation that holds all or most of the input symbols and encodes the relationships between those tokens in the structure of the data.
      With this kind of tree, you use a tree walker to compute the same values as before, but with a different strategy.
      The utility of ASTs becomes clear when you must do multiple walks over the tree to figure out what to compute, or when you do tree rewrites that morph the tree towards another language.
  29. Abstract Syntax Trees
      Abstract Syntax Tree: like a parse tree, without the unnecessary information.
      Two-dimensional trees that can encode the structure of the input as well as the input symbols.
      Either
      - homogeneous: all objects of the same type, e.g., CommonAST in ANTLR, or
      - heterogeneous: multiple types such as PlusNode, MultNode...
      An AST for (3+4) might be represented as
          [Figure: a "+" node with the children 3 and 4]
      No parentheses are included in the tree!
  30. AST Construction
      To get ANTLR to generate a useful AST:
      - turn on the buildAST option
      - add a few suffix operators
          class ExprParser extends Parser;
          options { buildAST=true; }
          expr  : mexpr ((PLUS^|MINUS^) mexpr)* ;
          mexpr : atom (STAR^ atom)* ;
          atom  : INT | LPAREN! expr RPAREN! ;
      No changes in the lexer.
  31. AST Operators
      AST root operator
      Normally AntLR makes the first token it encounters the root of the tree.
      We usually want to change this, e.g., for operators.
      A token suffixed with the "^" root operator is forced to be the root of the current tree:
          expr : mexpr ((PLUS^|MINUS^) mexpr)* ;
      AST exclude operator
      Tokens and rule references suffixed with the "!" exclude operator are not included in the AST, e.g., parentheses:
          atom : INT | LPAREN! expr RPAREN! ;
  32. AST Parsing and Evaluation
      A tree rule has a form like #(A B C), which means "match a node of type A, and then descend into its list of children and match B and C".
      This notation can be nested arbitrarily, using #(...) for child trees, e.g., #(A B #(C D)).
          class ExprTreeParser extends TreeParser;
          expr returns [int r=0]
          { int a,b; }
              : #(PLUS a=expr b=expr)  {r = a+b;}
              | #(MINUS a=expr b=expr) {r = a-b;}
              | #(STAR a=expr b=expr)  {r = a*b;}
              | i:INT {r = (int)Integer.parseInt(i.getText());}
              ;
      Important: matches are sufficient matches, not exact matches. As long as the tree satisfies the pattern, a match is reported, regardless of how much is left unparsed: the pattern #( A B ) matches the tree #( A #(B C) D ).
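The tree walker above can be pictured as a recursive function over a homogeneous tree. The following plain-Java sketch uses its own tiny node class rather than AntLR's CommonAST, so every name in it (`AstSketch`, `toLisp`, `eval`) is an assumption for illustration; the LISP-style printout only approximates what AntLR prints.

```java
import java.util.Arrays;
import java.util.List;

// Hand-written illustration only; a stand-in for a homogeneous AST, not AntLR's CommonAST.
class AstSketch {
    final String type;              // token type, e.g. "PLUS" or "INT"
    final String text;              // token text, e.g. "+" or "42"
    final List<AstSketch> children;

    AstSketch(String type, String text, AstSketch... children) {
        this.type = type;
        this.text = text;
        this.children = Arrays.asList(children);
    }

    static AstSketch num(int n) { return new AstSketch("INT", Integer.toString(n)); }

    // Prints the tree in a LISP-like notation: root first, then the children.
    String toLisp() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("( ").append(text);
        for (AstSketch child : children) sb.append(' ').append(child.toLisp());
        return sb.append(" )").toString();
    }

    // Mirrors the tree grammar: one case per alternative of expr.
    static int eval(AstSketch t) {
        switch (t.type) {
            case "PLUS":  return eval(t.children.get(0)) + eval(t.children.get(1));
            case "MINUS": return eval(t.children.get(0)) - eval(t.children.get(1));
            case "STAR":  return eval(t.children.get(0)) * eval(t.children.get(1));
            case "INT":   return Integer.parseInt(t.text);
            default: throw new IllegalArgumentException("unexpected node type: " + t.type);
        }
    }

    public static void main(String[] args) {
        // Tree for 3 + 4 * 5
        AstSketch tree = new AstSketch("PLUS", "+", num(3), new AstSketch("STAR", "*", num(4), num(5)));
        System.out.println(tree.toLisp()); // prints ( + 3 ( * 4 5 ) )
        System.out.println(eval(tree));    // prints 23
    }
}
```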
  33. in Java
      The code to launch the parser and the tree walker:
          import java.io.*;
          import antlr.CommonAST;
          import antlr.collections.AST;

          class Calc {
              public static void main(String[] args) {
                  try {
                      CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
                      CalcParser parser = new CalcParser(lexer);
                      // Parse the input expression
                      parser.expr();
                      CommonAST t = (CommonAST)parser.getAST();
                      // Print the resulting tree out in LISP notation
                      System.out.println(t.toStringList());
                      CalcTreeWalker walker = new CalcTreeWalker();
                      // Traverse the tree created by the parser
                      int r = walker.expr(t);
                      System.out.println("value is "+r);
                  } catch(Exception e) {
                      System.err.println("exception: "+e);
                  }
              }
          }
  34. AST Construction by Hand
      In some cases, you may want to transform a tree yourself, e.g., to optimize addition with zero:
          class CalcTreeWalker extends TreeParser;
          options {
              buildAST = true;   // "transform" mode
          }
          expr : ! #(PLUS left:expr right:expr)   // '!' turns off auto transform
                 {
                     // x+0 = x
                     if ( #right.getType()==INT && Integer.parseInt(#right.getText())==0 ) {
                         #expr = #left;
                     }
                     // 0+x = x
                     else if ( #left.getType()==INT && Integer.parseInt(#left.getText())==0 ) {
                         #expr = #right;
                     }
                     // x+y
                     else {
                         #expr = #(PLUS, left, right);
                     }
                 }
               | #(STAR expr expr)   // use auto transformation
               | i:INT
               ;
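The same rewrite can be expressed as an ordinary recursive function that returns a new tree. This plain-Java sketch uses its own small binary `Node` class instead of AntLR's AST types, so all names here are assumptions for illustration; it shows the x+0 and 0+x cases being folded away while STAR nodes are rebuilt unchanged.

```java
// Hand-written illustration only; not AntLR-generated code and not AntLR's AST classes.
class SimplifySketch {
    static final class Node {
        final String type;   // "PLUS", "STAR" or "INT"
        final String text;
        final Node left;
        final Node right;
        Node(String type, String text, Node left, Node right) {
            this.type = type; this.text = text; this.left = left; this.right = right;
        }
    }

    static Node leaf(int n) { return new Node("INT", Integer.toString(n), null, null); }
    static Node plus(Node l, Node r) { return new Node("PLUS", "+", l, r); }
    static Node star(Node l, Node r) { return new Node("STAR", "*", l, r); }

    private static boolean isZero(Node n) {
        return n.type.equals("INT") && Integer.parseInt(n.text) == 0;
    }

    // Returns a tree in which every x+0 and 0+x has been replaced by x.
    static Node simplify(Node n) {
        if (n.type.equals("INT")) return n;
        Node l = simplify(n.left);
        Node r = simplify(n.right);
        if (n.type.equals("PLUS")) {
            if (isZero(r)) return l;   // x+0 = x
            if (isZero(l)) return r;   // 0+x = x
        }
        return new Node(n.type, n.text, l, r); // otherwise rebuild the node as it was
    }

    static String show(Node n) {
        if (n.type.equals("INT")) return n.text;
        return "( " + n.text + " " + show(n.left) + " " + show(n.right) + " )";
    }

    public static void main(String[] args) {
        Node tree = plus(leaf(0), star(leaf(2), plus(leaf(3), leaf(0))));
        System.out.println(show(tree));           // prints ( + 0 ( * 2 ( + 3 0 ) ) )
        System.out.println(show(simplify(tree))); // prints ( * 2 3 )
    }
}
```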
  35. in Java
      The code to launch the parser and the tree transformer:
          import java.io.*;
          import antlr.CommonAST;
          import antlr.collections.AST;

          class Calc {
              public static void main(String[] args) {
                  try {
                      CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
                      CalcParser parser = new CalcParser(lexer);
                      // Parse the input expression
                      parser.expr();
                      CommonAST t = (CommonAST)parser.getAST();
                      // Print the resulting tree out in LISP notation
                      System.out.println(t.toLispString());
                      CalcTreeWalker walker = new CalcTreeWalker();
                      // Traverse the tree created by the parser
                      walker.expr(t);
                      // Get the result tree from the walker
                      t = (CommonAST)walker.getAST();
                      System.out.println(t.toLispString());
                  } catch(Exception e) {
                      System.err.println("exception: "+e);
                  }
              }
          }
  36. Left Recursion Solved
      The grammar rule
          E → E + T | T
      written in AntLR as
          expr : expr PLUS term | term ;
      generates code that calls expr() forever without consuming any input:
          expr() { expr(); match(PLUS); term(); }
      Eliminate the left recursion with
          E  → T E'
          E' → + T E' | ε
      which results in:
          expr : term (PLUS term)* ;
  37. Links
      - AntLR Reference Manual by Terence Parr: antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
      - AntLR Tutorial by Ashley J.S Mills: http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlrhome.html
      - An Introduction to AntLR by Terence Parr: http://www.cs.usfca.edu/~parrt/course/652/lectures/antlr.html
      - An AntLR Tutorial by Scott Stanchfield: javadude.com/articles/antlrtut/