The pug templating engine is designed to be easily extensible while providing powerful core functionality. It was recently rewritten to be more modular and structured. This talk discusses how a compiler is built as a series of independent stages.
14. Token Stream
• Next – get the next token and advance the stream
• Peek – get the next token without advancing the stream
• Expect(type) – get the next token and advance the stream, throwing an error if the token is not of the given type
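The three operations above can be sketched as a small class over an array of tokens; the class and method names here are illustrative, not pug's actual API:

```javascript
class TokenStream {
  constructor(tokens) {
    this.tokens = tokens;
    this.index = 0;
  }
  // get the next token and advance the stream
  next() {
    return this.tokens[this.index++];
  }
  // get the next token without advancing the stream
  peek() {
    return this.tokens[this.index];
  }
  // like next(), but throw if the token is not of the expected type
  expect(type) {
    const tok = this.next();
    if (tok.type !== type) {
      throw new Error(`Expected ${type} but got ${tok.type}`);
    }
    return tok;
  }
}
```

Note that `expect` is just `next` plus a type check: it lets the parser state its assumptions and fail loudly when they don't hold.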
16. Parser
parseTag() {
  let tok = this.tokens.expect('tag');
  let body = null;
  switch (this.tokens.peek().type) {
    case 'text':
      body = this.parseText();
      break;
    case 'indent':
      body = this.parseBlock();
      break;
  }
  return {type: 'Tag', name: tok.val, body};
}
24. Code Gen
function render(node) {
  switch (node.type) {
    case 'Block':
      return renderBlock(node);
    case 'Tag':
      return renderTag(node);
    case 'Text':
      return renderText(node);
  }
}
33. Lexer → Parser → Loader → Linker → Code-Gen
(String → Tokens → AST → Collection of ASTs → AST → String)
Now it's your turn!
Editor's Notes
How many of you use one of these?
Raise your hand if you use Webpack or Browserify
[Walk]
Today I hope to give you a better idea of how these tools work, and give you the tools to build and contribute to them yourself.
Many of us write html every day
It's verbose
It lacks features for reusability and dynamic content
I maintain a language that used to be called "jade". Today I'm announcing that, along with the release of version 2.0.0, it has been renamed to "pug"
Pug is a simple language for producing HTML documents. It has a concise syntax, and supports features for code reuse and dynamic content.
There are three main stages to the pug compiler:
[walk]
Lex the source code to convert the string of text into a stream of tokens
Parse the tokens into a tree structure called an abstract syntax tree
Generate output code from the abstract syntax tree
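The three stages above can be sketched as a single pipeline function; `lex`, `parse` and `generateCode` are hypothetical stand-ins for the stages described in the rest of the talk:

```javascript
// Hypothetical top-level pipeline wiring the three stages together.
function compile(source) {
  const tokens = lex(source);   // string -> stream of tokens
  const ast = parse(tokens);    // tokens -> abstract syntax tree
  return generateCode(ast);     // AST -> output string
}
```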
The lexer splits the stream of characters from the source code into logical tokens
The first logical token here is the "article" tag, since we read that as a single unit of meaning
The lexer can be thought of as a simple state machine. For our language, it needs to keep track of:
The remaining source code that has not yet been lexed.
The current level of indentation
We'll define a function for each type of token.
Each function will:
Match the string against a regular expression
If it matches, it will consume those characters
Then return a token of the appropriate type
If it does not match, we return `undefined` and don't consume any characters
For the tag, we simply match any number of characters in the range a to z
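A sketch of that rule, assuming the lexer keeps its remaining source in `this.str` (the names here are illustrative, not pug's actual code):

```javascript
class Lexer {
  constructor(str) {
    this.str = str;   // remaining, un-lexed source code
    this.indent = 0;  // current level of indentation
  }
  // match against a regex; on success consume the characters and
  // return a token, otherwise return undefined
  tag() {
    const match = /^[a-z]+/.exec(this.str);
    if (match) {
      this.str = this.str.slice(match[0].length);
      return {type: 'tag', val: match[0]};
    }
  }
}
```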
The function for text is mostly the same. We match a space, followed by any characters to the end of the line. We then remove the leading space from the value we store in the token.
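A sketch of just this rule, with the same assumed `this.str` state:

```javascript
class Lexer {
  constructor(str) {
    this.str = str;
  }
  // match a space followed by the rest of the line, then strip the
  // leading space from the value stored in the token
  text() {
    const match = /^ [^\n]*/.exec(this.str);
    if (match) {
      this.str = this.str.slice(match[0].length);
      return {type: 'text', val: match[0].slice(1)};
    }
  }
}
```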
Our handling of newlines is a little different. We'll use one function to track indents, outdents and new lines. We match a newline followed by any number of spaces. We then set the number of spaces as the new value of this.indent and compare the new indent against the old indent to decide what token type to return.
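A single-level sketch of that comparison (a full implementation would emit one outdent token per dedented level; names are again illustrative):

```javascript
class Lexer {
  constructor(str) {
    this.str = str;
    this.indent = 0;
  }
  // one rule handles indents, outdents and plain newlines
  indentation() {
    const match = /^\n( *)/.exec(this.str);
    if (match) {
      this.str = this.str.slice(match[0].length);
      const indent = match[1].length;
      const oldIndent = this.indent;
      this.indent = indent;
      if (indent > oldIndent) return {type: 'indent'};
      if (indent < oldIndent) return {type: 'outdent'};
      return {type: 'newline'};
    }
  }
}
```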
To generate the complete stream of tokens we just repeatedly call these methods until there is no more text.
Note how if each token type doesn't match, we simply fall through to the next token type.
The last token type we consider is `fail`, which just throws an error to indicate that there was some unexpected text.
Finally, we push an "end of stream" token, to indicate that there are no more characters to read.
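Putting the pieces together, the whole lexer might look like this end-to-end sketch (rule names and the `eos` token are assumptions based on the description above, not pug's exact code):

```javascript
class Lexer {
  constructor(str) {
    this.str = str;
    this.indent = 0;
    this.tokens = [];
  }
  tag() {
    const m = /^[a-z]+/.exec(this.str);
    if (m) { this.str = this.str.slice(m[0].length); return {type: 'tag', val: m[0]}; }
  }
  text() {
    const m = /^ [^\n]*/.exec(this.str);
    if (m) { this.str = this.str.slice(m[0].length); return {type: 'text', val: m[0].slice(1)}; }
  }
  indentation() {
    const m = /^\n( *)/.exec(this.str);
    if (m) {
      this.str = this.str.slice(m[0].length);
      const indent = m[1].length, old = this.indent;
      this.indent = indent;
      return {type: indent > old ? 'indent' : indent < old ? 'outdent' : 'newline'};
    }
  }
  // unexpected text that no rule matched
  fail() {
    throw new Error('Unexpected text: ' + JSON.stringify(this.str));
  }
  lex() {
    while (this.str.length > 0) {
      // try each token type in turn; an unmatched rule returns
      // undefined, so we fall through to the next one
      const token = this.tag() || this.text() || this.indentation() || this.fail();
      this.tokens.push(token);
    }
    // mark that there are no more characters to read
    this.tokens.push({type: 'eos'});
    return this.tokens;
  }
}
```

The `||` chain is exactly the fall-through behaviour described above: each rule either consumes characters and produces a token, or leaves the state untouched and defers to the next rule.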
Once the lexer has generated a stream of tokens, like this one, it is the job of the parser to transform that stream of tokens into the logical tree structure that matches how we (as programmers) view the code.
[walk]
Notice how the "article" contains the "h1" and the "p"
Our parser starts with a stream of tokens
At the top level, we keep looking to parse a new tag until we get to the end of the stream.
Start by defining the list of nodes, then while the next token is not "end of stream":
- if it's a tag, parseTag
- if it's a newline, ignore it
When parsing a tag, we call into `parseText` or `parseBlock` to recursively parse the content of the tag, depending on the type of the next token.
Parsing a block looks just like parsing the file, except that instead of an "end of stream" token, we are looking for an "outdent" token.
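The parser described above can be sketched as a class around the token stream, reusing the `parseTag` shown earlier; the stream's `peek`/`next`/`expect` interface is assumed:

```javascript
class Parser {
  constructor(tokens) {
    this.tokens = tokens; // a token stream with peek/next/expect
  }
  // top level: keep parsing tags until the end of the stream
  parse() {
    const nodes = [];
    while (this.tokens.peek().type !== 'eos') {
      if (this.tokens.peek().type === 'newline') {
        this.tokens.next(); // ignore blank lines
      } else {
        nodes.push(this.parseTag());
      }
    }
    return {type: 'Block', nodes};
  }
  parseTag() {
    let tok = this.tokens.expect('tag');
    let body = null;
    switch (this.tokens.peek().type) {
      case 'text':
        body = this.parseText();
        break;
      case 'indent':
        body = this.parseBlock();
        break;
    }
    return {type: 'Tag', name: tok.val, body};
  }
  parseText() {
    const tok = this.tokens.expect('text');
    return {type: 'Text', val: tok.val};
  }
  // like parse(), but terminated by an outdent instead of end of stream
  parseBlock() {
    this.tokens.expect('indent');
    const nodes = [];
    while (this.tokens.peek().type !== 'outdent') {
      if (this.tokens.peek().type === 'newline') {
        this.tokens.next();
      } else {
        nodes.push(this.parseTag());
      }
    }
    this.tokens.expect('outdent');
    return {type: 'Block', nodes};
  }
}
```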
Now we have this "abstract syntax tree", we need to generate some output code from it.
We'll recursively convert each node of the tree from the bottom up
[walk]
The code for this starts with a render method. It will recursively call the appropriate function to render each type of node. Note how these methods mirror the methods of the parser.
To render a block, we simply render all the nodes within it.
To render a tag, we simply build the bits of text that the tag itself provides, and then recursively render the content.
Rendering text is just a case of returning the text
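A sketch of those three generators alongside the `render` dispatcher shown earlier; note how they mirror the parser's methods (the exact output strings are illustrative):

```javascript
function render(node) {
  switch (node.type) {
    case 'Block': return renderBlock(node);
    case 'Tag': return renderTag(node);
    case 'Text': return renderText(node);
  }
}
// a block renders as the concatenation of its child nodes
function renderBlock(node) {
  return node.nodes.map(render).join('');
}
// a tag supplies its own open/close text and recursively renders its body
function renderTag(node) {
  const body = node.body ? render(node.body) : '';
  return '<' + node.name + '>' + body + '</' + node.name + '>';
}
// text just returns its value
function renderText(node) {
  return node.val;
}
```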
[walk]
[emphasis]
[smile]
Now you have hopefully seen how the parts of the compiler fit together
Pug also supports splitting your code into multiple files. One of the simplest ways it does this is via `include`s.
An include results in two separate abstract syntax trees, which must be joined together.
To do this without complicating the existing stages of our compiler, we will add an additional stage to our compiler.
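One way such a linking stage could work is to walk the tree and splice the included file's AST in place of each include node. This is only a sketch: the `Include` node shape and the `asts` lookup table of pre-parsed files are assumptions for illustration, not pug's actual linker.

```javascript
function link(ast, asts) {
  if (ast.type === 'Block') {
    const nodes = [];
    for (const node of ast.nodes) {
      if (node.type === 'Include') {
        // replace the include with the linked nodes of the other file
        nodes.push(...link(asts[node.file], asts).nodes);
      } else {
        nodes.push(link(node, asts));
      }
    }
    return {type: 'Block', nodes};
  }
  if (ast.type === 'Tag') {
    return {...ast, body: ast.body ? link(ast.body, asts) : null};
  }
  return ast;
}
```

Because this stage takes an AST in and produces an AST out, the earlier parser and code generator are completely unaffected by it.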
Let's review the stages of this new compiler pipeline
Talk through stages
[walk]
Clear Separation of Stages
Clear Extension Points
Uses our own extension points where possible