librdesc
Writing a parser in C is a fun, interactive process, but it can also be exhausting. How do you ensure memory safety? How do you keep parsing transactional and recover from errors? How do you test it?
librdesc is designed to abstract away resource management and provide an easy-to-use API. This document helps you get started with librdesc by building a simple Pipe-Math calculator.
The program used throughout this document is provided in docs/quick-start/src/ and can be built with make on Unix-like systems.
While it is tempting to insert the parsing code directly into the rest of your logic, doing so usually results in unmaintainable code. Instead, the grammar declaration is typically separated into its own header, so your project could have a src/grammar.h file containing it.
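A sketch of what that header might contain is shown below; the constant names, dimensions, and the include line are illustrative assumptions for this document, not librdesc's verbatim API:

```c
/* Hypothetical src/grammar.h sketch; names and dimensions are illustrative. */
#ifndef GRAMMAR_H
#define GRAMMAR_H

#include "rdesc.h"      /* assumed librdesc header name */

#define PM_MAX_ALTS 3   /* most alternatives in any production */
#define PM_MAX_SYMS 4   /* most symbols in any alternative */

extern const struct rdesc_grammar_symbol
    pm_grammar[NT_COUNT][PM_MAX_ALTS + 1][PM_MAX_SYMS + 1];

#endif /* GRAMMAR_H */
```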
We will discuss later what the magic constants in the file are and why there is a +1 in the array dimensions.
To demonstrate librdesc's power, we will implement a grammar that handles both right-recursion and left-recursion. Our syntax has two operators:
- `^`: Exponentiation, right-associative: `2 ^ 3 ^ 1` is evaluated as `2 ^ (3 ^ 1)`, which is 8.
- `|`: Pipe, left-associative: `2 ^ 3 | add(10) | square` is equal to `(8 | add(10)) | square`.

Statements in our grammar start with a number, which can be expressed as an exponentiation. Then, for each pipe, from left to right, an operation is applied to the number.
Let's first define our grammar formally, as a context free grammar:
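One way to write it is shown below; treat the exact productions as a sketch, with the nonterminal names anticipating those used later in this document, and NUMBER and IDENT standing for lexer tokens:

```
<statement>     ::= <expression> ";"
<expression>    ::= <expression> "|" <function_call>
                  | <expression> "^" <expression>
                  | NUMBER
<function_call> ::= IDENT "(" <arg_list> ")"
                  | IDENT
<arg_list>      ::= <expression> "," <arg_list>
                  | <expression>
```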
This ambiguous CFG does not dictate associativity, but let's assume operators have the associativity we introduced. Then the Concrete Syntax Tree (CST) for 2 ^ 3 ^ 1 | add(10, 2 ^ 1) | square | log; would be (simplified nodes shown on edges):
Before continuing to read this documentation, you may prefer to review the librdesc Glossary to ensure we are on the same page regarding terminology.
librdesc can be fed character by character, or it can operate on tokens produced by a lexer. It is recommended to perform the simpler parsing in the lexer (e.g. collecting numbers and identifiers) and then use librdesc to parse the resulting tokens.
Let's define tokens for punctuation, numbers, and identifiers. The lexer will collect numbers and identifiers and return each as a single token, so the grammar does not need digit-list or alphanumeric-character-list rules. Continuing the grammar definitions in src/grammar.h:
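The original listing is not reproduced here; a hypothetical continuation of src/grammar.h could look like the following, where the exact token and nonterminal names are assumptions for illustration:

```c
/* Hypothetical continuation of src/grammar.h; names are illustrative. */
enum pm_token {
    TK_NOTOKEN = 0,  /* index 0 is reserved by exblex for EOF/NOTOKEN */
    TK_NUM,          /* a whole number, collected by the lexer */
    TK_IDENT,        /* a whole identifier, collected by the lexer */
    TK_CARET,        /* ^ */
    TK_PIPE,         /* | */
    TK_LPAREN,       /* ( */
    TK_RPAREN,       /* ) */
    TK_COMMA,        /* , */
    TK_SEMICOLON,    /* ; */
    TK_COUNT
};

enum pm_nonterminal {
    NT_STATEMENT = 0,
    NT_PIPE_EXPR,
    NT_PIPE_EXPR_REST,  /* _REST variant for the recursive pipe list */
    NT_EXP_EXPR,
    NT_FUNCTION_CALL,
    NT_ARG_LIST,
    NT_ARG_LIST_REST,   /* _REST variant for the recursive argument list */
    NT_COUNT
};
```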
We define recursive nonterminals with additional _REST postfixed versions. Later on, we will discuss why this is required.
The symbol table is a 3D array of struct rdesc_grammar_symbol. The three dimensions are: the production (one row per nonterminal), the alternative (one row per alternative within that production), and the symbol (one cell per symbol in the alternative body). librdesc provides rule_macros.h to define this table without manually inserting sentinels.
Here is the list of core macros:
| Macro | Description |
|---|---|
| `r(...)` | Start a production; alternatives separated by `alt` |
| `alt` | Separator between alternatives within a production |
| `TK(X)` | Reference to terminal token `TK_X` |
| `NT(X)` | Reference to nonterminal `NT_X` |
The `r(...)` macro adds a sentinel to the end of the alternative list, and `alt` adds a sentinel to the end of each alternative. That is why we added +1 to the symbol table dimensions.
Using these macros, the productions for the statement and expression nonterminals can be defined in src/grammar.c as:
Tokens and nonterminals are automatically prefixed by NT(X)/TK(X) macros. That is why we used those prefixes in enum definition but not in grammar table. See Name Mapping for Identifiers for details.
The grammar table follows ordered-choice semantics: the first matching alternative always wins. In the ambiguous grammar example above, if a pipe token is present, the first alternative matches; otherwise, the second one does.
We will use the / (ordered-choice) operator with BNF from now on to emphasize ordered choice, so the grammar we have defined so far can be rewritten as:
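As a sketch of that rewrite (first match wins, so the pipe alternative comes first):

```
<statement>  ::= <expression> ";"
<expression> ::= <expression> "|" <function_call>
               / <expression> "^" <expression>
               / NUMBER
```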
librdesc provides helper macros to standardize grammar definitions:
| Macro | Description |
|---|---|
| `ropt(...)` | Expands into two alternatives: an empty alternative (ε) and one with `...` |
| `rrr(...)` | Define a right-recursive list rule |
| `EPSILON` | Empty alternative (ε) |
See the documentation for rrr(...) before continuing.
The code above generates:
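The generated rules are not shown verbatim here; based on the `_REST` naming convention described earlier, they plausibly look like:

```
<pipe_expr>      ::= <exp_expr> <pipe_expr_rest>
<pipe_expr_rest> ::= "|" <function_call> <pipe_expr_rest>
                   / EPSILON
```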
And the rest of the grammar is:
Which generates:
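As a plausible sketch rather than the verbatim output, the remaining rules would be:

```
<exp_expr>      ::= NUMBER "^" <exp_expr>
                  / NUMBER
<function_call> ::= IDENT "(" <arg_list> ")"
                  / IDENT
<arg_list>      ::= <pipe_expr> <arg_list_rest>
<arg_list_rest> ::= "," <pipe_expr> <arg_list_rest>
                  / EPSILON
```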
The complete grammar is available at Pipe-Math Grammar Definition.
With the symbol table defined, we can begin implementing our program, starting by initializing the grammar in src/main.c:
There are two grammar initialization functions: rdesc_grammar_init() and its statically checked version, rdesc_grammar_init_checked(). The latter checks that the provided dimensions are valid and raises an error if they are not, but it cannot be used with dynamically allocated grammar tables.
We can dump our grammar rules in BNF format via the rdesc_dump_bnf() utility. Let's add an option for printing out grammar rules, but first we need to define names of tokens and nonterminals:
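The name tables themselves are not reproduced here; they could look like the following, where the array names and exact spellings are assumptions for illustration:

```c
/* Hypothetical name tables for rdesc_dump_bnf(); names are illustrative.
   Order must match the token and nonterminal enums. */
static const char *pm_token_names[] = {
    "NOTOKEN", "NUM", "IDENT", "^", "|", "(", ")", ",", ";"
};
static const char *pm_nonterminal_names[] = {
    "statement", "pipe_expr", "pipe_expr_rest",
    "exp_expr", "function_call", "arg_list", "arg_list_rest"
};
```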
Then, add the option to the main function:
Now, you can dump the grammar using ./pm dump_bnf.
Lexical analysis is out of scope for this document. We will use exblex, a minimal single-header lexer included with librdesc's examples, to keep the example self-contained.
exblex tokenizes input in three ways:

- Punctuation: single characters matched literally as their own tokens.
- `w`: sequences of alphanumeric characters and underscores, matched as TK_IDENT.
- `d`: sequences of numeric digits, matched as TK_NUM.

Token IDs are determined by position in the token table. Index 0 is reserved by exblex for EOF/NOTOKEN, so TK_NUM must start from 1.
Punctuation tokens only return a token ID in exblex, but identifiers and numbers also return seminfo containing their string value.
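exblex itself is not reproduced here, but the idea can be illustrated with a self-contained sketch; the function and token names below are illustrative, not exblex's API:

```c
#include <ctype.h>
#include <stddef.h>

/* Toy token IDs: index 0 reserved for EOF/NOTOKEN, as in exblex. */
enum { TOK_EOF = 0, TOK_NUM, TOK_IDENT, TOK_PUNCT };

/* Read one token from *src, advance the cursor, copy its text into buf.
   Mirrors exblex's three ways of tokenizing: digit runs ('d'), word
   runs ('w'), and literal punctuation. Illustrative only. */
static int next_token(const char **src, char *buf, size_t bufsz) {
    const char *s = *src;
    size_t n = 0;
    while (*s == ' ') s++;                        /* skip spaces */
    if (*s == '\0') { *src = s; return TOK_EOF; }
    if (isdigit((unsigned char)*s)) {             /* 'd': digits -> one number token */
        while (isdigit((unsigned char)*s) && n + 1 < bufsz) buf[n++] = *s++;
        buf[n] = '\0'; *src = s; return TOK_NUM;
    }
    if (isalpha((unsigned char)*s) || *s == '_') { /* 'w': word -> one identifier token */
        while ((isalnum((unsigned char)*s) || *s == '_') && n + 1 < bufsz)
            buf[n++] = *s++;
        buf[n] = '\0'; *src = s; return TOK_IDENT;
    }
    buf[0] = *s++; buf[1] = '\0';                 /* punctuation: one literal char */
    *src = s; return TOK_PUNCT;
}
```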
With the grammar and lexer ready, we can now initialize our parser:
The parser requires a function pointer to a token destroyer. See Memory Ownership Semantics for details.
This snippet reads a line from stdin and feeds tokens to the parser, handling EOF and OOM:
For debugging, let's add an option for CST dumping:
The CST dump function requires a function pointer for printing out nodes. We will discuss the attributes of nonterminal and token nodes shortly; in the printer function, we use the type, id, and seminfo fields:
Now, you can dump the CST as a Graphviz dot file using ./pm dump_cst. I recommend examining the dumped CST file using an online Graphviz visualizer.
With the grammar initialized, the lexer providing tokens, and the parser matching a statement, we can now evaluate the resulting CST. We traverse the tree using the following macros from cst_macros.h:
| Macro | Used for | Description |
|---|---|---|
| `rparent()` | Both | Pointer to the parent of the node |
| `rtype()` | Both | Node type: `RDESC_TOKEN` or `RDESC_NONTERMINAL` |
| `rid()` | Both | Token or nonterminal ID |
| `rseminfo()` | Tokens | Additional seminfo of the token |
| `ralt_idx()` | Nonterminals | Index of the matched alternative |
| `rchild_count()` | Nonterminals | Total number of child nodes |
| `rchild()` | Nonterminals | Get a child by its index |
Using the macros above, we can write a recursive function evaluating Pipe-Math statements, called from the program entry:
The following function evaluates each CST node by dispatching on nonterminal ID and alternative index, in src/interpreter.c:
We free the seminfo field when extracting numbers from token nodes. After a successful match, the caller is responsible for destroying token seminfo, as described in Memory Ownership Semantics.
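The interpreter's dispatch pattern can be illustrated with a self-contained sketch. The node layout below is a toy stand-in, not librdesc's CST, and only the exponent rule is shown; comments point at the corresponding CST macros:

```c
#include <stdlib.h>

/* Toy stand-in for a CST node; illustrative, not librdesc's real layout. */
struct node {
    int is_token;           /* cf. rtype() */
    int alt_idx;            /* cf. ralt_idx() */
    double value;           /* number-token seminfo, cf. rseminfo() */
    struct node *child[3];  /* cf. rchild() */
};

/* Integer exponentiation, to avoid a libm dependency in this sketch. */
static double ipow(double base, double e) {
    double r = 1;
    for (int i = 0; i < (int)e; i++) r *= base;
    return r;
}

/* Evaluate an <exp_expr> subtree by dispatching on the alternative index,
   the same pattern the interpreter uses: alternative 0 is the
   right-recursive NUM "^" <exp_expr> form, alternative 1 is a bare NUM. */
static double eval_exp(const struct node *n) {
    if (n->is_token) return n->value;
    if (n->alt_idx == 0)
        return ipow(eval_exp(n->child[0]), eval_exp(n->child[2]));
    return eval_exp(n->child[0]);
}

/* Helpers to build toy trees by hand. */
static struct node *num(double v) {
    struct node *n = calloc(1, sizeof *n);
    n->is_token = 1; n->value = v;
    return n;
}
static struct node *caret(struct node *base, struct node *rest) {
    struct node *n = calloc(1, sizeof *n);
    n->alt_idx = 0;
    n->child[0] = base;
    n->child[1] = num('^');  /* placeholder for the "^" token node */
    n->child[2] = rest;
    return n;
}
```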
Pipe operators are semantically left-associative: a | f | g means (a | f) | g. However, rrr() generates right-recursive rules, so the matched <pipe_expr> tree is right-associative by structure. Before evaluating, we call rdesc_flip_left() to rotate the subtree in-place, converting it to left-associative order. Evaluation then proceeds left-to-right:
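rdesc_flip_left() itself operates on librdesc's CST; its effect can be sketched with a self-contained rotation on toy binary nodes (names and layout are illustrative):

```c
#include <stdlib.h>

/* Toy binary node: a leaf holds a label, an internal node is one pipe
   application. Illustrative only; librdesc's CST layout differs. */
struct t {
    struct t *l, *r;
    char label;  /* set on leaves */
};

static int is_leaf(const struct t *n) { return n->l == NULL && n->r == NULL; }

static struct t *leaf(char label) {
    struct t *n = calloc(1, sizeof *n);
    n->label = label;
    return n;
}
static struct t *pipe_node(struct t *l, struct t *r) {
    struct t *n = calloc(1, sizeof *n);
    n->l = l; n->r = r;
    return n;
}

/* Rotate a right-leaning spine, P(a, P(b, c)), into the left-leaning
   P(P(a, b), c), preserving the left-to-right order of the elements;
   this is the kind of in-place restructuring rdesc_flip_left() performs
   on a right-recursive list. */
static struct t *flip_left(struct t *n) {
    while (!is_leaf(n->r)) {
        struct t *r = n->r;  /* r = P(b, rest)       */
        n->r = r->l;         /* n = P(a, b)          */
        r->l = n;            /* r = P(P(a, b), rest) */
        n = r;
    }
    return n;
}
```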
Important: After the flip, node IDs and alternative indices reflect the new tree structure. The function converts the structure represented by:
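In sketch form (a reconstruction, not the verbatim rules), the right-recursive shape is:

```
<pipe_expr>      ::= <exp_expr> <pipe_expr_rest>
<pipe_expr_rest> ::= "|" <function_call> <pipe_expr_rest>
                   / EPSILON
```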
into the left-recursive equivalent:
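A sketch of the left-recursive form, with the recursive alternative first and the base case second:

```
<pipe_expr> ::= <pipe_expr> "|" <function_call>
              / <exp_expr>
```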
Now the list starts and continues with <pipe_expr> in alternative 0, and ends with <pipe_expr> in alternative 1. See rdesc_flip_left() for details.
That's it! As mentioned earlier, the sample code for this document is located in the docs/quick-start/src/ folder. For brevity, we did not include the implementation for evaluation of the <function_call> expression, but you can find it in that folder.
To see how rdesc_flip_left() restructures the tree, try the following:

1. Define a grammar that uses `rrr()` for a list, wrapped in a start production that consumes a terminating token (e.g. `;`). The terminating token is required because the `_REST` nonterminal matches epsilon only when the next token does not match the list separator, so the parser reaches RDESC_READY only after consuming the end symbol.
2. Parse an input such as `a | b | c ;`.
3. Dump the CST with rdesc_dump_cst() before calling rdesc_flip_left().
4. Call rdesc_flip_left() and dump the CST again.

For more complete examples, see Examples.
You should call the destroy functions for both the grammar and the parser instances. The parser's destructor calls token_destroyer for tokens the parser owns:
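The cleanup code is not reproduced here; its shape would be something like the following sketch, where both destroy-function names are hypothetical placeholders, not confirmed librdesc API:

```c
/* Hypothetical cleanup sketch; function names are placeholders. */
rdesc_parser_destroy(&parser);   /* invokes token_destroyer on owned tokens */
rdesc_grammar_destroy(&grammar);
```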