|
librdesc
|
Writing a parser in C is a very fun, interactive process, but sometimes an exhausting task. How to ensure its memory safety? How do you test it?
librdesc is designed to abstract away memory management and provide an easy-to-use API.
This document is here to help you in getting started with librdesc by defining a simple pipe-math calculator. You can also check out Examples for more concrete examples.
While it is tempting to insert the parsing code right inside the rest of the logic, it usually results in unmaintainable code. Usually, we separate grammar definition into another header, so you could have a src/pm_grammar.h file containing grammar declaration:
We will later discuss what the magic constants are in the file and why there is +1 in array dimensions.
To demonstrate librdesc's power, we will implement a grammar that handles both right-recursion and left-recursion. Our syntax has two operators:
^: Exponentiation. 2 ^ 3 ^ 1 is evaluated as 2 ^ (3 ^ 1), 8.|: Pipe. 2 ^ 3 | add(10) | square is equal to (8 | add(10)) | square.Statements in our grammar start with a number, which can be expressed as an exponent. Then, for each pipe, from left to right, apply an operation to the number.
Let's first define our grammar formally, as a context free grammar:
This ambiguous CFG does not dictate associativity, but let's assume operators have the associativity we introduced. Then the Concrete Syntax Tree (CST) for 2 ^ 3 ^ 1 | add(10, 2 ^ 1) | square | log; would be (simplified nodes shown on edges):
seminfo. Often, we shall call these token identifiers terminals*, since they appear as terminal symbols in the grammar. The seminfo, if present, contain additional information about the token. This additional information is not part of the grammar.librdesc can be fed char-by-char, or it can operate on tokens with a lexer. It is recommended to perform simpler parsing in the lexer (e.g. collecting numbers and identifiers) and then use librdesc to parse them.
Let's define tokens for punctuation, numbers, and identifiers. The lexer will collect numbers and identifiers and return them as a single token. This way, we avoid having to implement a digit list or an alphanumeric character list, making it easier to handle numbers and identifiers.
We define recursive nonterminals with additional _REST postfixed versions. Later on, we will discuss why this is required.
The symbol table is a 3D array of struct rdesc_grammar_symbol. The three dimensions are: the production (one row per nonterminal), the alternative (one row per alternative within that production), and the symbol (one cell per symbol in the alternative body). librdesc provides rule_macros.h to define this table without manually inserting sentinels.
Here is the list of core macros:
| Macro | |
|---|---|
r(...) | Start a production, alternatives separated by alt |
| alt | Separator between alternatives within a production |
TK(X) | Reference to terminal token TK_X |
NT(X) | Reference to nonterminal NT_X |
r(...) macro adds a sentinel to end of the alternative list and alt adds a sentinel to end of the alternative. That is why we added +1 in grammar struct dimensions.
Using these macros, productions for statement and expression nonterminals can be defined as:
Tokens and nonterminals are automatically prefixed by NT(X)/TK(X) macros. That is why we used those prefixes in enum definition but not in grammar table. See Name Mapping for Identifiers for details.
The grammar table follows ordered-choice semantics - the first matching alternative always wins. In the ambiguous grammar example above, if a pipe token is present, the first alternative is matched; otherwise, the second alternative is matched.
We will use / (ordered-choice) operator with BNF from now on to emphasize ordered-choice, so we can rewrite the grammar we defined so far as:
librdesc provides helper macros to standardize grammar definitions:
| Macro | |
|---|---|
ropt(...) | Expands into two alternatives: an empty alternative (ε) and one with ... |
rrr(...) | Define right-recursive list rule |
| EPSILON | Empty alternative (ε) |
See the documentation for rrr(...) before continuing.
The code above generates:
And the rest of the grammar is:
Which generates:
Complete grammar exists at Pipe-Math grammar definition.