librdesc
Loading...
Searching...
No Matches
Quick Start

Writing a parser in C is a very fun, interactive process, but sometimes an exhausting task. How to ensure its memory safety? How do you test it?

librdesc is designed to abstract away memory management and provide an easy-to-use API.

This document is here to help you in getting started with librdesc by defining a simple pipe-math calculator. You can also check out Examples for more concrete examples.

Code Organization

While it is tempting to insert the parsing code right inside the rest of the logic, it usually results in unmaintainable code. Usually, we separate grammar definition into another header, so you could have a src/pm_grammar.h file containing grammar declaration:

#include <rdesc/grammar.h>
/* let's use `pm` as abbreviation for Pipe Math. */
#define PM_PRODUCTION_COUNT 9
#define PM_MAX_ALTERNATIVE_COUNT 2
#define PM_MAX_ALTERNATIVE_SIZE 4
extern struct rdesc_grammar_symbol pm_grammar[PM_PRODUCTION_COUNT]
#define PM_MAX_ALTERNATIVE_COUNT
Definition pm_grammar.h:15
#define PM_PRODUCTION_COUNT
[Grammar declaration]
Definition pm_grammar.h:13
#define PM_MAX_ALTERNATIVE_SIZE
Definition pm_grammar.h:17
A terminal or nonterminal representing the body (right side) of a production rule.
Definition grammar.h:87

We will later discuss what the magic constants are in the file and why there is +1 in array dimensions.

Writing First Grammar

To demonstrate librdesc's power, we will implement a grammar that handles both right-recursion and left-recursion. Our syntax has two operators:

  1. Right-associative ^: Exponentiation. 2 ^ 3 ^ 1 is evaluated as 2 ^ (3 ^ 1), 8.
  2. Left-associative |: Pipe. 2 ^ 3 | add(10) | square is equal to (8 | add(10)) | square.

Statements in our grammar start with a number, which can be expressed as an exponent. Then, for each pipe, from left to right, apply an operation to the number.

Let's first define our grammar formally, as a context free grammar:

⟨Stmt⟩ → ⟨Expr⟩ ;
⟨Expr⟩ → ⟨ExponentiationExpr⟩ | ⟨PipeExpr⟩
⟨Expr⟩ → ⟨ExponentiationExpr⟩
⟨ExponentiationExpr⟩ → ⟨ExponentiationExpr⟩ ^ ⟨ExponentiationExpr⟩
⟨ExponentiationExpr⟩ → ⟨Num⟩
⟨PipeExpr⟩ → ⟨PipeExpr⟩ | ⟨PipeExpr⟩
⟨PipeExpr⟩ → ⟨FunctionCall⟩
⟨FunctionCall⟩ → ⟨Identifier⟩ ( ⟨FunctionArgs⟩ )
⟨FunctionCall⟩ → ⟨Identifier⟩
⟨FunctionArgs⟩ → ⟨FunctionArgs⟩ , ⟨FunctionArgs⟩
⟨FunctionArgs⟩ → ⟨Expr⟩
⟨Num⟩... (list of digits, definition not shown to avoid crowd)
⟨Identifier⟩... (list of alphanumeric characters, starting with _ or a letter)

This ambiguous CFG does not dictate associativity, but let's assume operators have the associativity we introduced. Then the Concrete Syntax Tree (CST) for 2 ^ 3 ^ 1 | add(10, 2 ^ 1) | square | log; would be (simplified nodes shown on edges):

Token vs. Terminal
In a compiler, the lexical analyzer reads the characters of the source, groups them into lexically meaningful units called lexemes, and produces as output tokens representing these lexemes. A token consists of two components, a token identifier and a seminfo. Often, we shall call these token identifiers terminals*, since they appear as terminal symbols in the grammar. The seminfo, if present, contain additional information about the token. This additional information is not part of the grammar.

Tokens and Nonterminals

librdesc can be fed char-by-char, or it can operate on tokens with a lexer. It is recommended to perform simpler parsing in the lexer (e.g. collecting numbers and identifiers) and then use librdesc to parse them.

Let's define tokens for punctuation, numbers, and identifiers. The lexer will collect numbers and identifiers and return them as a single token. This way, we avoid having to implement a digit list or an alphanumeric character list, making it easier to handle numbers and identifiers.

enum pm_tk {
TK_NUM, // ⟨Num⟩
TK_IDENT, // ⟨Identifier⟩
TK_LPAREN /* ( */, TK_RPAREN /* ) */,
TK_PIPE /* | */, TK_CARET /* ^ */,
TK_COMMA /* , */, TK_SEMI /* ; */,
};
pm_tk
[Grammar declaration]
Definition pm_grammar.h:31

We define recursive nonterminals with additional _REST postfixed versions. Later on, we will discuss why this is required.

enum pm_nt {
/* ⟨Stmt⟩, ⟨Expr⟩ */
NT_STMT, NT_EXPR,
/* ⟨Num⟩ (^ ⟨Num⟩)* */
NT_EXPONENTIATION_EXPR, NT_EXPONENTIATION_EXPR_REST,
/* ⟨FunctionCall⟩ (| ⟨FunctionCall⟩)* */
NT_PIPE_EXPR, NT_PIPE_EXPR_REST,
/* We will define ⟨PipeExpr⟩ as right-associative for now. Later, we
* will use flip function to convert pipe operator left-associative. */
/* ⟨Expr⟩ (, ⟨Expr⟩)* */
NT_FUNCTION_ARG_LS, NT_FUNCTION_ARG_LS_REST,
NT_FUNCTION_CALL,
};
pm_nt
[Token definition]
Definition pm_grammar.h:42

Symbol Table

The symbol table is a 3D array of struct rdesc_grammar_symbol. The three dimensions are: the production (one row per nonterminal), the alternative (one row per alternative within that production), and the symbol (one cell per symbol in the alternative body). librdesc provides rule_macros.h to define this table without manually inserting sentinels.

Here is the list of core macros:

Macro
r(...) Start a production, alternatives separated by alt
alt Separator between alternatives within a production
TK(X) Reference to terminal token TK_X
NT(X) Reference to nonterminal NT_X

r(...) macro adds a sentinel to end of the alternative list and alt adds a sentinel to end of the alternative. That is why we added +1 in grammar struct dimensions.

Using these macros, productions for statement and expression nonterminals can be defined as:

/* <stmt> ::= */ r(
NT(EXPR), TK(SEMI)
),
/* <expr> ::= */ r(
NT(EXPONENTIATION_EXPR), TK(PIPE), NT(PIPE_EXPR)
alt NT(EXPONENTIATION_EXPR)
)
#define NT(nt)
Macro to create a nonterminal production symbol.
Definition rule_macros.h:58
#define TK(tk)
Macro to create a terminal (token) production symbol.
Definition rule_macros.h:56
#define r(...)
Macro to define a grammar rule. Adds end of alternative and end of production body sentinels to gramm...
Definition rule_macros.h:72
#define alt
Separates grammar alternatives (alternative separator).
Definition rule_macros.h:133

Tokens and nonterminals are automatically prefixed by NT(X)/TK(X) macros. That is why we used those prefixes in enum definition but not in grammar table. See Name Mapping for Identifiers for details.

The grammar table follows ordered-choice semantics - the first matching alternative always wins. In the ambiguous grammar example above, if a pipe token is present, the first alternative is matched; otherwise, the second alternative is matched.

We will use / (ordered-choice) operator with BNF from now on to emphasize ordered-choice, so we can rewrite the grammar we defined so far as:

<stmt> ::= <expr> ";"
<expr> ::= <exponentiation_expr> "|" <pipe_expr>
/ <exponentiation_expr>

librdesc provides helper macros to standardize grammar definitions:

Macro
ropt(...) Expands into two alternatives: an empty alternative (ε) and one with ...
rrr(...) Define right-recursive list rule
EPSILON Empty alternative (ε)

See the documentation for rrr(...) before continuing.

/* <exponentiation_expr> ::= */
rrr(EXPONENTIATION_EXPR, (TK(NUM)), (TK(CARET), TK(NUM))),
/* this also expands to <exponentiation_expr_rest> */
/* <pipe_expr> ::= */
rrr(PIPE_EXPR, (NT(FUNCTION_CALL)), (TK(PIPE), NT(FUNCTION_CALL))),
/* similarly, this also expands to <pipe_expr_rest> */
/* <function_arg_ls> ::= */
rrr(FUNCTION_ARG_LS, (NT(EXPR)), (TK(COMMA), NT(EXPR)))
/* <function_arg_ls_rest> */
#define rrr(head, base, suffix)
Defines right-recursive list rules. base and suffix parameters should wrapped with parenthesis (()).
Definition rule_macros.h:121

The code above generates:

<exponentiation_expr> ::= TK_NUM <exponentiation_expr_rest>
<exponentiation_expr_rest> ::= "^" TK_NUM <exponentiation_expr_rest>
/ E
<pipe_expr> ::= <function_call> <pipe_expr_rest>
<pipe_expr_rest> ::= "|" <function_call> <pipe_expr_rest>
/ E
<function_arg_ls> ::= <expr> <function_arg_ls_rest>
<function_arg_ls_rest> ::= "," <expr> <function_arg_ls_rest>
/ E

And the rest of the grammar is:

/* <function_call> ::= */ r(
TK(IDENT), TK(LPAREN), NT(FUNCTION_ARG_LS), TK(RPAREN)
alt TK(IDENT)
)

Which generates:

<function_call> ::= TK_IDENT "(" <function_arg_ls> ")"
/ TK_IDENT

Complete grammar exists at Pipe-Math grammar definition.