librdesc
Quick Start

Writing a parser in C is a fun, interactive process, but it can be an exhausting task. How do you ensure memory safety? How do you keep parsing transactional and recover from errors? How do you test it?

librdesc is designed to abstract away resource management and provide an easy-to-use API. This document helps you get started with librdesc by building a simple Pipe-Math calculator.

The program used throughout this documentation is provided in docs/quick-start/src/ and can be built using make on Unix-like systems.

Code Organization

While it is tempting to insert the parsing code right inside the rest of the logic, doing so usually results in unmaintainable code. Instead, the grammar declaration is usually separated into a header, so your project could have a src/grammar.h file containing it.

#include <rdesc/grammar.h>
/* let's use `pm` as abbreviation for Pipe-Math. */
#define PM_PRODUCTION_COUNT 9
#define PM_MAX_ALTERNATIVE_COUNT 2
#define PM_MAX_ALTERNATIVE_SIZE 4
extern struct rdesc_grammar_symbol pm_grammar[PM_PRODUCTION_COUNT]
[PM_MAX_ALTERNATIVE_COUNT + 1]
[PM_MAX_ALTERNATIVE_SIZE + 1];

We will later discuss what these magic constants mean and why there is a +1 in the array dimensions.

Writing First Grammar

To demonstrate librdesc's power, we will implement a grammar that handles both right-recursion and left-recursion. Our syntax has two operators:

  1. Right-associative ^: Exponentiation. 2 ^ 3 ^ 1 is evaluated as 2 ^ (3 ^ 1), which is 8.
  2. Left-associative |: Pipe. 2 ^ 3 | add(10) | square is equal to ((2 ^ 3) | add(10)) | square.

Statements in our grammar start with a number, which may itself be an exponentiation expression. Then, each pipe, from left to right, applies an operation to that number.

Let's first define our grammar formally, as a context free grammar:

⟨Stmt⟩ → ⟨Expr⟩ ;
⟨Expr⟩ → ⟨ExponentiationExpr⟩ | ⟨PipeExpr⟩
⟨Expr⟩ → ⟨ExponentiationExpr⟩
⟨ExponentiationExpr⟩ → ⟨ExponentiationExpr⟩ ^ ⟨ExponentiationExpr⟩
⟨ExponentiationExpr⟩ → ⟨Num⟩
⟨PipeExpr⟩ → ⟨PipeExpr⟩ | ⟨PipeExpr⟩
⟨PipeExpr⟩ → ⟨FunctionCall⟩
⟨FunctionCall⟩ → ⟨Identifier⟩ ( ⟨FunctionArgs⟩ )
⟨FunctionCall⟩ → ⟨Identifier⟩
⟨FunctionArgs⟩ → ⟨FunctionArgs⟩ , ⟨FunctionArgs⟩
⟨FunctionArgs⟩ → ⟨Expr⟩
⟨Num⟩... (list of digits, definition not shown for brevity)
⟨Identifier⟩... (list of alphanumeric characters, starting with _ or a letter)

This ambiguous CFG does not dictate associativity, but let's assume the operators have the associativity introduced above. Then the Concrete Syntax Tree (CST) for 2 ^ 3 ^ 1 | add(10, 2 ^ 1) | square | log; would be (simplified nodes shown on edges):

Before continuing to read this documentation, you may prefer to review the librdesc Glossary to ensure we are on the same page regarding terminology.

Defining Tokens and Nonterminals

librdesc can be fed char-by-char, or it can operate on tokens from a lexer. It is recommended to perform the simpler parts of parsing in the lexer (e.g. collecting numbers and identifiers) and then use librdesc to parse the token stream.

Let's define tokens for punctuation, numbers, and identifiers. The lexer will collect numbers and identifiers and return them as a single token. This way, we avoid having to implement a digit list or an alphanumeric character list, making it easier to handle numbers and identifiers. Continuing the grammar definitions in src/grammar.h:

enum pm_tk {
TK_NUM = 1, // ⟨Num⟩
TK_IDENT, // ⟨Identifier⟩
TK_LPAREN /* ( */, TK_RPAREN /* ) */,
TK_PIPE /* | */, TK_CARET /* ^ */,
TK_COMMA /* , */, TK_SEMI /* ; */,
};
Note
You are free to start the token enum from any number, but in the example above we start from 1 because of the lexer we will use.

We define recursive nonterminals with additional _REST-suffixed versions. Later on, we will discuss why this is required.

enum pm_nt {
/* ⟨Stmt⟩, ⟨Expr⟩ */
NT_STMT, NT_EXPR,
/* ⟨Num⟩ (^ ⟨Num⟩)* */
NT_EXPONENTIATION_EXPR, NT_EXPONENTIATION_EXPR_REST,
/* ⟨FunctionCall⟩ (| ⟨FunctionCall⟩)* */
NT_PIPE_EXPR, NT_PIPE_EXPR_REST,
/* We will define ⟨PipeExpr⟩ as right-associative for now. Later, we
* will use the flip function to make the pipe operator left-associative. */
/* ⟨Expr⟩ (, ⟨Expr⟩)* */
NT_FUNCTION_ARG_LS, NT_FUNCTION_ARG_LS_REST,
NT_FUNCTION_CALL,
};

Symbol Table

The symbol table is a 3D array of struct rdesc_grammar_symbol. The three dimensions are: the production (one row per nonterminal), the alternative (one row per alternative within that production), and the symbol (one cell per symbol in the alternative body). librdesc provides rule_macros.h to define this table without manually inserting sentinels.

Here is the list of core macros:

Macro  Description
r(...)  Start a production; alternatives are separated by alt
alt  Separator between alternatives within a production
TK(X)  Reference to terminal token TK_X
NT(X)  Reference to nonterminal NT_X

The r(...) macro adds a sentinel to the end of the alternative list, and alt adds a sentinel to the end of each alternative. That is why we added +1 to the symbol table dimensions.

Using these macros, productions for statement and expression nonterminals can be defined as, in src/grammar.c:

struct rdesc_grammar_symbol pm_grammar[PM_PRODUCTION_COUNT]
[PM_MAX_ALTERNATIVE_COUNT + 1]
[PM_MAX_ALTERNATIVE_SIZE + 1] = {
/* <stmt> ::= */ r(
NT(EXPR), TK(SEMI)
),
/* <expr> ::= */ r(
NT(EXPONENTIATION_EXPR), TK(PIPE), NT(PIPE_EXPR)
alt NT(EXPONENTIATION_EXPR)
)

Tokens and nonterminals are automatically prefixed by the NT(X)/TK(X) macros. That is why we used those prefixes in the enum definition but not in the grammar table. See Name Mapping for Identifiers for details.

The grammar table follows ordered-choice semantics: the first matching alternative always wins. In the ambiguous grammar example above, if a pipe token is present, the first alternative matches; otherwise, the second one does.

From now on, we will use the / (ordered-choice) operator with BNF to emphasize ordered choice, so the grammar we defined so far can be rewritten as:

<stmt> ::= <expr> ";"
<expr> ::= <exponentiation_expr> "|" <pipe_expr>
/ <exponentiation_expr>

librdesc provides helper macros to standardize grammar definitions:

Macro  Description
ropt(...)  Expands into two alternatives: an empty alternative (ε) and one with ...
rrr(...)  Define a right-recursive list rule
EPSILON  Empty alternative (ε)

See the documentation for rrr(...) before continuing.

/* <exponentiation_expr> ::= */
rrr(EXPONENTIATION_EXPR, (TK(NUM)), (TK(CARET), TK(NUM))),
/* this also expands to <exponentiation_expr_rest> */
/* <pipe_expr> ::= */
rrr(PIPE_EXPR, (NT(FUNCTION_CALL)), (TK(PIPE), NT(FUNCTION_CALL))),
/* similarly, this also expands to <pipe_expr_rest> */
/* <function_arg_ls> ::= */
rrr(FUNCTION_ARG_LS, (NT(EXPR)), (TK(COMMA), NT(EXPR)))
/* <function_arg_ls_rest> */

The code above generates:

<exponentiation_expr> ::= TK_NUM <exponentiation_expr_rest>
<exponentiation_expr_rest> ::= "^" TK_NUM <exponentiation_expr_rest>
/ E
<pipe_expr> ::= <function_call> <pipe_expr_rest>
<pipe_expr_rest> ::= "|" <function_call> <pipe_expr_rest>
/ E
<function_arg_ls> ::= <expr> <function_arg_ls_rest>
<function_arg_ls_rest> ::= "," <expr> <function_arg_ls_rest>
/ E

And the rest of the grammar is:

/* <function_call> ::= */ r(
TK(IDENT), TK(LPAREN), NT(FUNCTION_ARG_LS), TK(RPAREN)
alt TK(IDENT)
)

Which generates:

<function_call> ::= TK_IDENT "(" <function_arg_ls> ")"
/ TK_IDENT

Complete grammar exists at Pipe-Math Grammar Definition.

Initializing the Grammar

With the symbol table defined, we can begin implementing our program, starting by initializing the grammar in src/main.c:

#include <rdesc/grammar.h>
struct rdesc_grammar grammar;
/* Returns a non-zero value on memory allocation error; abort the program in
 * that case. */
assert(rdesc_grammar_init_checked(&grammar,
PM_PRODUCTION_COUNT,
PM_MAX_ALTERNATIVE_COUNT,
PM_MAX_ALTERNATIVE_SIZE,
pm_grammar) == 0);

There are two grammar initialization functions: rdesc_grammar_init() and its statically checked version, rdesc_grammar_init_checked(). The latter checks that the provided dimensions are valid, raising an error if they are not; however, it cannot be used with dynamically allocated grammar tables.

We can dump our grammar rules in BNF format via the rdesc_dump_bnf() utility. Let's add an option for printing out the grammar rules, but first we need to define the names of tokens and nonterminals:

const char *tk_names[] = {
"\0", /* We skip token with id 0, as we started enum from 1. */
"@num", "@ident", /* rdesc_dump_bnf utility wraps tokens with double
* quotes ("). Use @ to suppress automatic wrap. */
"(", ")", "|", "^", ",", ";",
};
const char *nt_names[] = {
"stmt", "expr",
"exponentiation_expr", "exponentiation_expr_rest",
"pipe_expr", "pipe_expr_rest",
"function_arg_ls", "function_arg_ls_rest",
"function_call",
};

Then, add the option to the main function:

#include <rdesc/util.h>
if (argc > 1 && strcmp(argv[1], "dump_bnf") == 0) {
rdesc_dump_bnf(stdout, &grammar, tk_names, nt_names);
}

Now, you can dump the grammar using ./pm dump_bnf.

Lexer

Lexical analysis is out of scope for this document. We will use exblex, a minimal single-header lexer included with librdesc's examples, to keep the example self-contained.

exblex tokenizes input in three ways:

  • Words (w): sequences of alphanumeric characters and underscores, matched as TK_IDENT.
  • Digits (d): sequences of numeric digits, matched as TK_NUM.
  • Punctuation: single characters looked up in a token table, matched to their corresponding token id.

Token IDs are determined by position in the token table. Index 0 is reserved by exblex for EOF/NOTOKEN, so TK_NUM must start from 1.

const char exblex_tks[] = {
'\0', /* exblex requires null termination at the beginning and end */
'd', /* num */ 'w', /* ident */
'(', ')', '|', '^', ',', ';',
'\0'
};

In exblex, punctuation tokens only yield a token ID, but identifiers and numbers also yield seminfo containing their string value.

Initializing the Parser

Having the grammar and lexer ready, we now can initialize our parser:

#include <rdesc/rdesc.h>
struct rdesc parser;
/* Similar to grammar_init, this function also returns non-zero in case of
* memory allocation failure. */
assert(rdesc_init(&parser, &grammar, sizeof(char *), tk_destroyer) == 0);

The parser requires a function pointer to a destroyer function. See Memory Ownership Semantics for details.

void tk_destroyer(uint16_t id, void *seminfo)
{
if (id == TK_NUM || id == TK_IDENT) {
char *string;
/* seminfo is a pointer to user-specific data, which is char *
* in our program. Extract the char * from void * by
* type-punning. */
memcpy(&string, seminfo, sizeof(char *));
free(string);
}
}

This snippet reads a line from stdin and feeds tokens to the parser, handling EOF and OOM:

struct exblex lexer;
enum rdesc_result res;
char buf[4096];
if (fgets(buf, sizeof buf, stdin) == NULL)
buf[0] = '\0';
exblex_init(&lexer, buf, exblex_tks);
/* Memory allocation assertion. */
assert(rdesc_start(&parser, NT_STMT) == 0);
do {
uint16_t tk_id = exblex_next(&lexer);
char *tk_seminfo = exblex_current_seminfo(&lexer);
/* EOF before match */
if (tk_id == 0) {
break;
}
res = rdesc_pump(&parser, tk_id, &tk_seminfo);
/* Retry if allocation failed. */
while (res == RDESC_ENOMEM)
res = rdesc_resume(&parser);
} while (res == RDESC_CONTINUE);

For debugging, let's add an option for CST dumping:

#include <rdesc/util.h>
if (argc > 1 && strcmp(argv[1], "dump_cst") == 0 && res == RDESC_READY) {
rdesc_dump_cst(stdout, &parser, node_printer);
}

The CST dump function requires a function pointer for printing out nodes. We will later discuss which attributes nonterminals and tokens have; in the printer function below, we use the type, id, and seminfo fields:

#include <rdesc/cst_macros.h>
void node_printer(FILE *out, const struct rdesc_node *node)
{
if (rtype(node) == RDESC_TOKEN) {
if (rid(node) == TK_NUM || rid(node) == TK_IDENT) {
char *string;
memcpy(&string, rseminfo(node), sizeof(char *));
fprintf(out, "[shape=box,label=<%s>]", string);
} else {
fprintf(out, "[shape=box,label=<%s>]", tk_names[rid(node)]);
}
} else {
fprintf(out, "[label=\"%s\"]", nt_names[rid(node)]);
}
}

Now, you can dump the CST as a Graphviz DOT file using ./pm dump_cst. We recommend examining the dumped CST file with an online Graphviz visualizer.

Interpreter

With the grammar initialized, the lexer providing tokens, and the parser matching a statement, we can now evaluate the resulting CST. We traverse the tree using the following macros from cst_macros.h:

Macro  Used for  Description
rparent()  Both  Pointer to the parent of the node
rtype()  Both  Node type: RDESC_TOKEN or RDESC_NONTERMINAL
rid()  Both  Token or nonterminal ID
rseminfo()  Tokens  Additional seminfo of the token
ralt_idx()  Nonterminals  Index of the matched alternative
rchild_count()  Nonterminals  Total number of child nodes
rchild()  Nonterminals  Get a child by its index

Using the macros above, we can write a recursive function evaluating Pipe-Math statements, called from the program entry:

if (res == RDESC_READY) {
printf(">> %.2lf\n", pm_interpreter(&parser, rdesc_root(&parser)));
}

The following function evaluates each CST node by dispatching on nonterminal ID and alternative index, in src/interpreter.c:

#include <rdesc/rdesc.h>
#include <rdesc/cst_macros.h>
#include <rdesc/util.h>
/* We only call this function with nonterminals, starting from <stmt>. */
double pm_interpreter(struct rdesc *p,
struct rdesc_node *n)
{
switch (rid(n)) {
case NT_STMT:
/* continue evaluation from first child (<expr>) */
return pm_interpreter(p, rchild(p, n, 0));
case NT_EXPR:
switch (ralt_idx(n)) {
case 0: /* <exponentiation_expr> "|" <pipe_expr>*/
/* flip <pipe_expr> to left-associative order */
rdesc_flip_left(p, n, 2);
/* Start pipe evaluation from left-to-right. */
return pm_interpret_pipe(p,
rchild(p, n, 2),
pm_interpreter(p, rchild(p, n, 0)));
default: /* <exponentiation_expr> */
return pm_interpreter(p, rchild(p, n, 0));
}
case NT_EXPONENTIATION_EXPR:
/* <num> <exponentiation_expr_rest> */
return pow(pm_extract_num(rchild(p, n, 0)),
pm_interpreter(p, rchild(p, n, 1)));
case NT_EXPONENTIATION_EXPR_REST:
switch (ralt_idx(n)) {
case 0: /* "^" <num> <exponentiation_expr_rest> */
return pow(pm_extract_num(rchild(p, n, 1)),
pm_interpreter(p, rchild(p, n, 2)));
default: /* E */
return 1;
}
}
return 0; // GCOVR_EXCL_LINE: Unreachable
}

We free the seminfo field when extracting numbers from token nodes. After a successful match, the caller is responsible for destroying token seminfo, as described in Memory Ownership Semantics.

double pm_extract_num(struct rdesc_node *num)
{
char *seminfo;
/* seminfo stores a pointer to number, so it is a char **.
*
* Aside: We need to use memcpy to retrieve char * from char ** to not
* break strict-aliasing rule. */
memcpy(&seminfo, rseminfo(num), sizeof(char *));
double seminfo_num = atof(seminfo);
free(seminfo);
return seminfo_num;
}

Pipe operators are semantically left-associative: a | f | g means (a | f) | g. However, rrr() generates right-recursive rules, so the matched <pipe_expr> tree is right-associative by structure. Before evaluating, we call rdesc_flip_left() to rotate the subtree in-place, converting it to left-associative order. Evaluation then proceeds left-to-right:

/* Called for <pipe_expr>. */
double pm_interpret_pipe(struct rdesc *p,
struct rdesc_node *pipe,
double lhs /* value at left-hand side of pipe */)
{
switch (ralt_idx(pipe)) {
case 0: /* <pipe_expr> "|" <function_call> */
return pm_interpret_function(p, rchild(p, pipe, 2),
pm_interpret_pipe(p, rchild(p, pipe, 0), lhs));
default: /* <function_call> */
return pm_interpret_function(p, rchild(p, pipe, 0), lhs);
}
}

Important: After the flip, node IDs and alternative indices reflect the new tree structure. The function converts the structure represented by:

<pipe_expr> ::= <function_call> <pipe_expr_rest>
<pipe_expr_rest> ::= "|" <function_call> <pipe_expr_rest>
/ E

into the left-recursive equivalent:

<pipe_expr> ::= <pipe_expr> "|" <function_call>
/ <function_call>

Now the list starts and continues with alternative 0 (<pipe_expr> "|" <function_call>) and ends with alternative 1 (<function_call>). See rdesc_flip_left() for details.

That's it! As mentioned earlier, the sample code for this document is located in the docs/quick-start/src/ folder. For brevity, we did not include the implementation for evaluation of the <function_call> expression, but you can find it in that folder.

Exercise: Visualizing rdesc_flip_left
To deepen your understanding of how rdesc_flip_left restructures the tree, try the following:
  1. Define a minimal grammar using rrr() for a list, wrapped in a start production that consumes a terminating token (e.g. ;). The terminating token is required as the _REST nonterminal matches epsilon only when the next token does not match the list separator, so the parser reaches RDESC_READY only after consuming the end symbol.
  2. Parse an input string such as a | b | c ;.
  3. Dump the CST with rdesc_dump_cst() before calling rdesc_flip_left().
  4. Call rdesc_flip_left() and dump the CST again.
  5. Compare the two Graphviz outputs. Notice how node IDs and alternative indices change after the flip.

For more complete examples, see Examples.

One Last Step

You should call the destroy functions for both the grammar and the parser instances. The parser's destructor calls token_destroyer for the tokens the parser owns:

rdesc_destroy(&parser);
rdesc_grammar_destroy(&grammar);