librdesc
Writing a parser in C is a fun, interactive process, but it can also be exhausting. How do you ensure memory safety? How do you keep parsing transactional and recover from errors? How do you test it?
librdesc is designed to abstract away resource management and provide an easy-to-use API. This document helps you get started with librdesc by building a simple Pipe-Math calculator.
The program used throughout this document is provided in docs/quick-start/src/ and can be built with make on Unix-like systems.
While it is tempting to insert the parsing code directly into the rest of your logic, doing so usually results in unmaintainable code. Instead, the grammar declaration is typically separated into its own header, so your project could have a src/grammar.h file containing it.
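A sketch of what that header might contain is shown below; the constant names, dimensions, and the include line are illustrative assumptions for this document, not librdesc's verbatim API:

```c
/* Hypothetical src/grammar.h sketch; names and dimensions are illustrative. */
#ifndef GRAMMAR_H
#define GRAMMAR_H

#include "rdesc.h"      /* assumed librdesc header name */

#define PM_MAX_ALTS 3   /* most alternatives in any production */
#define PM_MAX_SYMS 4   /* most symbols in any alternative */

extern const struct rdesc_grammar_symbol
    pm_grammar[NT_COUNT][PM_MAX_ALTS + 1][PM_MAX_SYMS + 1];

#endif /* GRAMMAR_H */
```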
We will discuss later what the magic constants in the file are and why there is a +1 in the array dimensions.
To demonstrate librdesc's power, we will implement a grammar that handles both right-recursion and left-recursion. Our syntax has two operators:
- `^`: Exponentiation, right-associative: `2 ^ 3 ^ 1` is evaluated as `2 ^ (3 ^ 1)`, which is 8.
- `|`: Pipe, left-associative: `2 ^ 3 | add(10) | square` is equal to `(8 | add(10)) | square`.

Statements in our grammar start with a number, which can be expressed as an exponentiation. Then, for each pipe, from left to right, an operation is applied to the number.
Let's first define our grammar formally, as a context free grammar:
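One way to write it is shown below; treat the exact productions as a sketch, with the nonterminal names anticipating those used later in this document, and NUMBER and IDENT standing for lexer tokens:

```
<statement>     ::= <expression> ";"
<expression>    ::= <expression> "|" <function_call>
                  | <expression> "^" <expression>
                  | NUMBER
<function_call> ::= IDENT "(" <arg_list> ")"
                  | IDENT
<arg_list>      ::= <expression> "," <arg_list>
                  | <expression>
```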
This ambiguous CFG does not dictate associativity, but let's assume operators have the associativity we introduced. Then the Concrete Syntax Tree (CST) for 2 ^ 3 ^ 1 | add(10, 2 ^ 1) | square | log; would be (simplified nodes shown on edges):
Before continuing to read this documentation, you may prefer to review the librdesc Glossary to ensure we are on the same page regarding terminology.
librdesc can be fed character by character, or it can operate on tokens produced by a lexer. It is recommended to perform the simpler parsing in the lexer (e.g. collecting numbers and identifiers) and then use librdesc to parse the resulting tokens.
Let's define tokens for punctuation, numbers, and identifiers. The lexer will collect numbers and identifiers and return each as a single token, so the grammar does not need digit-list or alphanumeric-character-list rules. Continuing the grammar definitions in src/grammar.h:
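The original listing is not reproduced here; a hypothetical continuation of src/grammar.h could look like the following, where the exact token and nonterminal names are assumptions for illustration:

```c
/* Hypothetical continuation of src/grammar.h; names are illustrative. */
enum pm_token {
    TK_NOTOKEN = 0,  /* index 0 is reserved by exblex for EOF/NOTOKEN */
    TK_NUM,          /* a whole number, collected by the lexer */
    TK_IDENT,        /* a whole identifier, collected by the lexer */
    TK_CARET,        /* ^ */
    TK_PIPE,         /* | */
    TK_LPAREN,       /* ( */
    TK_RPAREN,       /* ) */
    TK_COMMA,        /* , */
    TK_SEMICOLON,    /* ; */
    TK_COUNT
};

enum pm_nonterminal {
    NT_STATEMENT = 0,
    NT_PIPE_EXPR,
    NT_PIPE_EXPR_REST,  /* _REST variant for the recursive pipe list */
    NT_EXP_EXPR,
    NT_FUNCTION_CALL,
    NT_ARG_LIST,
    NT_ARG_LIST_REST,   /* _REST variant for the recursive argument list */
    NT_COUNT
};
```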
We define recursive nonterminals with additional _REST postfixed versions. Later on, we will discuss why this is required.
The symbol table is a 3D array of struct rdesc_grammar_symbol. The three dimensions are: the production (one row per nonterminal), the alternative (one row per alternative within that production), and the symbol (one cell per symbol in the alternative body). librdesc provides rule_macros.h to define this table without manually inserting sentinels.
Here is the list of core macros:
| Macro | Description |
|---|---|
| `r(...)` | Start a production; alternatives separated by `alt` |
| `alt` | Separator between alternatives within a production |
| `TK(X)` | Reference to terminal token `TK_X` |
| `NT(X)` | Reference to nonterminal `NT_X` |
The `r(...)` macro adds a sentinel to the end of the alternative list, and `alt` adds a sentinel to the end of each alternative. That is why we added +1 to the symbol table dimensions.
Using these macros, the productions for the statement and expression nonterminals can be defined in src/grammar.c as:
Tokens and nonterminals are automatically prefixed by NT(X)/TK(X) macros. That is why we used those prefixes in enum definition but not in grammar table. See Name Mapping for Identifiers for details.
The grammar table follows ordered-choice semantics: the first matching alternative always wins. In the ambiguous grammar example above, if a pipe token is present, the first alternative matches; otherwise, the second one does.
We will use the / (ordered-choice) operator with BNF from now on to emphasize ordered choice, so the grammar we have defined so far can be rewritten as:
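As a sketch of that rewrite (first match wins, so the pipe alternative comes first):

```
<statement>  ::= <expression> ";"
<expression> ::= <expression> "|" <function_call>
               / <expression> "^" <expression>
               / NUMBER
```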
librdesc provides helper macros to standardize grammar definitions:
| Macro | Description |
|---|---|
| `ropt(...)` | Expands into two alternatives: an empty alternative (ε) and one with `...` |
| `rrr(...)` | Define a right-recursive list rule |
| `EPSILON` | Empty alternative (ε) |
See the documentation for rrr(...) before continuing.
The code above generates:
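The generated rules are not shown verbatim here; based on the `_REST` naming convention described earlier, they plausibly look like:

```
<pipe_expr>      ::= <exp_expr> <pipe_expr_rest>
<pipe_expr_rest> ::= "|" <function_call> <pipe_expr_rest>
                   / EPSILON
```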
And the rest of the grammar is:
Which generates:
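As a plausible sketch rather than the verbatim output, the remaining rules would be:

```
<exp_expr>      ::= NUMBER "^" <exp_expr>
                  / NUMBER
<function_call> ::= IDENT "(" <arg_list> ")"
                  / IDENT
<arg_list>      ::= <pipe_expr> <arg_list_rest>
<arg_list_rest> ::= "," <pipe_expr> <arg_list_rest>
                  / EPSILON
```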
The complete grammar is available at Pipe-Math Grammar Definition.
With the symbol table defined, we can begin implementing our program, starting by initializing the grammar in src/main.c:
There are two grammar initialization functions: rdesc_grammar_init() and its statically checked version, rdesc_grammar_init_checked(). The latter checks that the provided dimensions are valid and raises an error if they are not, but it cannot be used with dynamically allocated grammar tables.
We can dump our grammar rules in BNF format via the rdesc_dump_bnf() utility. Let's add an option for printing out grammar rules, but first we need to define names of tokens and nonterminals:
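The name tables themselves are not reproduced here; they could look like the following, where the array names and exact spellings are assumptions for illustration:

```c
/* Hypothetical name tables for rdesc_dump_bnf(); names are illustrative.
   Order must match the token and nonterminal enums. */
static const char *pm_token_names[] = {
    "NOTOKEN", "NUM", "IDENT", "^", "|", "(", ")", ",", ";"
};
static const char *pm_nonterminal_names[] = {
    "statement", "pipe_expr", "pipe_expr_rest",
    "exp_expr", "function_call", "arg_list", "arg_list_rest"
};
```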
Then, add the option to the main function:
Now, you can dump the grammar using ./pm dump_bnf.
Lexical analysis is out of scope for this document. We will use exblex, a minimal single-header lexer included with librdesc's examples, to keep the example self-contained.
exblex tokenizes input in three ways:

- Punctuation: single characters matched literally as their own tokens.
- `w`: sequences of alphanumeric characters and underscores, matched as TK_IDENT.
- `d`: sequences of numeric digits, matched as TK_NUM.

Token IDs are determined by position in the token table. Index 0 is reserved by exblex for EOF/NOTOKEN, so TK_NUM must start from 1.
Punctuation tokens only return a token ID in exblex, but identifiers and numbers also return seminfo containing their string value.
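exblex itself is not reproduced here, but the idea can be illustrated with a self-contained sketch; the function and token names below are illustrative, not exblex's API:

```c
#include <ctype.h>
#include <stddef.h>

/* Toy token IDs: index 0 reserved for EOF/NOTOKEN, as in exblex. */
enum { TOK_EOF = 0, TOK_NUM, TOK_IDENT, TOK_PUNCT };

/* Read one token from *src, advance the cursor, copy its text into buf.
   Mirrors exblex's three ways of tokenizing: digit runs ('d'), word
   runs ('w'), and literal punctuation. Illustrative only. */
static int next_token(const char **src, char *buf, size_t bufsz) {
    const char *s = *src;
    size_t n = 0;
    while (*s == ' ') s++;                        /* skip spaces */
    if (*s == '\0') { *src = s; return TOK_EOF; }
    if (isdigit((unsigned char)*s)) {             /* 'd': digits -> one number token */
        while (isdigit((unsigned char)*s) && n + 1 < bufsz) buf[n++] = *s++;
        buf[n] = '\0'; *src = s; return TOK_NUM;
    }
    if (isalpha((unsigned char)*s) || *s == '_') { /* 'w': word -> one identifier token */
        while ((isalnum((unsigned char)*s) || *s == '_') && n + 1 < bufsz)
            buf[n++] = *s++;
        buf[n] = '\0'; *src = s; return TOK_IDENT;
    }
    buf[0] = *s++; buf[1] = '\0';                 /* punctuation: one literal char */
    *src = s; return TOK_PUNCT;
}
```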
With the grammar and lexer ready, we can now initialize our parser:
The parser requires a function pointer to a token destroyer. See Memory Ownership Semantics for details.
This snippet reads a line from stdin and feeds tokens to the parser, handling EOF and OOM:
For debugging, let's add an option for CST dumping:
The CST dump function requires a function pointer for printing out nodes. We will discuss the attributes of nonterminal and token nodes shortly; in the printer function, we use the type, id, and seminfo fields:
Now, you can dump the CST as a Graphviz dot file using ./pm dump_cst. I recommend examining the dumped CST file using an online Graphviz visualizer.
With the grammar initialized, the lexer providing tokens, and the parser matching a statement, we can now evaluate the resulting CST. We traverse the tree using the following macros from cst_macros.h:
| Macro | Used for | Description |
|---|---|---|
| `rparent()` | Both | Pointer to the parent of the node |
| `rtype()` | Both | Node type: `RDESC_TOKEN` or `RDESC_NONTERMINAL` |
| `rid()` | Both | Token or nonterminal ID |
| `rseminfo()` | Tokens | Additional seminfo of the token |
| `ralt_idx()` | Nonterminals | Index of the matched alternative |
| `rchild_count()` | Nonterminals | Total number of child nodes |
| `rchild()` | Nonterminals | Get a child by its index |
Using the macros above, we can write a recursive function evaluating Pipe-Math statements, called from the program entry:
The following function evaluates each CST node by dispatching on nonterminal ID and alternative index, in src/interpreter.c:
We free the seminfo field when extracting numbers from token nodes. After a successful match, the caller is responsible for destroying token seminfo, as described in Memory Ownership Semantics.
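The interpreter's dispatch pattern can be illustrated with a self-contained sketch. The node layout below is a toy stand-in, not librdesc's CST, and only the exponent rule is shown; comments point at the corresponding CST macros:

```c
#include <stdlib.h>

/* Toy stand-in for a CST node; illustrative, not librdesc's real layout. */
struct node {
    int is_token;           /* cf. rtype() */
    int alt_idx;            /* cf. ralt_idx() */
    double value;           /* number-token seminfo, cf. rseminfo() */
    struct node *child[3];  /* cf. rchild() */
};

/* Integer exponentiation, to avoid a libm dependency in this sketch. */
static double ipow(double base, double e) {
    double r = 1;
    for (int i = 0; i < (int)e; i++) r *= base;
    return r;
}

/* Evaluate an <exp_expr> subtree by dispatching on the alternative index,
   the same pattern the interpreter uses: alternative 0 is the
   right-recursive NUM "^" <exp_expr> form, alternative 1 is a bare NUM. */
static double eval_exp(const struct node *n) {
    if (n->is_token) return n->value;
    if (n->alt_idx == 0)
        return ipow(eval_exp(n->child[0]), eval_exp(n->child[2]));
    return eval_exp(n->child[0]);
}

/* Helpers to build toy trees by hand. */
static struct node *num(double v) {
    struct node *n = calloc(1, sizeof *n);
    n->is_token = 1; n->value = v;
    return n;
}
static struct node *caret(struct node *base, struct node *rest) {
    struct node *n = calloc(1, sizeof *n);
    n->alt_idx = 0;
    n->child[0] = base;
    n->child[1] = num('^');  /* placeholder for the "^" token node */
    n->child[2] = rest;
    return n;
}
```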
Pipe operators are semantically left-associative: a | f | g means (a | f) | g. However, rrr() generates right-recursive rules, so the matched <pipe_expr> tree is right-associative by structure. Before evaluating, we call rdesc_flip_left() to rotate the subtree in-place, converting it to left-associative order. Evaluation then proceeds left-to-right:
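rdesc_flip_left() itself operates on librdesc's CST; its effect can be sketched with a self-contained rotation on toy binary nodes (names and layout are illustrative):

```c
#include <stdlib.h>

/* Toy binary node: a leaf holds a label, an internal node is one pipe
   application. Illustrative only; librdesc's CST layout differs. */
struct t {
    struct t *l, *r;
    char label;  /* set on leaves */
};

static int is_leaf(const struct t *n) { return n->l == NULL && n->r == NULL; }

static struct t *leaf(char label) {
    struct t *n = calloc(1, sizeof *n);
    n->label = label;
    return n;
}
static struct t *pipe_node(struct t *l, struct t *r) {
    struct t *n = calloc(1, sizeof *n);
    n->l = l; n->r = r;
    return n;
}

/* Rotate a right-leaning spine, P(a, P(b, c)), into the left-leaning
   P(P(a, b), c), preserving the left-to-right order of the elements;
   this is the kind of in-place restructuring rdesc_flip_left() performs
   on a right-recursive list. */
static struct t *flip_left(struct t *n) {
    while (!is_leaf(n->r)) {
        struct t *r = n->r;  /* r = P(b, rest)       */
        n->r = r->l;         /* n = P(a, b)          */
        r->l = n;            /* r = P(P(a, b), rest) */
        n = r;
    }
    return n;
}
```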
Important: After the flip, node IDs and alternative indices reflect the new tree structure. The function converts the structure represented by:
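In sketch form (a reconstruction, not the verbatim rules), the right-recursive shape is:

```
<pipe_expr>      ::= <exp_expr> <pipe_expr_rest>
<pipe_expr_rest> ::= "|" <function_call> <pipe_expr_rest>
                   / EPSILON
```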
into the left-recursive equivalent:
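A sketch of the left-recursive form, with the recursive alternative first and the base case second:

```
<pipe_expr> ::= <pipe_expr> "|" <function_call>
              / <exp_expr>
```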
Now the list starts and continues with <pipe_expr> in alternative 0, and ends with <pipe_expr> in alternative 1. See rdesc_flip_left() for details.
That's it! As mentioned earlier, the sample code for this document is located in the docs/quick-start/src/ folder. For brevity, we did not include the implementation for evaluation of the <function_call> expression, but you can find it in that folder.
To see how rdesc_flip_left() restructures the tree, try the following:

1. Define a grammar that uses `rrr()` for a list, wrapped in a start production that consumes a terminating token (e.g. `;`). The terminating token is required because the `_REST` nonterminal matches epsilon only when the next token does not match the list separator, so the parser reaches RDESC_READY only after consuming the end symbol.
2. Parse an input such as `a | b | c ;`.
3. Dump the CST with rdesc_dump_cst() before calling rdesc_flip_left().
4. Call rdesc_flip_left() and dump the CST again.

For more complete examples, see Examples.
You should call the destroy functions for both the grammar and the parser instances. The parser's destructor calls token_destroyer for tokens the parser owns:
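The cleanup code is not reproduced here; its shape would be something like the following sketch, where both destroy-function names are hypothetical placeholders, not confirmed librdesc API:

```c
/* Hypothetical cleanup sketch; function names are placeholders. */
rdesc_parser_destroy(&parser);   /* invokes token_destroyer on owned tokens */
rdesc_grammar_destroy(&grammar);
```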