|
librdesc
|
Let's take a quick look at how the functions in rdesc's public API affect its state:
It is fair to say that this FSM describes the inner workings of the parser reasonably well when the user does not provide invalid input and the grammar being matched is simple. However, in fact rdesc is a "pushdown automaton," and not a simple one. Multiple stacks affect its state transmission!
rdesc_resume()). The CST stack, on the other hand, is the structure containing the matched parse tree. It contains nonterminal and token nodes, which is the structure we access via cst_macros.h.This generally represents the "ready" state for starting a parse. There are also two variants of readiness state: clean and dirty. The clean variant indicates that the backtracking stack is empty, while the dirty variant indicates the opposite. Both the Ready and Running states have those two variants.
There are multiple ways to get to Ready state:
rdesc_init() / rdesc_reset(): These are the core methods that set the parser to Ready state. A newly initialized or reset parser is always in clean Ready state.rdesc_pump(): When the match finishes (succeeds or fails), the parser returns to the ready state. When returning due to a RDESC_READY condition, the CST is ready to fetch using rdesc_root(), whereas in the case of a DESC_NOMATCH, this does not occur, and NULL is returned.The transition from Ready state to the Running state can be initiated using rdesc_start(). If ready_start() returns a non-zero result, this indicates a memory allocation error, i.e. startup failure. You can try starting again.
In Running state, the parser engine is ready to consume tokens, with one small caveat: If the parser started from dirty Ready state the caller MUST call rdesc_resume() before pushing any tokens.
As described in Ready, the dirty state indicates that there are tokens on the backtracking stack – that is, tokens that have been previously pushed but not yet consumed. The older tokens must be consumed before a new token can be fed. The rdesc_resume() function enables the match continuation from the backtracking stack; if the stack is empty, it is a no-op.
rdesc_pump() and rdesc_resume() perform exactly the same function. The only difference between them is that the former takes in a new token as an argument, while the latter continues from the tokens with the backtracking stack.
If rdesc encounters a memory allocation error while attempting to push something onto the backtracking or CST stack, it enters this state. As long as it remains in this state, rdesc can only use the token that caused the allocation failure during parsing.
The caller can continue attempting to match until the rdesc_resume() returns a result other than RDESC_ENOMEM, or it can abort the parse by destroying all tokens using rdesc_reset() or rdesc_destroy().
To track which state the parser is in, it's usually sufficient to check the return values of the functions. If pump/resume returns ...
rdesc_root().rdesc_reset() and start a clean parse.In cases where RDESC_READY is returned, it is necessary to find out whether the state is dirty or clean.
Most of you probably will not use more than one start symbol, you'll likely get into to a clean Ready state after the RDESC_READY. It is only possible to transition to the dirty Ready state with RDESC_READY when initiating a parse while in the dirty Ready state.
Consider the case, you have the grammar:
First, you set the start symbol to A and begin parsing. Your lexer provided the following token sequence: α and γ. Eventually, the parsing process fails at token γ; and all tokens are pushed onto the backtracking stack, you return to the dirty Ready state. Then you decide to set the start symbol to B and start parsing again, resulting in dirty Ready state. While in the dirty Running state, your first call will be rdesc_resume(). Aha! Consuming the first token from the backtracking stack completed the parsing of B. The resume operation returned RDESC_READY, but you entered the dirty Ready state, not the clean one.