Recursive descent parsing

A recursive descent parser is a top-down parser where each production rule in the grammar maps directly to a method in code. The parser scans input left-to-right, building the leftmost derivation, and uses a small fixed amount of lookahead to decide which production to apply at each step.

This produces parsers that are easy to read, test, and debug — the code structure mirrors the grammar structure exactly.

LL(1) grammars and backtracking

What LL(1) means: the parser scans left-to-right (first L), produces the leftmost derivation (second L), and needs only one byte of lookahead (the 1) to decide which production to apply at each step.

Left-to-right (first L): each byte is read exactly once, in order. The position pointer only ever moves forward:

input:   3   +   4   *   2
         ✓   ✓
                 ↑
                pos=2   peek() = '4'

Characters to the left of the pointer have been consumed and cannot be revisited. The one character at the pointer is the lookahead — it is inspected but not yet consumed.

Leftmost derivation (second L): the parser always resolves the leftmost unresolved part of the grammar first — it fully commits to a production before moving right. In a recursive descent parser the call stack at any moment is the leftmost derivation: the deepest method currently executing is the leftmost non-terminal still being matched:

call stack while parsing "3 + 4" at pos=0:
  parseExpr()         ← Expr is the leftmost unresolved non-terminal
    parseTerm()       ← Term is leftmost inside Expr
      parseFactor()   ← Factor is leftmost inside Term
        parseNumber() → reads '3', returns

Once parseNumber() returns, the stack unwinds to parseTerm, then parseExpr, which then advances right to the +. The parser never revisits the left.

Lookahead (the 1): that single byte is enough to decide which production to apply next — no reading ahead further, no backtracking if the choice turns out to be wrong:

LL(1) decision flowchart: peek one byte, if it matches alternative A parse A, if it matches alternative B parse B, otherwise throw RDParserException with no backtracking

Grammars that are LL(1): configuration files, data serialisation formats, expression languages with explicit precedence, network protocol frames, and grammars written specifically to be parsed top-down.

rdp-gen targets LL(1) grammars. It generates parsers that use a single byte of lookahead with no backtracking. Feeding it a non-LL(1) grammar produces a parser that silently returns incorrect results, not a helpful error. The one exception is left recursion — any rule that references itself as its first symbol — which rdp-gen detects at generation time and rejects.

Other grammars that are not LL(1):

  • Ambiguous grammars — where a single input string has more than one parse tree.
  • Grammars requiring more than one byte of lookahead — where two alternatives share a long common prefix.
  • Most general programming language grammars — C, C++, Java, and similar languages are not LL(1) for several compounding reasons (semantic disambiguation, multi-token lookahead, dangling-else ambiguity). These are handled by LR, LALR, or GLR parsers.

Left recursion can always be eliminated by rewriting the grammar to use iteration ({...} / X, {X}), which is what LL(1) grammars require.

The base class is more general. RDParser exposes restorePosition, which lets hand-written subclasses implement backtracking and parse grammars beyond LL(1). rdp-gen does not emit backtracking code — that is a hand-crafting concern.

Scannerless parsing

By default, rdp-gen generates a scannerless parser — one that operates directly on the raw input string with no separate tokenisation step. Every terminal in the grammar becomes a direct byte comparison in the generated code; there is no lexer phase, no token stream, and no split between "what is a token?" and "what does this token mean?".

This is the right default. It keeps the generated output simple, the pipeline short, and the grammar self-contained. For most grammars it is also fast enough.

The one case where scannerless parsing has a measurable cost is grammars with many large character-class rules — rules that enumerate 80+ individual characters as alternatives. In those cases a pre-tokenisation step can reduce the cost by 3–5×. See Tokenising for performance for the investigation that produced TokenRDParser and the --lexer span scaffold.