# Tokenising for performance
rdp-gen generates a scannerless parser — one that reads the source string
character by character, with no separate tokenisation step. That design is correct,
minimal, and in the early versions was the only option. This document traces the
experimental path that started with that baseline, asked whether a pre-tokenisation
step could do better, and ended with a concrete answer and a new base class —
TokenRDParser — that captures the result.
## How the scannerless parser recognises tokens
Every terminal in your grammar becomes a direct byte comparison in the generated code. A rule like
```ebnf
wsp = ' ' | '\t' | '\n' | '\r';
```
compiles into four separate attempts: try ' ', if that fails restore position
and try '\t', and so on. For four alternatives that's cheap. But consider a
rule that accepts any printable character:
```ebnf
TextChar = ' ' | '!' | '"' | '#' | '$' | ... (* 90+ alternatives *);
```
At every position in the input the parser tries each alternative in turn, backing
up on failure. With 90 alternatives, that's up to 90 byte comparisons per character
just to classify a single TextChar — and rules like this appear repeatedly in
grammars that handle free-form text, markup, configuration files, or anything
with large character classes.
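To make the cost concrete, here is a minimal sketch of the *shape* of code a scannerless generator emits for an alternation rule. The class and method names are illustrative, not rdp-gen's actual API: each alternative is one literal attempt, and a failed attempt leaves the position untouched so the next alternative can try.

```typescript
// Hypothetical sketch of scannerless alternation: each terminal is a direct
// string comparison, and failure restores the position for backtracking.
class ScannerlessSketch {
  pos = 0;
  constructor(private src: string) {}

  // Try one literal terminal at the current position; advance only on success.
  private lit(s: string): boolean {
    if (this.src.startsWith(s, this.pos)) {
      this.pos += s.length;
      return true;
    }
    return false;
  }

  // wsp = ' ' | '\t' | '\n' | '\r';  — four attempts in sequence.
  // A 90-alternative rule is this same chain, 90 comparisons long.
  wsp(): boolean {
    return this.lit(' ') || this.lit('\t') || this.lit('\n') || this.lit('\r');
  }
}
```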
For most grammars this is fast enough. For grammars with many large character-class rules it becomes the dominant cost.
## What would a tokeniser do differently?
A tokeniser makes a single forward pass over the input, using the character's position in the ASCII table rather than a list of alternatives. Classifying a character as "printable" becomes one range check instead of 90 comparisons. The key observation is that tokenisation has two distinct sub-problems that are often conflated:
- Span detection — where does the current token start and end? (WORD, NUMBER, STRING, PUNCT)
- Token typing — what grammar-specific type does this span have? (`NAME`, `LIT`, `EQ`, `SEMI`, …)
Separating these gives three alternative strategies beyond the scannerless baseline.
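The range-check idea is easy to see in code. The sketch below (assuming ASCII input; function names are illustrative) classifies "printable" with one comparison pair instead of ~90 alternatives, and extends that into single-pass span detection:

```typescript
// One range check replaces ~90 alternative comparisons: a character is
// printable iff its code is in [0x20, 0x7E]. Sketch, assuming ASCII input.
function isPrintable(code: number): boolean {
  return code >= 0x20 && code <= 0x7e;
}

// Span detection: extend a run of printable characters in one forward pass.
// Returns the exclusive end index of the span starting at `start`.
function printableSpan(src: string, start: number): number {
  let i = start;
  while (i < src.length && isPrintable(src.charCodeAt(i))) i++;
  return i;
}
```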
## The four approaches
| Approach | How it works |
|---|---|
| Scannerless | Characters → AST directly; one byte comparison per terminal alternative |
| Hand lexer | Single-pass char-code loop that produces typed tokens; tightest possible code |
| Regex lexer | Alternating regex compiled to optimised native code by the JS engine; simple to write |
| Span + classify | Two passes: spanTokenize finds boundaries (WORD/NUMBER/STRING/PUNCT), then classify applies grammar-specific typing |
## The investigation — a controlled experiment
To isolate the effect cleanly we needed two inputs:
- ebnf-meta — the real-world EBNF grammar used internally by rdp-gen itself (3 KB, 20 rules, a mix of structural and character-class rules)
- stress — a deliberately pathological grammar designed to maximise the cost of the scannerless approach

The stress grammar was generated by benchmarks/gen-stress.mjs. It has 16
The stress grammar was generated by benchmarks/gen-stress.mjs. It has 16
character-class rules each with 80–94 single-character alternatives — things like:
```ebnf
HeadChar = ' ' | '!' | '"' | '#' | ... (* 93 alternatives, every printable char except '\n' *);
TextChar = ' ' | '!' | '#' | ... (* 89 alternatives, excluding '\n', '*', '_', '`', '[', '!' *);
(* ... 14 more rules of the same form *)
```
If the speedup ratio stayed consistent across both inputs it would confirm that character-class rule size — not total grammar size, rule count, or nesting depth — is the real discriminator. If it varied significantly, there would be other factors at play.
All four approaches were validated to produce identical parse results (same structural checksum: rule count, terminal count, non-terminal count, total node count) before timing began.
## Benchmark results
The speedup ratios are nearly identical across both inputs. The stress grammar (13 KB, 84 rules, 1,633 terminals) produces almost exactly the same relative results as the real ebnf-meta grammar. This confirms the hypothesis: the bottleneck is character-class rule structure, and it scales with the number and size of those rules, not with overall grammar size.
A few things stand out:
- The hand lexer is the fastest at 4.6–5×, because the tokeniser and parser are written together with perfect knowledge of the grammar's token structure.
- The span + classify pipeline reaches 3.4–4×, close to the hand-written ceiling.
- The regex lexer is slower than the scannerless baseline at 0.78–0.81×. The regex engine's startup cost and the pre-pass needed to handle nested comments outweigh the regex's character-class advantage at these input sizes.
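For reference, the benchmarked regex lexer is not reproduced here, but a common shape for this style in JavaScript is a single alternating pattern with named groups and the sticky (`y`) flag, advanced token by token. The token names below are illustrative, not rdp-gen's:

```typescript
// One common shape for a regex lexer in JS: a single alternating sticky
// pattern, advanced match by match. Whitespace is matched but discarded.
type Tok = { type: string; text: string };

const LEX =
  /(?<WS>\s+)|(?<WORD>[A-Za-z_]\w*)|(?<NUMBER>\d+)|(?<PUNCT>[^\sA-Za-z0-9_])/y;

function regexLex(src: string): Tok[] {
  const out: Tok[] = [];
  LEX.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = LEX.exec(src)) !== null) {
    const groups = m.groups!;
    for (const type of Object.keys(groups)) {
      // Exactly one named group matches per token; skip discarded whitespace.
      if (groups[type] !== undefined && type !== 'WS') {
        out.push({ type, text: groups[type] });
      }
    }
  }
  return out;
}
```

The sticky flag forces each match to start exactly at `lastIndex`, so the loop stops at the first unmatchable position rather than silently skipping input.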
## Mechanisability — which approaches can rdp-gen generate?
This is where the strategies diverge beyond raw performance numbers.
Scannerless is fully automatic. rdp-gen generates the complete parser from
the grammar with no additional input. This is the current default.
Hand lexer gives the best performance but cannot be mechanically generated. A hand lexer requires the author to annotate which rules are lexical (defining character sets) versus syntactic (defining structure). That distinction is not present in standard EBNF or ABNF — it must be supplied separately. Without it, a generator cannot know which rules to collapse into token types and which to keep as recursive-descent methods.
Regex lexer can be partially generated (the regex can be derived from the grammar's terminal set) but has two practical problems: nested comment styles require a pre-pass that the regex cannot handle, and it is consistently slower than the scannerless baseline on these inputs.
Span + classify is fully mechanisable:
- `spanTokenize` is effectively grammar-agnostic. It needs only three pieces of information — which characters start an identifier, which continue one, and what the comment and string delimiters are. These are parameterised via `SPAN_OPTS` and can be inferred from the grammar's terminal set with high confidence for the common cases.
- `classify` is a purely mechanical mapping from the grammar's terminal set. Every single-character punctuation terminal becomes a `case` in a `switch`. Every multi-character all-letter terminal becomes an entry in `KEYWORDS`. No grammar annotation is needed.
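A sketch of what that mechanical mapping looks like. The enum members, keyword set, and span-type strings here are illustrative stand-ins, not rdp-gen's actual generated output:

```typescript
// Illustrative span → token-type mapping: punct terminals become switch
// cases, keyword terminals become KEYWORDS entries, everything else falls
// through to a generic type. Names are hypothetical.
enum TT { NAME, LIT, EQ, SEMI, RULE_KW }

const KEYWORDS: Record<string, TT> = { rule: TT.RULE_KW }; // hypothetical keyword

// spanType comes from the grammar-agnostic span detector (WORD/PUNCT/STRING…).
function classify(spanType: string, text: string): TT {
  if (spanType === 'STRING') return TT.LIT;
  if (spanType === 'WORD') return KEYWORDS[text] ?? TT.NAME;
  switch (text) { // every single-char punct terminal → one case
    case '=': return TT.EQ;
    case ';': return TT.SEMI;
    default: throw new Error(`unexpected token ${JSON.stringify(text)}`);
  }
}
```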
## The tradeoff
| Approach | Speed | Mechanisable | Manual work |
|---|---|---|---|
| Scannerless (default) | 1× | ✓ fully | none |
| Hand lexer | ~5× | ✗ | annotate lexical vs syntactic rules; write tokeniser by hand |
| Regex lexer | 0.8× | partial | handle nested comments manually |
| Span + classify | ~4× | ✓ fully | fill in parse method stubs |
The span + classify pipeline captures most of the available speedup while
remaining fully derivable from the grammar. That makes it the right choice for
a generated scaffold: it gives users a concrete, working starting point that
they complete by filling in the parse logic — exactly the same model as every
other rdp-gen scaffold.
The hand lexer remains the ceiling for maximum-performance scenarios, but it requires authoring effort proportional to grammar size and is outside the scope of what a generator can produce without additional annotations.
## The two models
The scannerless parser reads characters directly and produces a typed ParseTree in one pass. The span-lexer pipeline separates boundary detection (grammar-agnostic) from token typing (generated) from recursive descent (your stubs) — three composable stages where each has a clear, narrow responsibility.
## Generating the scaffold
```shell
rdp-gen grammar.ebnf --parser-name MyParser --lexer span --output src/MyLexer.ts
```

The generated file contains:
| Component | Grammar-specific? | Mechanisable? | Your work |
|---|---|---|---|
| `TT` — token-type enum | Yes | ✓ fully — derived from the grammar's terminal set | None |
| `spanTokenize` — span detector | No — grammar-agnostic | — | Adjust `SPAN_OPTS` if your grammar uses non-default delimiters or comment style |
| `classify` — span → token mapper | Yes | ✓ fully — every punct terminal becomes a `switch` case; every keyword becomes a `KEYWORDS` entry | None |
| `{Base}TokenParser` — recursive descent (extends `TokenRDParser`) | Yes | ✓ partially — one stub per production rule | Fill in each `#parse<Rule>()` body |
Because `spanTokenize` is grammar-agnostic, the same boundary-detection logic would work unchanged for any grammar — only `SPAN_OPTS` needs adjusting. `classify` and `{Base}TokenParser` are fully derived from the grammar and are where the grammar-specific work lives. Token navigation (`peekToken()`, `eatToken()`, `matchToken()`) is inherited from `TokenRDParser`.
Unlike the generated parser, the scaffold is not regenerated — it is a
one-time starting point. Fill in the `#parse<Rule>()` methods with the logic
for each production rule and change the static `parse()` return type from
`unknown` to your concrete result type.
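To show what a filled-in stub looks like, here is a sketch. Since `TokenRDParser`'s real implementation is not reproduced in this document, a minimal stand-in base class provides the three navigation methods named above; the grammar rule (`assign = NAME '=' NAME ';'`) and method names are hypothetical, and the method is public rather than a `#parse<Rule>()` private for testability:

```typescript
// Minimal stand-in for a TokenRDParser-like base: a cursor over a token
// array plus the navigation methods the scaffold inherits.
type Token = { type: string; text: string };

class TokenRDParserSketch {
  protected i = 0;
  constructor(protected tokens: Token[]) {}
  protected peekToken(): Token | undefined { return this.tokens[this.i]; }
  protected eatToken(): Token { return this.tokens[this.i++]; }
  protected matchToken(type: string): Token {
    const t = this.peekToken();
    if (!t || t.type !== type) throw new Error(`expected ${type}`);
    return this.eatToken();
  }
}

class MyTokenParser extends TokenRDParserSketch {
  // A filled-in stub for the hypothetical rule: assign = NAME '=' NAME ';'
  parseAssign(): { lhs: string; rhs: string } {
    const lhs = this.matchToken('NAME').text;
    this.matchToken('EQ');
    const rhs = this.matchToken('NAME').text;
    this.matchToken('SEMI');
    return { lhs, rhs };
  }
}
```

The parse body reads as a direct transcription of the production rule — that one-to-one shape is the point of the scaffold model.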
```shell
# Compare the scannerless generated parser with the span-lexer scaffold:
rdp-gen grammar.ebnf --parser-name MyParser --output src/MyParser.ts
rdp-gen grammar.ebnf --parser-name MyParser --lexer span --output src/MyLexer.ts
```

For grammars where character-class rules are not a significant factor, the scannerless generated parser is the simpler choice and plenty fast. Reach for the span-lexer scaffold when profiling shows the parser is a bottleneck and your grammar has rules matching large sets of individual characters.