# Tokenising for performance
rdp-gen generates a scannerless parser — one that reads the source string
character by character, with no separate tokenisation step. That design is correct,
minimal, and in the early versions was the only option. This document traces the
experimental path that started with that baseline, asked whether a pre-tokenisation
step could do better, and ended with a concrete answer and a new base class —
TokenRDParser — that captures the result.
## How the scannerless parser recognises tokens
Every terminal in your grammar becomes a direct byte comparison in the generated code. A rule like
```ebnf
wsp = ' ' | '\t' | '\n' | '\r';
```
compiles into four separate attempts: try ' ', if that fails restore position
and try '\t', and so on. For four alternatives that's cheap. But consider a
rule that accepts any printable character:
```ebnf
TextChar = ' ' | '!' | '"' | '#' | '$' | ... (* 90+ alternatives *);
```
At every position in the input the parser tries each alternative in turn, backing
up on failure. With 90 alternatives, that's up to 90 byte comparisons per character
just to classify a single TextChar — and rules like this appear repeatedly in
grammars that handle free-form text, markup, configuration files, or anything
with large character classes.
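To make the cost concrete, here is a minimal sketch of the *shape* of code a scannerless generator emits for an alternation rule. The class and method names are illustrative, not rdp-gen's actual API: each alternative is one literal attempt, and a failed attempt leaves the position untouched so the next alternative can try.

```typescript
// Hypothetical sketch of scannerless alternation: each terminal is a direct
// string comparison, and failure restores the position for backtracking.
class ScannerlessSketch {
  pos = 0;
  constructor(private src: string) {}

  // Try one literal terminal at the current position; advance only on success.
  private lit(s: string): boolean {
    if (this.src.startsWith(s, this.pos)) {
      this.pos += s.length;
      return true;
    }
    return false;
  }

  // wsp = ' ' | '\t' | '\n' | '\r';  — four attempts in sequence.
  // A 90-alternative rule is this same chain, 90 comparisons long.
  wsp(): boolean {
    return this.lit(' ') || this.lit('\t') || this.lit('\n') || this.lit('\r');
  }
}
```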
For most grammars this is fast enough. For grammars with many large character-class rules it becomes the dominant cost.
## What would a tokeniser do differently?
A tokeniser makes a single forward pass over the input, using the character's position in the ASCII table rather than a list of alternatives. Classifying a character as "printable" becomes one range check instead of 90 comparisons. The key observation is that tokenisation has two distinct sub-problems that are often conflated:
- Span detection — where does the current token start and end? (WORD, NUMBER, STRING, PUNCT)
- Token typing — what grammar-specific type does this span have? (`NAME`, `LIT`, `EQ`, `SEMI`, …)
Separating these gives three alternative strategies beyond the scannerless baseline.
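The range-check idea is easy to see in code. The sketch below (assuming ASCII input; function names are illustrative) classifies "printable" with one comparison pair instead of ~90 alternatives, and extends that into single-pass span detection:

```typescript
// One range check replaces ~90 alternative comparisons: a character is
// printable iff its code is in [0x20, 0x7E]. Sketch, assuming ASCII input.
function isPrintable(code: number): boolean {
  return code >= 0x20 && code <= 0x7e;
}

// Span detection: extend a run of printable characters in one forward pass.
// Returns the exclusive end index of the span starting at `start`.
function printableSpan(src: string, start: number): number {
  let i = start;
  while (i < src.length && isPrintable(src.charCodeAt(i))) i++;
  return i;
}
```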
## The four approaches
| Approach | How it works |
|---|---|
| Scannerless | Characters → AST directly; one byte comparison per terminal alternative |
| Hand lexer | Single-pass char-code loop that produces typed tokens; tightest possible code |
| Regex lexer | Alternating regex compiled to optimised native code by the JS engine; simple to write |
| Span + classify | Two passes: spanTokenize finds boundaries (WORD/NUMBER/STRING/PUNCT), then classify applies grammar-specific typing |
## The investigation — a controlled experiment
To isolate the effect cleanly we needed two inputs:
- ebnf-meta — the real-world EBNF grammar used internally by rdp-gen itself (3 KB, 20 rules, a mix of structural and character-class rules)
- stress — a deliberately pathological grammar designed to maximise the cost of the scannerless approach

The stress grammar was generated by benchmarks/gen-stress.mjs. It has 16
The stress grammar was generated by benchmarks/gen-stress.mjs. It has 16
character-class rules each with 80–94 single-character alternatives — things like:
```ebnf
HeadChar = ' ' | '!' | '"' | '#' | ... (* 93 alternatives, every printable char except '\n' *);
TextChar = ' ' | '!' | '#' | ... (* 89 alternatives, excluding '\n', '*', '_', '`', '[', '!' *);
(* ... 14 more rules of the same form *)
```
If the speedup ratio stayed consistent across both inputs it would confirm that character-class rule size — not total grammar size, rule count, or nesting depth — is the real discriminator. If it varied significantly, there would be other factors at play.
All four approaches were validated to produce identical parse results (same structural checksum: rule count, terminal count, non-terminal count, total node count) before timing began.
## Benchmark results
The speedup ratios are nearly identical across both inputs. The stress grammar (13 KB, 84 rules, 1,633 terminals) produces almost exactly the same relative results as the real ebnf-meta grammar. This confirms the hypothesis: the bottleneck is character-class rule structure, and it scales with the number and size of those rules, not with overall grammar size.
A few things stand out:
- The hand lexer is the fastest at 4.6–5×, because the tokeniser and parser are written together with perfect knowledge of the grammar's token structure.
- The span + classify pipeline reaches 3.4–4×, close to the hand-written ceiling.
- The regex lexer is slower than the scannerless baseline at 0.78–0.81×. The regex engine's startup cost and the pre-pass needed to handle nested comments outweigh the regex's character-class advantage at these input sizes.
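For reference, the benchmarked regex lexer is not reproduced here, but a common shape for this style in JavaScript is a single alternating pattern with named groups and the sticky (`y`) flag, advanced token by token. The token names below are illustrative, not rdp-gen's:

```typescript
// One common shape for a regex lexer in JS: a single alternating sticky
// pattern, advanced match by match. Whitespace is matched but discarded.
type Tok = { type: string; text: string };

const LEX =
  /(?<WS>\s+)|(?<WORD>[A-Za-z_]\w*)|(?<NUMBER>\d+)|(?<PUNCT>[^\sA-Za-z0-9_])/y;

function regexLex(src: string): Tok[] {
  const out: Tok[] = [];
  LEX.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = LEX.exec(src)) !== null) {
    const groups = m.groups!;
    for (const type of Object.keys(groups)) {
      // Exactly one named group matches per token; skip discarded whitespace.
      if (groups[type] !== undefined && type !== 'WS') {
        out.push({ type, text: groups[type] });
      }
    }
  }
  return out;
}
```

The sticky flag forces each match to start exactly at `lastIndex`, so the loop stops at the first unmatchable position rather than silently skipping input.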
## Mechanisability — which approaches can rdp-gen generate?
This is where the strategies diverge beyond raw performance numbers.
Scannerless is fully automatic. rdp-gen generates the complete parser from
the grammar with no additional input. This is the current default.
Hand lexer gives the best performance but cannot be mechanically generated. A hand lexer requires the author to annotate which rules are lexical (defining character sets) versus syntactic (defining structure). That distinction is not present in standard EBNF or ABNF — it must be supplied separately. Without it, a generator cannot know which rules to collapse into token types and which to keep as recursive-descent methods.
Regex lexer can be partially generated (the regex can be derived from the grammar's terminal set) but has two practical problems: nested comment styles require a pre-pass that the regex cannot handle, and it is consistently slower than the scannerless baseline on these inputs.
Span + classify is fully mechanisable:
- `spanTokenize` is effectively grammar-agnostic. It needs only three pieces of information — which characters start an identifier, which continue one, and what the comment and string delimiters are. These are parameterised via `SPAN_OPTS` and can be inferred from the grammar's terminal set with high confidence for the common cases.
- `classify` is a purely mechanical mapping from the grammar's terminal set. Every single-character punctuation terminal becomes a `case` in a `switch`. Every multi-character all-letter terminal becomes an entry in `KEYWORDS`. No grammar annotation is needed.
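A sketch of what that mechanical mapping looks like. The enum members, keyword set, and span-type strings here are illustrative stand-ins, not rdp-gen's actual generated output:

```typescript
// Illustrative span → token-type mapping: punct terminals become switch
// cases, keyword terminals become KEYWORDS entries, everything else falls
// through to a generic type. Names are hypothetical.
enum TT { NAME, LIT, EQ, SEMI, RULE_KW }

const KEYWORDS: Record<string, TT> = { rule: TT.RULE_KW }; // hypothetical keyword

// spanType comes from the grammar-agnostic span detector (WORD/PUNCT/STRING…).
function classify(spanType: string, text: string): TT {
  if (spanType === 'STRING') return TT.LIT;
  if (spanType === 'WORD') return KEYWORDS[text] ?? TT.NAME;
  switch (text) { // every single-char punct terminal → one case
    case '=': return TT.EQ;
    case ';': return TT.SEMI;
    default: throw new Error(`unexpected token ${JSON.stringify(text)}`);
  }
}
```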
## The tradeoff
| Approach | Speed | Mechanisable | Manual work |
|---|---|---|---|
| Scannerless (default) | 1× | ✓ fully | none |
| Hand lexer | ~5× | ✗ | annotate lexical vs syntactic rules; write tokeniser by hand |
| Regex lexer | 0.8× | partial | handle nested comments manually |
| Span + classify | ~4× | ✓ fully | fill in parse method stubs |
The span + classify pipeline captures most of the available speedup while
remaining fully derivable from the grammar. That makes it the right choice for
a generated scaffold: it gives users a concrete, working starting point that
they complete by filling in the parse logic — exactly the same model as every
other rdp-gen scaffold.
The hand lexer remains the ceiling for maximum-performance scenarios, but it requires authoring effort proportional to grammar size and is outside the scope of what a generator can produce without additional annotations.
## The two models
The scannerless parser reads characters directly and produces a typed ParseTree in one pass. The span-lexer pipeline separates boundary detection (grammar-agnostic) from token typing (generated) from recursive descent (your stubs) — three composable stages where each has a clear, narrow responsibility.
## Generating the scaffold
```shell
rdp-gen grammar.ebnf --parser-name MyParser --lexer span --output src/MyLexer.ts
```

The generated file contains:
| Component | Grammar-specific? | Mechanisable? | Your work |
|---|---|---|---|
| `TT` — token-type enum | Yes | ✓ fully — derived from the grammar's terminal set | None |
| `spanTokenize` — span detector | No — grammar-agnostic | — | Adjust `SPAN_OPTS` if your grammar uses non-default delimiters or comment style |
| `classify` — span → token mapper | Yes | ✓ fully — every punct terminal becomes a `switch` case; every keyword becomes a `KEYWORDS` entry | None |
| `{Base}TokenParser` — recursive descent (extends `TokenRDParser`) | Yes | ✓ partially — one stub per production rule | Fill in each `#parse<Rule>()` body |
Because `spanTokenize` is grammar-agnostic, the same boundary-detection logic would work unchanged for any grammar — only `SPAN_OPTS` needs adjusting. `classify` and `{Base}TokenParser` are fully derived from the grammar and are where the grammar-specific work lives. Token navigation (`peekToken()`, `eatToken()`, `matchToken()`) is inherited from `TokenRDParser`.
Unlike the generated parser, the scaffold is not regenerated — it is a
one-time starting point. Fill in the `#parse<Rule>()` methods with the logic
for each production rule and change the static `parse()` return type from
`unknown` to your concrete result type.
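To show what a filled-in stub looks like, here is a sketch. Since `TokenRDParser`'s real implementation is not reproduced in this document, a minimal stand-in base class provides the three navigation methods named above; the grammar rule (`assign = NAME '=' NAME ';'`) and method names are hypothetical, and the method is public rather than a `#parse<Rule>()` private for testability:

```typescript
// Minimal stand-in for a TokenRDParser-like base: a cursor over a token
// array plus the navigation methods the scaffold inherits.
type Token = { type: string; text: string };

class TokenRDParserSketch {
  protected i = 0;
  constructor(protected tokens: Token[]) {}
  protected peekToken(): Token | undefined { return this.tokens[this.i]; }
  protected eatToken(): Token { return this.tokens[this.i++]; }
  protected matchToken(type: string): Token {
    const t = this.peekToken();
    if (!t || t.type !== type) throw new Error(`expected ${type}`);
    return this.eatToken();
  }
}

class MyTokenParser extends TokenRDParserSketch {
  // A filled-in stub for the hypothetical rule: assign = NAME '=' NAME ';'
  parseAssign(): { lhs: string; rhs: string } {
    const lhs = this.matchToken('NAME').text;
    this.matchToken('EQ');
    const rhs = this.matchToken('NAME').text;
    this.matchToken('SEMI');
    return { lhs, rhs };
  }
}
```

The parse body reads as a direct transcription of the production rule — that one-to-one shape is the point of the scaffold model.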
```shell
# Compare the scannerless generated parser with the span-lexer scaffold:
rdp-gen grammar.ebnf --parser-name MyParser --output src/MyParser.ts
rdp-gen grammar.ebnf --parser-name MyParser --lexer span --output src/MyLexer.ts
```

For grammars where character-class rules are not a significant factor, the scannerless generated parser is the simpler choice and plenty fast. Reach for the span-lexer scaffold when profiling shows the parser is a bottleneck and your grammar has rules matching large sets of individual characters.