Lexer Architecture
Janus Lexer Architecture
Status: Active (Dual-Lexer)
Target: Single ASTDB-based lexer for stable beta release
Last Updated: 2026-01-25
Overview
The Janus compiler maintains two separate lexers that serve different architectural purposes. This document explains why they exist, their differences, and the planned unification strategy.
The Two Lexers
1. janus_tokenizer (Traditional Parser Path)
Location: compiler/libjanus/janus_tokenizer.zig
Used By:
- janus_parser.zig (recursive descent parser)
- All E2E compilation tests
- LLVM codegen pipeline
Pipeline:

    Source → Tokenizer → Parser → AST → QTJIR → LLVM IR → Executable

| Aspect | Description |
|---|---|
| Token Storage | Token struct with raw lexeme: []const u8 slices |
| Memory Model | Tokens own source slices directly |
| Trivia Handling | Not captured (whitespace/comments discarded) |
| Incremental Support | No |
| Profile Support | :min, :go, :sovereign keywords |
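
To make the storage model in the table concrete, here is a minimal sketch of a tokenizer whose tokens hold slices into the source buffer and which drops whitespace as it goes. `SliceToken` and `MiniTokenizer` are hypothetical names invented for illustration; the real `Token` also carries a type and a span, and the real tokenizer recognizes far more than whitespace-separated words.

```zig
const std = @import("std");

// Hypothetical, simplified token: only the lexeme slice is modeled here.
const SliceToken = struct {
    lexeme: []const u8, // view into the source buffer, no copy is made
};

// Hypothetical mini-tokenizer that splits on whitespace only.
const MiniTokenizer = struct {
    source: []const u8,
    pos: usize = 0,

    fn next(self: *MiniTokenizer) ?SliceToken {
        // Trivia (here: whitespace) is skipped and never recorded.
        while (self.pos < self.source.len and std.ascii.isWhitespace(self.source[self.pos])) {
            self.pos += 1;
        }
        if (self.pos >= self.source.len) return null;

        const start = self.pos;
        while (self.pos < self.source.len and !std.ascii.isWhitespace(self.source[self.pos])) {
            self.pos += 1;
        }
        return SliceToken{ .lexeme = self.source[start..self.pos] };
    }
};

test "lexemes are views into the source" {
    const source: []const u8 = "let  x";
    var t = MiniTokenizer{ .source = source };

    const first = t.next().?;
    const second = t.next().?;
    try std.testing.expectEqualStrings("let", first.lexeme);
    try std.testing.expectEqualStrings("x", second.lexeme);
    // The second lexeme starts at byte offset 5 of the same buffer.
    try std.testing.expect(second.lexeme.ptr == source.ptr + 5);
    try std.testing.expect(t.next() == null);
}
```

Because each lexeme is a view into the source, the source buffer has to stay alive for as long as the tokens are in use.
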
Token Structure:

    pub const Token = struct {
        type: TokenType,    // Direct enum (126 variants)
        lexeme: []const u8, // Raw source slice
        span: SourceSpan,   // Start/end position
    };

2. RegionLexer (ASTDB Path)
Location: compiler/astdb/lexer.zig
Used By:
- region.zig (ASTDB region-based parsing)
- semantic_analyzer.zig (type checking)
- LSP server (incremental updates)
Pipeline:

    Source → RegionLexer → ASTDB Snapshot → Columnar Queries → Semantic Analysis

| Aspect | Description |
|---|---|
| Token Storage | Token struct with str: ?StrId (interned string ID) |
| Memory Model | String interning via StrInterner (deduplication) |
| Trivia Handling | Separate Trivia array with trivia_lo/trivia_hi indices |
| Incremental Support | Yes (region boundaries: start_pos, end_pos) |
| Designed For | Columnar database queries, incremental parsing |
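
The deduplication that string interning provides can be sketched as follows. This is only an illustration of the general technique: `MiniInterner` is a hypothetical stand-in, `StrId` is assumed to be a plain `u32`, and the actual `StrInterner` may be structured quite differently.

```zig
const std = @import("std");

// Assumption: the real StrId may be a distinct type rather than a bare u32.
const StrId = u32;

// Hypothetical interner: maps each distinct string to one small integer ID.
const MiniInterner = struct {
    ids: std.StringHashMap(StrId),

    fn init(allocator: std.mem.Allocator) MiniInterner {
        return .{ .ids = std.StringHashMap(StrId).init(allocator) };
    }

    fn deinit(self: *MiniInterner) void {
        self.ids.deinit();
    }

    // Returns the existing ID for `text`, or assigns the next free one.
    // Keys are borrowed, so `text` must outlive the interner.
    fn intern(self: *MiniInterner, text: []const u8) !StrId {
        if (self.ids.get(text)) |existing| return existing;
        const id: StrId = self.ids.count();
        try self.ids.put(text, id);
        return id;
    }
};

test "equal strings share one ID" {
    var interner = MiniInterner.init(std.testing.allocator);
    defer interner.deinit();

    const a = try interner.intern("count");
    const b = try interner.intern("total");
    const c = try interner.intern("count");

    try std.testing.expectEqual(a, c); // second "count" is deduplicated
    try std.testing.expect(a != b);
    try std.testing.expectEqual(@as(u32, 2), interner.ids.count());
}
```

A usable interner would also need an ID-to-string table for lookups in the other direction; it is left out here to keep the sketch short.
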
Token Structure:

    pub const Token = struct {
        kind: TokenKind,  // Enum with 220+ variants
        str: ?StrId,      // Interned string (null for punctuation)
        span: SourceSpan, // Byte offsets + line/column
        trivia_lo: u32,   // Index into trivia array
        trivia_hi: u32,   // Exclusive end of trivia
    };

Architectural Comparison
| Aspect | janus_tokenizer | RegionLexer |
|---|---|---|
| Origin | Traditional compiler front-end | ASTDB columnar database |
| Memory Model | Token owns lexeme slices | String interning (deduplication) |
| Trivia | Discarded | Preserved separately |
| Incremental | No | Yes (region boundaries) |
| Use Case | AST generation, compilation | Semantic queries, LSP |
| Optimization | Speed | Memory efficiency |
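
As an illustration of the "preserved separately" trivia model, the sketch below shows how a half-open `trivia_lo`/`trivia_hi` pair could select a token's trivia from one shared array. `Trivia`, `MiniToken`, and `triviaOf` are hypothetical, trimmed-down names; the actual definitions are in compiler/astdb/core.zig and are not reproduced here.

```zig
const std = @import("std");

// Hypothetical shapes carrying only the fields needed for the lookup.
const Trivia = struct { text: []const u8 };

const MiniToken = struct {
    trivia_lo: u32, // first trivia index belonging to this token
    trivia_hi: u32, // exclusive end
};

// Returns the trivia attached to `tok` as a view into the shared array.
fn triviaOf(all: []const Trivia, tok: MiniToken) []const Trivia {
    return all[tok.trivia_lo..tok.trivia_hi];
}

test "half-open trivia range" {
    const all = [_]Trivia{
        .{ .text = "// header\n" },
        .{ .text = "  " },
        .{ .text = " " },
    };
    const first = MiniToken{ .trivia_lo = 0, .trivia_hi = 2 };
    const bare = MiniToken{ .trivia_lo = 2, .trivia_hi = 2 };

    try std.testing.expectEqual(@as(usize, 2), triviaOf(&all, first).len);
    // An empty range (lo == hi) means the token has no trivia.
    try std.testing.expectEqual(@as(usize, 0), triviaOf(&all, bare).len);
}
```

Storing two indices per token keeps the token itself fixed-size while still letting formatting tools recover every comment and whitespace run.
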
Why Two Lexers?
Historical Reasons
- janus_tokenizer was built first for the traditional compilation pipeline
- RegionLexer was added later for ASTDB’s incremental parsing requirements
- Different consumers evolved with different data model expectations
Technical Reasons
- Different Storage Models: Raw slices vs. interned strings
- Different Consumers: Parser expects sequential stream; ASTDB expects columnar data
- Different Optimization Goals: Speed vs. memory efficiency + incrementality
Current State (2026.1.x)
Both lexers support the same token types:
- All operators (arithmetic, logical, bitwise, comparison)
- All keywords (:min, :go, :sovereign profiles)
- Numeric literals: decimal, hex (0xFF), binary (0b1010), octal (0o777)
- String literals, identifiers, punctuation
This consistency is actively maintained: any new token support must be added to both lexers.
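
The numeric literal forms listed above can be told apart by prefix. The helper below is an illustrative sketch, not a function from either lexer, and it assumes lowercase prefixes only. The values are cross-checked with `std.fmt.parseInt`, which detects the radix from the prefix when given base 0.

```zig
const std = @import("std");

const Base = enum { decimal, hex, binary, octal };

// Hypothetical helper: classifies a numeric lexeme by its prefix.
fn literalBase(lexeme: []const u8) Base {
    if (std.mem.startsWith(u8, lexeme, "0x")) return .hex;
    if (std.mem.startsWith(u8, lexeme, "0b")) return .binary;
    if (std.mem.startsWith(u8, lexeme, "0o")) return .octal;
    return .decimal;
}

test "prefix selects the base" {
    try std.testing.expectEqual(Base.hex, literalBase("0xFF"));
    try std.testing.expectEqual(Base.binary, literalBase("0b1010"));
    try std.testing.expectEqual(Base.octal, literalBase("0o777"));
    try std.testing.expectEqual(Base.decimal, literalBase("42"));

    // Base 0 asks parseInt to read the radix from the prefix.
    try std.testing.expectEqual(@as(u32, 255), try std.fmt.parseInt(u32, "0xFF", 0));
    try std.testing.expectEqual(@as(u32, 10), try std.fmt.parseInt(u32, "0b1010", 0));
    try std.testing.expectEqual(@as(u32, 511), try std.fmt.parseInt(u32, "0o777", 0));
}
```

A check of this shape, run against both lexers with the same inputs, is one way to catch divergence when a new literal form is added.
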
Unification Strategy (Future)
Target: The stable beta release should use a single ASTDB-based lexer.
Phase 1: Adapter Layer
Create a thin adapter that converts RegionLexer output to janus_tokenizer format:

    Source → RegionLexer → Adapter → Parser (unchanged)

Phase 2: Parser Migration
Modify parser to consume ASTDB tokens directly:

    Source → RegionLexer → ASTDB Snapshot → Parser → AST

Phase 3: Deprecation
Remove janus_tokenizer once all consumers migrate to ASTDB path.
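
The Phase 1 adapter could look roughly like the sketch below, which turns an interned-ID token back into the slice-based shape the parser expects. Every type here is a simplified, hypothetical stand-in (`AstdbToken`, `LegacyToken`, `Span`, and the `strings` lookup table are assumptions), and the mapping between the two token enums is omitted entirely.

```zig
const std = @import("std");

// Simplified, hypothetical stand-ins for the real types.
const StrId = u32;
const Span = struct { start: usize, end: usize };

const AstdbToken = struct {
    str: ?StrId, // null for punctuation
    span: Span,
};

const LegacyToken = struct {
    lexeme: []const u8,
    span: Span,
};

// Converts one ASTDB-style token into the slice-based shape.
// `strings` maps StrId -> text; `source` is the original buffer.
fn adapt(tok: AstdbToken, strings: []const []const u8, source: []const u8) LegacyToken {
    const lexeme = if (tok.str) |id|
        strings[id]
    else
        source[tok.span.start..tok.span.end]; // punctuation: recover from the span
    return .{ .lexeme = lexeme, .span = tok.span };
}

test "adapter restores raw lexemes" {
    const source = "x;";
    const strings = [_][]const u8{"x"};
    const ident = AstdbToken{ .str = 0, .span = .{ .start = 0, .end = 1 } };
    const semi = AstdbToken{ .str = null, .span = .{ .start = 1, .end = 2 } };

    try std.testing.expectEqualStrings("x", adapt(ident, &strings, source).lexeme);
    try std.testing.expectEqualStrings(";", adapt(semi, &strings, source).lexeme);
}
```

Falling back to the span for tokens without an interned string is one possible design choice; always slicing the source by span would also work and would avoid the lookup table, at the cost of not exercising the interner.
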
Benefits of Unification
- Single source of truth for tokenization
- Automatic incremental parsing for all paths
- Memory-efficient string interning everywhere
- Trivia preservation for formatting tools
Challenges
- Parser expects raw lexemes; ASTDB uses interned IDs
- Performance regression risk during transition
- Test coverage must be maintained
Files Reference
| File | Purpose |
|---|---|
| compiler/libjanus/janus_tokenizer.zig | Traditional tokenizer |
| compiler/astdb/lexer.zig | ASTDB region-based lexer |
| compiler/astdb/core.zig | ASTDB token/trivia definitions |
| compiler/libjanus/janus_parser.zig | Uses janus_tokenizer |
| compiler/semantic_analyzer.zig | Uses RegionLexer |
Maintenance Guidelines
When adding new token support:
- Update both lexers for consistency
- Test with E2E tests (uses janus_tokenizer path)
- Test with semantic analysis (uses RegionLexer path)
- Document any divergence in this file