std.text.regex
Regex is the knife. PEG is the calligraphy brush.
std.text.regex provides compile-time validated regex literals with typed capture groups. It enables the $-family positional access in :script ($1, $2, etc.) and integrates with the pipeline algebra.
Quick Example

    let pattern := r/(\d{4})-(?<month>\d{2})-(?<day>\d{2})/

    if let Some(m) := pattern.match("2024-03-15") then
        # Positional: $1 = "2024"
        # Named: m.month = "03", m.day = "15"
        print("Year: $1, Month: ${m.month}")
    end

Why Regex in Janus?
The Janus philosophy: Regex is a bounded tactical DSL. Elegance belongs to PEG.
- Use regex when the pattern is short and local
- Use PEG when the pattern has names, structure, or meaning
- If a regex needs more than a few captures, it has probably become PEG-shaped
Features
Compile-Time Validation
Invalid regex syntax produces a compile error, not a runtime crash:

    # This won't compile: invalid syntax is caught at compile time
    let bad := r/[/

Typed Captures
Captures are typed at compile time based on the pattern:

    # No captures → unit type
    let simple: Regex[()] := r/^hello$/

    # Positional captures → tuple
    let nums: Regex[(u64, u64)] := r/(\d+)-(\d+)/

    # Named captures → struct
    let date: Regex[Match { year: u64, month: u64 }] := r/(?<year>\d{4})-(?<month>\d{2})/

Linear-Time Matching
The engine uses Thompson’s NFA construction, so for a fixed pattern matching takes O(n) time in the length of the input, even in the worst case. No backtracking. No ReDoS vulnerabilities.
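To make the linear-time claim concrete, here is a minimal sketch of Thompson-style NFA simulation. It is written in Python purely as an illustration and is not the Janus engine: it supports only literals, `|`, `*`, and parentheses, and the names `State`, `compile_nfa`, `closure`, and `full_match` are hypothetical. The point it shows is that the matcher tracks a set of NFA states and reads each input character exactly once, so the work per character is bounded by the size of the pattern.

```python
class State:
    def __init__(self):
        self.char = None    # literal consumed on the `out` edge, or None
        self.out = None     # target of the character edge
        self.eps = []       # epsilon (free) transitions


def compile_nfa(pattern):
    """Recursive-descent parse into an NFA; returns (start, accept)."""
    pos = 0

    def alternation():
        nonlocal pos
        start, end = concatenation()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            s2, e2 = concatenation()
            new_start, new_end = State(), State()
            new_start.eps += [start, s2]    # try either branch
            end.eps.append(new_end)
            e2.eps.append(new_end)
            start, end = new_start, new_end
        return start, end

    def concatenation():
        start = end = State()
        while pos < len(pattern) and pattern[pos] not in "|)":
            s, e = repetition()
            end.eps.append(s)               # chain fragments together
            end = e
        return start, end

    def repetition():
        nonlocal pos
        s, e = atom()
        while pos < len(pattern) and pattern[pos] == "*":
            pos += 1
            new_start, new_end = State(), State()
            new_start.eps += [s, new_end]   # enter the loop or skip it
            e.eps += [s, new_end]           # repeat or leave
            s, e = new_start, new_end
        return s, e

    def atom():
        nonlocal pos
        if pattern[pos] == "(":
            pos += 1
            s, e = alternation()
            pos += 1                        # skip ')'
            return s, e
        s, e = State(), State()
        s.char, s.out = pattern[pos], e
        pos += 1
        return s, e

    return alternation()


def closure(states):
    """Follow epsilon edges; each state enters the set at most once."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in stack.pop().eps:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def full_match(pattern, text):
    """True if the whole text matches; one pass over the input."""
    start, accept = compile_nfa(pattern)
    current = closure({start})
    for ch in text:
        moved = {s.out for s in current if s.char == ch}
        current = closure(moved)
        if not current:
            return False
    return accept in current


# (a*)*b is a classic pathological case for backtracking engines.
print(full_match("(a*)*b", "a" * 30 + "b"))
print(full_match("(a*)*b", "a" * 30))
```

Because the `seen` set in `closure` admits each state at most once, nested repetition such as `(a*)*` cannot cause the exponential blow-up that a backtracking matcher can suffer on the second input.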
Full Unicode
Unicode support is enabled by default. The engine correctly handles Unicode code points and grapheme clusters.
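The distinction between code points and grapheme clusters matters because one user-perceived character can span several code points. The Python sketch below illustrates only that distinction; it says nothing about how the Janus engine is implemented. `approx_graphemes` is a hypothetical helper that approximates clusters by attaching combining marks to the preceding base character, which is far simpler than the full Unicode segmentation rules.

```python
import unicodedata


def approx_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: it ignores emoji ZWJ sequences, Hangul jamo,
    regional indicators, and other cases covered by UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))                    # 5 code points
print(len(composed))                      # 4 code points
print(len(approx_graphemes(decomposed)))  # 4 user-perceived characters
```

A matcher that works only on code points would see the two spellings of "café" as different lengths, which is why the level at which a pattern such as `.` operates is worth checking for any engine.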
Syntax Reference
Supported Constructs

| Construct | Example | Description |
|---|---|---|
| Literal | abc | Match exact string |
| Any | . | Any character (except newline) |
| Character class | [a-z], [^0-9] | Set or range |
| Alternation | a\|b | Either/or |
| Anchor | ^, $ | Start/end of string |
| Word boundary | \b | Word boundary |
| Digit | \d, \D | Digit/non-digit |
| Positional capture | (pattern) | Capture → $1, $2 |
| Named capture | (?<name>pattern) | Capture → struct field |
| Non-capturing | (?:pattern) | Group without capture |
| Repetition | *, +, ?, {n,m} | Repeat operators |
Not Supported
- Lookahead/lookbehind
- Backreferences
- Conditional patterns
- PCRE extensions
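A pattern that would normally rely on a backreference can often be split into a simple match plus a comparison in ordinary code. The sketch below shows the idea in Python rather than Janus, and `repeated_words` is a hypothetical helper: instead of a PCRE-style `(\w+)\s+\1`, it matches words with a plain `\w+` and compares neighbours afterwards. Unlike the backreference form, it also treats words separated by punctuation as adjacent.

```python
import re

WORD = re.compile(r"\w+")   # plain pattern, no backreference needed


def repeated_words(text):
    """Return words that appear twice in a row, case-insensitively."""
    words = [m.group(0) for m in WORD.finditer(text)]
    return [b for a, b in zip(words, words[1:]) if a.lower() == b.lower()]


print(repeated_words("this is is a test of the the parser"))
# prints ['is', 'the']
```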
Operations

    # Check if matches
    pattern.is_match(text) -> bool

    # Find first match
    pattern.match(text) -> ?Match[T]

    # Find all matches
    pattern.find_all(text) -> Iterator[Match[T]]

    # Replace
    pattern.replace(text, "replacement") -> String

    # Split
    pattern.split(text) -> Iterator[String]

Capture Access

    if let Some(m) := r/(\d+)-(\w+)/.match("123-abc") then
        # Positional via $1, $2 (in :script)
        let num := $1
        let word := $2

        # Or direct tuple access
        let num2 := m.0
        let word2 := m.1
    end

    if let Some(m) := r/(?<year>\d{4})-(?<month>\d{2})/.match("2024-03") then
        # Named capture access: field names match the capture names
        let year := m.year
        let month := m.month
    end

Pipeline Integration
The regex module integrates with the :script pipeline algebra:

    <<p"access.log">>
        |> grep(r/ERROR.*(?<code>\d+)/)     # Filter lines with ERROR
        |> map($_.code)                     # Extract named capture "code"
        |> filter($1.parse_int()? > 500)    # Filter by positional $1
        |> unique()
        |> for_each(println)

Examples
Email Validation

    func is_valid_email(email: String) -> bool do
        let pattern := r/^[\w.-]+@[\w.-]+\.\w{2,}$/
        return pattern.is_match(email)
    end

Date Parsing

    func parse_iso_date(line: String) -> ?(u64, u64, u64) do
        # Capture names must match the fields read below
        let pattern := r/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/
        match pattern.match(line) do
            | .some(m) => return .some((m.year?, m.month?, m.day?))
            | .none => return .none
        end
    end

Log Processing

    <<p"server.log">>
        |> grep(r/\[(?<level>\w+)\]/)
        |> map($_.level)
        |> filter($1 == "ERROR")
        |> count()
        |> println()

Find and Replace

    func mask_ssn(text: String) -> String do
        let pattern := r/\d{3}-\d{2}-\d{4}/
        return pattern.replace(text, "XXX-XX-XXXX")
    end

Comparison

| Feature | Janus regex | Python re | JavaScript | Perl |
|---|---|---|---|---|
| Compile-time validation | ✅ | ❌ | ❌ | ❌ |
| Typed captures | ✅ | ❌ | ❌ | ❌ |
| Linear-time guarantee | ✅ | ❌ | ⚠️ | ❌ |
| $1, $2 in pipelines | ✅ | ❌ | ❌ | ❌ |
| PEG alternative | ✅ | ❌ | ❌ | ❌ |
Next Steps
- PEG (SPEC-046): For complex, readable parsing
- TextStream: Pipeline algebra
- Script Profile: Using regex in pipelines
Elegance belongs to PEG. Regex is the knife.