std.text.index
std.text.index provides a bounded, in-memory inverted full-text index for
short documents and compiler-side search substrates. It is substrate-agnostic:
the module owns tokenization, posting lists, and BM25-class scoring, while
persistence remains the caller’s job.
The current implementation is pure :core Janus. It does not allocate, does
not open files, and stores all index state inside the caller-owned Index
value.
Quick Example

    use std.text.index as idx

    pub func main() -> i32 do
        var index = idx.make_index()

        const doc0_text: []u8 = "sovereign identity mesh"
        const doc1_text: []u8 = "mesh network protocol"

        var doc0_id: u32 = 0
        var doc1_id: u32 = 0

        if idx.add_doc(&index, 0, doc0_text, &doc0_id) == false do return 1 end
        if idx.add_doc(&index, 1, doc1_text, &doc1_id) == false do return 2 end

        var hits: [idx.MAX_POSTINGS_PER_TERM]u32 = .undefined
        var hit_count: u32 = 0
        if idx.query_term(&index, "mesh", hits[0..], &hit_count) == false do return 3 end

        if hit_count != 2 do return 4 end

        return 0
    end

Capacity Model
The shipped index is deliberately bounded while the compiler’s dynamic slice-field lowering matures.
| Constant | Meaning | Current value |
|---|---|---|
| MAX_TERMS | Distinct terms per index | 128 |
| MAX_POSTINGS_PER_TERM | Documents per posting list | 64 |
| MAX_DOCS | Documents per index | 128 |
| MAX_TERM_LEN | Bytes retained per normalized term | 48 |
| MAX_TOKENS_PER_DOC | Tokens extracted from one document | 128 |
Terms are ASCII-lowercased. Whitespace and common ASCII punctuation split tokens. Stemming, Unicode segmentation, stop-word filtering, and persistent storage are intentionally outside the current module.
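The tokenization rules above can be modelled in a few lines. The sketch below is a Python model, not the Janus implementation: the function name `tokenize_model` is hypothetical, the separator set is assumed to be `string.whitespace` plus `string.punctuation`, and the choice to keep the first MAX_TERM_LEN bytes of an over-long token (rather than drop it) is an assumption based on the "Bytes retained" wording in the table.

```python
import string

MAX_TERM_LEN = 48          # from the capacity table
MAX_TOKENS_PER_DOC = 128   # from the capacity table

# Assumption: "common ASCII punctuation" is modelled as string.punctuation.
SEPARATORS = set((string.whitespace + string.punctuation).encode("ascii"))


def tokenize_model(text: bytes):
    """Return (terms, truncated, overflowed) for a byte string."""
    terms, current = [], bytearray()
    truncated = overflowed = False

    def flush():
        nonlocal truncated, overflowed
        if not current:
            return
        if len(terms) >= MAX_TOKENS_PER_DOC:
            overflowed = True          # extra tokens are dropped
        else:
            if len(current) > MAX_TERM_LEN:
                truncated = True       # keep only the first MAX_TERM_LEN bytes
            terms.append(bytes(current[:MAX_TERM_LEN]))
        current.clear()

    for byte in text:
        if byte in SEPARATORS:
            flush()
        elif 0x41 <= byte <= 0x5A:     # ASCII 'A'..'Z' -> lowercase
            current.append(byte + 0x20)
        else:
            current.append(byte)       # non-ASCII bytes pass through unchanged
    flush()
    return terms, truncated, overflowed
```

Under these assumptions, `tokenize_model(b"Sovereign identity-mesh, v2!")` yields the terms `sovereign`, `identity`, `mesh`, and `v2`, with neither flag set. The two flags correspond loosely to INDEX_TERM_TRUNCATED and INDEX_TOKEN_CAPACITY_EXCEEDED.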
The profile is bounded, but overflow is explicit. Use the status APIs when a caller must distinguish “term missing” from “result buffer too small” or “document rejected because the bounded profile was exceeded”.
| Status | Meaning |
|---|---|
| INDEX_OK | Operation completed without truncation or capacity loss |
| INDEX_DOC_CAPACITY_EXCEEDED | The document table is full |
| INDEX_TOKEN_CAPACITY_EXCEEDED | Tokenization saw more than MAX_TOKENS_PER_DOC tokens |
| INDEX_TERM_TRUNCATED | A token exceeded MAX_TERM_LEN |
| INDEX_TERM_CAPACITY_EXCEEDED | The term dictionary is full |
| INDEX_POSTING_CAPACITY_EXCEEDED | A term posting list is full |
| INDEX_RESULT_CAPACITY_EXCEEDED | Query matched more docs than out_buf can hold |
| INDEX_MISSING_TERM | Query term was not indexed |

    pub struct Index {
        terms: [MAX_TERMS]TermEntry,
        term_count: u32,
        docs: [MAX_DOCS]DocEntry,
        doc_count: u32,
        total_len: u64,
    }

    pub struct Bm25Stats {
        doc_count: u64,
        total_token_len: u64,
    }

Index is the whole index image. Put it on the stack, inside a larger store
record, or behind a caller-owned allocation. The module does not retain any
outside pointer to document text.
make_index

    pub func make_index() -> Index

Returns a fresh empty index.
tokenize

    pub func tokenize_status(
        text: []const u8,
        out_terms: *[MAX_TOKENS_PER_DOC][MAX_TERM_LEN]u8,
        out_lens: *[MAX_TOKENS_PER_DOC]u32,
        out_count: *u32,
    ) -> u32

    pub func tokenize(
        text: []const u8,
        out_terms: *[MAX_TOKENS_PER_DOC][MAX_TERM_LEN]u8,
        out_lens: *[MAX_TOKENS_PER_DOC]u32,
    ) -> u32

Splits a byte string into lowercase terms. The caller owns the fixed output
arrays. tokenize_status writes the number of copied tokens to out_count and
returns an INDEX_* status. tokenize is the compatibility wrapper that returns
only the copied count.
add_doc

    pub func add_doc_status(
        idx: *Index,
        doc_id: u32,
        text: []const u8,
        out_id: *u32,
        out_token_count: *u32,
    ) -> u32

    pub func add_doc(idx: *Index, doc_id: u32, text: []const u8, out_id: *u32) -> bool

Tokenizes text, stores document metadata, updates posting lists, and writes
doc_id into out_id on success. add_doc_status is transactional for
non-OK capacity statuses: it returns the relevant INDEX_* status without
mutating idx. add_doc remains as the compatibility wrapper and returns
false for any non-OK checked status.
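One way to obtain the transactional behaviour described above is to check every capacity limit before touching any state. The following Python model is a sketch of that check-then-commit pattern only: the function name `add_doc_model`, the dictionary layout, and the order of the checks are all assumptions, and the real module works on fixed arrays inside Index rather than dictionaries.

```python
MAX_DOCS = 128               # from the capacity table
MAX_TERMS = 128              # from the capacity table
MAX_POSTINGS_PER_TERM = 64   # from the capacity table


def add_doc_model(index: dict, doc_id: int, terms: list) -> str:
    """Check every capacity limit first, then commit; never half-apply."""
    postings = index["postings"]          # term -> list of doc ids
    unique = set(terms)
    new_terms = [t for t in unique if t not in postings]

    # Phase 1: read-only checks. Any failure returns before mutation.
    if len(index["docs"]) >= MAX_DOCS:
        return "INDEX_DOC_CAPACITY_EXCEEDED"
    if len(postings) + len(new_terms) > MAX_TERMS:
        return "INDEX_TERM_CAPACITY_EXCEEDED"
    if any(len(postings[t]) >= MAX_POSTINGS_PER_TERM
           for t in unique if t in postings):
        return "INDEX_POSTING_CAPACITY_EXCEEDED"

    # Phase 2: all checks passed, so only now is the index mutated.
    index["docs"][doc_id] = len(terms)    # hypothetical: doc id -> token count
    index["total_len"] += len(terms)
    for term in unique:
        postings.setdefault(term, []).append(doc_id)
    return "INDEX_OK"
```

In this model, a document that would overflow one posting list leaves the term dictionary, document table, and length counter exactly as they were, which is the property a caller relies on when retrying with a smaller document or a fresh index.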
query_term

    pub func query_term_status(
        idx: *const Index,
        term: []const u8,
        out_buf: []u32,
        out_count: *u32,
        out_total: *u32,
    ) -> u32

    pub func query_term(
        idx: *const Index,
        term: []const u8,
        out_buf: []u32,
        out_count: *u32,
    ) -> bool

Looks up one normalized term and copies matching document IDs into out_buf.
query_term_status writes the copied count to out_count, the full posting-list
size to out_total, and returns INDEX_RESULT_CAPACITY_EXCEEDED when the caller
buffer is too small. query_term remains as the compatibility wrapper and
returns false only when the term was never indexed.
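The relationship between out_count, out_total, and the returned status can be illustrated with a small Python model. It is a sketch under assumptions: `query_term_model` is a hypothetical name, the posting lists are plain Python lists, and reporting zero for both counts on a missing term is a guess rather than documented behaviour.

```python
def query_term_model(postings: dict, term: str, capacity: int):
    """Return (status, copied_ids, total) for a caller buffer of `capacity` slots."""
    doc_ids = postings.get(term)
    if doc_ids is None:
        # Assumption: both counts are reported as zero for a missing term.
        return "INDEX_MISSING_TERM", [], 0

    copied = doc_ids[:capacity]           # never write past the caller's buffer
    if len(doc_ids) > capacity:
        # Truncated: len(copied) plays the role of out_count,
        # len(doc_ids) the role of out_total.
        return "INDEX_RESULT_CAPACITY_EXCEEDED", copied, len(doc_ids)
    return "INDEX_OK", copied, len(doc_ids)
```

A caller that sees the truncation status can compare the copied count with the total to decide whether to retry with a larger buffer. A buffer of MAX_POSTINGS_PER_TERM entries, as in the quick example, is always large enough for a single term.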
bm25_score

    pub func bm25_score(idx: *const Index, doc_id: u32, term: []const u8) -> u32

Computes a fixed-point BM25-class score for a single (doc_id, term) pair. The
return value is scaled by 1000. A missing document, missing term, or zero term
frequency returns 0.
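For orientation, here is how a score scaled by 1000 could be derived from the corpus counters. This Python sketch uses the widely used Okapi BM25 form with k1 = 1.2 and b = 0.75. Those parameters, the IDF variant, the function name `bm25_score_model`, and the use of floating point before scaling are all assumptions: the module only promises a "BM25-class" fixed-point score, so its exact values may differ.

```python
import math

K1 = 1.2      # assumed term-frequency saturation parameter
B = 0.75      # assumed length-normalization parameter
SCALE = 1000  # from the text: the return value is scaled by 1000


def bm25_score_model(tf, doc_len, doc_count, total_token_len, docs_with_term):
    """Return an integer BM25 score scaled by SCALE, or 0 when nothing matches."""
    if tf == 0 or doc_count == 0 or docs_with_term == 0 or total_token_len == 0:
        return 0

    avg_len = total_token_len / doc_count
    # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term-frequency saturation with document-length normalization.
    norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_len))
    return int(idf * norm * SCALE)
```

With the two three-token documents from the quick example, this model gives a non-zero score for the shared term "mesh" and a higher score for a term that appears in only one document, which matches the intuition that rarer terms are more discriminating.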
bm25_stats

    pub func bm25_stats(idx: *const Index) -> Bm25Stats

Returns corpus-level counters used by the scorer and by smoke tests:
doc_count and total_token_len.
Operational Boundary
std.text.index is the algorithmic core. It is useful for small in-memory
indexes, compiler smoke tests, and future code-graph query substrates. It is not yet
the persistent ASTDB search layer. Reserved key encoding, storage integration,
and reopen behavior belong in a later store-backed facade.
Verification
The focused smoke target is:

    ./scripts/zb test-text-index

The smoke builds std/text/index_smoke.jan, adds two documents, verifies shared
and unique term lookups, checks corpus stats, and proves bm25_score returns
non-zero for indexed hits and zero for misses.
The bounded-profile status smoke is:

    ./scripts/zb test-text-index-overflow

It proves token overflow, term truncation, term-capacity overflow, posting-list
overflow, document-table overflow, and query-result truncation all produce
explicit INDEX_* statuses.