std.text.index

std.text.index provides a bounded, in-memory inverted full-text index for short documents and compiler-side search substrates. It is agnostic to the storage substrate: the module owns tokenization, posting lists, and BM25-class scoring, while persistence remains the caller’s job.

The current implementation is pure :core Janus. It does not allocate, does not open files, and stores all index state inside the caller-owned Index value.

use std.text.index as idx

pub func main() -> i32 do
    var index = idx.make_index()
    const doc0_text: []u8 = "sovereign identity mesh"
    const doc1_text: []u8 = "mesh network protocol"
    var doc0_id: u32 = 0
    var doc1_id: u32 = 0
    if idx.add_doc(&index, 0, doc0_text, &doc0_id) == false do
        return 1
    end
    if idx.add_doc(&index, 1, doc1_text, &doc1_id) == false do
        return 2
    end
    var hits: [idx.MAX_POSTINGS_PER_TERM]u32 = .undefined
    var hit_count: u32 = 0
    if idx.query_term(&index, "mesh", hits[0..], &hit_count) == false do
        return 3
    end
    if hit_count != 2 do
        return 4
    end
    return 0
end

The shipped index is deliberately bounded while the compiler’s dynamic slice-field lowering matures.

| Constant | Meaning | Current value |
| --- | --- | --- |
| MAX_TERMS | Distinct terms per index | 128 |
| MAX_POSTINGS_PER_TERM | Documents per posting list | 64 |
| MAX_DOCS | Documents per index | 128 |
| MAX_TERM_LEN | Bytes retained per normalized term | 48 |
| MAX_TOKENS_PER_DOC | Tokens extracted from one document | 128 |

Terms are ASCII-lowercased. Whitespace and common ASCII punctuation split tokens. Stemming, Unicode segmentation, stop-word filtering, and persistent storage are intentionally outside the current module.
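
The tokenization rules above can be modelled in a few lines. The sketch below is written in Python rather than Janus and is a behavioural model only, not the module's code. It assumes the separator set is ASCII whitespace plus Python's string.punctuation (the page does not list the exact characters), and it assumes an over-long token is kept as its first MAX_TERM_LEN bytes. The name tokenize_model is hypothetical.

```python
import string

MAX_TERM_LEN = 48          # value from the limits table
MAX_TOKENS_PER_DOC = 128   # value from the limits table

# Assumption: the module's exact separator set is not documented on this page.
# This model treats ASCII whitespace and string.punctuation as separators.
SEPARATORS = frozenset((string.whitespace + string.punctuation).encode("ascii"))


def tokenize_model(text: bytes) -> tuple[list[bytes], bool, bool]:
    """Return (terms, truncated, overflowed) for a byte string."""
    terms: list[bytes] = []
    truncated = False
    overflowed = False
    current = bytearray()

    def flush() -> None:
        nonlocal truncated, overflowed
        if not current:
            return
        if len(terms) >= MAX_TOKENS_PER_DOC:
            overflowed = True          # more tokens than the bounded profile allows
        else:
            if len(current) > MAX_TERM_LEN:
                truncated = True       # assumption: keep only the first MAX_TERM_LEN bytes
            terms.append(bytes(current[:MAX_TERM_LEN]))
        current.clear()

    for byte in text:
        if byte in SEPARATORS:
            flush()
        else:
            # Lowercase only ASCII 'A'..'Z'; every other byte passes through unchanged.
            current.append(byte + 32 if 65 <= byte <= 90 else byte)
    flush()
    return terms, truncated, overflowed
```

In this model, tokenize_model(b"Sovereign, identity-MESH!") yields the three terms sovereign, identity, and mesh. Non-ASCII bytes are neither split nor case-folded, which matches the statement that Unicode segmentation is out of scope.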

The profile is bounded, but overflow is explicit. Use the status APIs when a caller must distinguish “term missing” from “result buffer too small” or “document rejected because the bounded profile was exceeded”.

| Status | Meaning |
| --- | --- |
| INDEX_OK | Operation completed without truncation or capacity loss |
| INDEX_DOC_CAPACITY_EXCEEDED | The document table is full |
| INDEX_TOKEN_CAPACITY_EXCEEDED | Tokenization saw more than MAX_TOKENS_PER_DOC tokens |
| INDEX_TERM_TRUNCATED | A token exceeded MAX_TERM_LEN |
| INDEX_TERM_CAPACITY_EXCEEDED | The term dictionary is full |
| INDEX_POSTING_CAPACITY_EXCEEDED | A term posting list is full |
| INDEX_RESULT_CAPACITY_EXCEEDED | Query matched more docs than out_buf can hold |
| INDEX_MISSING_TERM | Query term was not indexed |

pub struct Index {
    terms: [MAX_TERMS]TermEntry,
    term_count: u32,
    docs: [MAX_DOCS]DocEntry,
    doc_count: u32,
    total_len: u64,
}

pub struct Bm25Stats {
    doc_count: u64,
    total_token_len: u64,
}

Index is the whole index image. Put it on the stack, inside a larger store record, or behind a caller-owned allocation. The module does not retain any outside pointer to document text.

pub func make_index() -> Index

Returns a fresh empty index.

pub func tokenize_status(
    text: []const u8,
    out_terms: *[MAX_TOKENS_PER_DOC][MAX_TERM_LEN]u8,
    out_lens: *[MAX_TOKENS_PER_DOC]u32,
    out_count: *u32,
) -> u32

pub func tokenize(
    text: []const u8,
    out_terms: *[MAX_TOKENS_PER_DOC][MAX_TERM_LEN]u8,
    out_lens: *[MAX_TOKENS_PER_DOC]u32,
) -> u32

Splits a byte string into lowercase terms. The caller owns the fixed output arrays. tokenize_status writes the number of copied tokens to out_count and returns an INDEX_* status. tokenize is the compatibility wrapper that returns only the copied count.

pub func add_doc_status(
    idx: *Index,
    doc_id: u32,
    text: []const u8,
    out_id: *u32,
    out_token_count: *u32,
) -> u32

pub func add_doc(idx: *Index, doc_id: u32, text: []const u8, out_id: *u32) -> bool

Tokenizes text, stores document metadata, updates posting lists, and writes doc_id into out_id on success. add_doc_status is transactional for non-OK capacity statuses: it returns the relevant INDEX_* status without mutating idx. add_doc remains as the compatibility wrapper and returns false for any non-OK checked status.
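
One way to get the transactional behaviour described above is to check every capacity limit before touching any state. The following Python sketch models that check-then-commit pattern. It is not the module's implementation: IndexModel is a hypothetical class, the status values are stand-in strings (the page does not show the numeric INDEX_* values), the input is assumed to be already tokenized, and duplicate doc_id handling is ignored.

```python
MAX_TERMS = 128
MAX_POSTINGS_PER_TERM = 64
MAX_DOCS = 128

# Stand-in status labels; the module's actual INDEX_* values are not shown on this page.
INDEX_OK = "INDEX_OK"
INDEX_DOC_CAPACITY_EXCEEDED = "INDEX_DOC_CAPACITY_EXCEEDED"
INDEX_TERM_CAPACITY_EXCEEDED = "INDEX_TERM_CAPACITY_EXCEEDED"
INDEX_POSTING_CAPACITY_EXCEEDED = "INDEX_POSTING_CAPACITY_EXCEEDED"


class IndexModel:
    """Hypothetical, simplified model of a bounded inverted index."""

    def __init__(self) -> None:
        self.postings: dict[str, list[int]] = {}  # term -> document IDs
        self.doc_lens: dict[int, int] = {}        # doc_id -> token count
        self.total_len = 0                        # sum of all token counts

    def add_doc_status(self, doc_id: int, terms: list[str]) -> str:
        # Phase 1: check every limit without modifying the index.
        if len(self.doc_lens) >= MAX_DOCS:
            return INDEX_DOC_CAPACITY_EXCEEDED
        unique = set(terms)
        new_terms = [t for t in unique if t not in self.postings]
        if len(self.postings) + len(new_terms) > MAX_TERMS:
            return INDEX_TERM_CAPACITY_EXCEEDED
        for term in unique:
            if len(self.postings.get(term, [])) >= MAX_POSTINGS_PER_TERM:
                return INDEX_POSTING_CAPACITY_EXCEEDED
        # Phase 2: all checks passed, so the commit cannot fail part-way.
        for term in sorted(unique):
            self.postings.setdefault(term, []).append(doc_id)
        self.doc_lens[doc_id] = len(terms)
        self.total_len += len(terms)
        return INDEX_OK
```

Because a rejected document leaves the model untouched, a caller can react to the status (for example by splitting the document or starting a new index) and keep using the same index value.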

pub func query_term_status(
    idx: *const Index,
    term: []const u8,
    out_buf: []u32,
    out_count: *u32,
    out_total: *u32,
) -> u32

pub func query_term(
    idx: *const Index,
    term: []const u8,
    out_buf: []u32,
    out_count: *u32,
) -> bool

Looks up one normalized term and copies matching document IDs into out_buf. query_term_status writes the copied count to out_count, the full posting-list size to out_total, and returns INDEX_RESULT_CAPACITY_EXCEEDED when the caller buffer is too small. query_term remains as the compatibility wrapper and returns false only when the term was never indexed.
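
The relationship between out_buf, out_count, and out_total can be illustrated with a small Python model. This is a sketch of the contract as described above, not the module's code: query_term_status_model is a hypothetical name, the statuses are stand-in strings, the postings are a plain dictionary, and the query term is assumed to be already normalized.

```python
# Stand-in status labels; the module's actual INDEX_* values are not shown on this page.
INDEX_OK = "INDEX_OK"
INDEX_RESULT_CAPACITY_EXCEEDED = "INDEX_RESULT_CAPACITY_EXCEEDED"
INDEX_MISSING_TERM = "INDEX_MISSING_TERM"


def query_term_status_model(
    postings: dict[str, list[int]],
    term: str,
    out_buf: list[int],
) -> tuple[str, int, int]:
    """Fill out_buf in place and return (status, out_count, out_total)."""
    posting_list = postings.get(term)
    if posting_list is None:
        return INDEX_MISSING_TERM, 0, 0
    out_total = len(posting_list)               # full posting-list size
    out_count = min(out_total, len(out_buf))    # how many IDs actually fit
    out_buf[:out_count] = posting_list[:out_count]
    if out_count < out_total:
        return INDEX_RESULT_CAPACITY_EXCEEDED, out_count, out_total
    return INDEX_OK, out_count, out_total
```

When the status reports truncation, out_total tells the caller how large a buffer a retry needs. A buffer of MAX_POSTINGS_PER_TERM entries, as in the example at the top of this page, is always large enough because no posting list can exceed that bound.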

pub func bm25_score(idx: *const Index, doc_id: u32, term: []const u8) -> u32

Computes a fixed-point BM25-class score for a single (doc_id, term) pair. The return value is the score multiplied by 1000, so a return value of 1500 represents a score of 1.5. A missing document, missing term, or zero term frequency returns 0.

pub func bm25_stats(idx: *const Index) -> Bm25Stats

Returns corpus-level counters used by the scorer and by smoke tests: doc_count and total_token_len.
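
To show how the corpus counters feed a score of this kind, here is a Python sketch of one common BM25 variant with the result scaled by 1000. It rests on assumptions throughout: the page only says the scorer is "BM25-class", so the IDF form, the parameters k1 = 1.2 and b = 0.75, and the use of floating point are illustrative choices and may differ from the module. The name bm25_score_model is hypothetical.

```python
import math

SCALE = 1000  # the page states that scores are scaled by 1000
K1 = 1.2      # assumption: conventional BM25 default, not confirmed for this module
B = 0.75      # assumption: conventional BM25 default, not confirmed for this module


def bm25_score_model(
    tf: int,               # occurrences of the term in the document
    doc_len: int,          # token count of the document
    df: int,               # number of documents containing the term
    doc_count: int,        # corpus counter, as in Bm25Stats.doc_count
    total_token_len: int,  # corpus counter, as in Bm25Stats.total_token_len
) -> int:
    """Return a BM25 score multiplied by SCALE and truncated to an integer."""
    if tf == 0 or df == 0 or doc_count == 0 or total_token_len == 0:
        return 0  # mirrors the documented "missing term or zero frequency" case
    avg_len = total_token_len / doc_count
    # Assumption: the non-negative IDF form ln(1 + (N - df + 0.5) / (df + 0.5)).
    idf = math.log(1.0 + (doc_count - df + 0.5) / (df + 0.5))
    norm = tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * doc_len / avg_len))
    return int(idf * norm * SCALE)
```

Under these assumptions, a term that appears in fewer documents scores higher than one that appears in all of them, and dividing the returned integer by 1000 recovers the approximate real-valued score.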

std.text.index is the algorithmic core. It is useful for small in-memory indexes, compiler smoke tests, and future code-graph query substrates. It is not yet the persistent ASTDB search layer. Reserved key encoding, storage integration, and reopen behavior belong in a later store-backed facade.

The focused smoke target is:

./scripts/zb test-text-index

The smoke test builds std/text/index_smoke.jan, adds two documents, verifies shared and unique term lookups, checks corpus stats, and confirms that bm25_score returns non-zero for indexed hits and zero for misses.

The bounded-profile status smoke is:

./scripts/zb test-text-index-overflow

It verifies that token overflow, term truncation, term-capacity overflow, posting-list overflow, document-table overflow, and query-result truncation all produce explicit INDEX_* statuses.