std.text.index

std.text.index provides a bounded, in-memory inverted full-text index for short documents and compiler-side search substrates. It is substrate-agnostic: the module owns tokenization, posting lists, and BM25-class scoring, while persistence remains the caller’s job.

The current implementation is pure :core Janus. It does not allocate, does not open files, and stores all index state inside the caller-owned Index value.

Quick Example

use std.text.index as idx

pub func main() -> i32 do
    var index = idx.make_index()

    const doc0_text: []u8 = "sovereign identity mesh"
    const doc1_text: []u8 = "mesh network protocol"

    var doc0_id: u32 = 0
    var doc1_id: u32 = 0

    if idx.add_doc(&index, 0, doc0_text, &doc0_id) == false do
        return 1
    end
    if idx.add_doc(&index, 1, doc1_text, &doc1_id) == false do
        return 2
    end

    var hits: [idx.MAX_POSTINGS_PER_TERM]u32 = .undefined
    var hit_count: u32 = 0
    if idx.query_term(&index, "mesh", hits[0..], &hit_count) == false do
        return 3
    end

    if hit_count != 2 do
        return 4
    end

    return 0
end

Capacity Model

The shipped index is deliberately bounded while the compiler’s dynamic slice-field lowering matures.

Constant	Meaning	Current value
`MAX_TERMS`	Distinct terms per index	128
`MAX_POSTINGS_PER_TERM`	Documents per posting list	64
`MAX_DOCS`	Documents per index	128
`MAX_TERM_LEN`	Bytes retained per normalized term	48
`MAX_TOKENS_PER_DOC`	Tokens extracted from one document	128

Terms are ASCII-lowercased. Whitespace and common ASCII punctuation split tokens. Stemming, Unicode segmentation, stop-word filtering, and persistent storage are intentionally outside the current module.
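The tokenization rule above can be modelled in a few lines of Python. This is an illustrative sketch, not the Janus implementation. The name tokenize_model is hypothetical, the separator set assumes "common ASCII punctuation" means Python's string.punctuation, and the handling of over-long or surplus tokens (truncate and drop) is an assumption, since the module only documents that it reports a status in those cases.

```python
import string

MAX_TERM_LEN = 48         # bytes retained per normalized term (from the table above)
MAX_TOKENS_PER_DOC = 128  # tokens extracted from one document (from the table above)

# Assumption: "common ASCII punctuation" is modelled as string.punctuation.
SEPARATORS = set((string.whitespace + string.punctuation).encode("ascii"))


def tokenize_model(text: bytes) -> list:
    """Split on separators, ASCII-lowercase, and apply both per-document bounds."""
    terms = []
    current = bytearray()
    for byte in text + b" ":  # the trailing separator flushes the last token
        if byte in SEPARATORS:
            if current:
                if len(terms) < MAX_TOKENS_PER_DOC:
                    # Assumed behavior: keep only the first MAX_TERM_LEN bytes.
                    terms.append(bytes(current[:MAX_TERM_LEN]))
                current.clear()
        else:
            # Only A-Z is folded; non-ASCII bytes pass through unchanged.
            current.append(byte + 32 if 65 <= byte <= 90 else byte)
    return terms
```

Because only the bytes A-Z are folded, a caller that needs Unicode case folding or segmentation would have to normalize text before handing it to the index.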

The profile is bounded, but overflow is explicit. Use the status APIs when a caller must distinguish “term missing” from “result buffer too small” or “document rejected because the bounded profile was exceeded”.

Status	Meaning
`INDEX_OK`	Operation completed without truncation or capacity loss
`INDEX_DOC_CAPACITY_EXCEEDED`	The document table is full
`INDEX_TOKEN_CAPACITY_EXCEEDED`	Tokenization saw more than `MAX_TOKENS_PER_DOC` tokens
`INDEX_TERM_TRUNCATED`	A token exceeded `MAX_TERM_LEN`
`INDEX_TERM_CAPACITY_EXCEEDED`	The term dictionary is full
`INDEX_POSTING_CAPACITY_EXCEEDED`	A term posting list is full
`INDEX_RESULT_CAPACITY_EXCEEDED`	Query matched more docs than `out_buf` can hold
`INDEX_MISSING_TERM`	Query term was not indexed

Types

pub struct Index {
    terms:      [MAX_TERMS]TermEntry,
    term_count: u32,
    docs:       [MAX_DOCS]DocEntry,
    doc_count:  u32,
    total_len:  u64,
}

pub struct Bm25Stats {
    doc_count:       u64,
    total_token_len: u64,
}

Index is the whole index image. Put it on the stack, inside a larger store record, or behind a caller-owned allocation. The module does not retain any outside pointer to document text.

API

make_index

pub func make_index() -> Index

Returns a fresh empty index.

tokenize

pub func tokenize_status(
    text:      []const u8,
    out_terms: *[MAX_TOKENS_PER_DOC][MAX_TERM_LEN]u8,
    out_lens:  *[MAX_TOKENS_PER_DOC]u32,
    out_count: *u32,
) -> u32

pub func tokenize(
    text:      []const u8,
    out_terms: *[MAX_TOKENS_PER_DOC][MAX_TERM_LEN]u8,
    out_lens:  *[MAX_TOKENS_PER_DOC]u32,
) -> u32

Splits a byte string into ASCII-lowercased terms. The caller owns the fixed output arrays: out_terms receives the term bytes and out_lens receives the byte length of each term. tokenize_status writes the number of copied tokens to out_count and returns an INDEX_* status. tokenize is the compatibility wrapper that returns only the copied count and carries no status.

add_doc

pub func add_doc_status(
    idx:             *Index,
    doc_id:          u32,
    text:            []const u8,
    out_id:          *u32,
    out_token_count: *u32,
) -> u32

pub func add_doc(idx: *Index, doc_id: u32, text: []const u8, out_id: *u32) -> bool

Tokenizes text, stores document metadata, updates posting lists, and on success writes doc_id into out_id. add_doc_status is transactional for non-OK capacity statuses: it returns the relevant INDEX_* status without mutating idx. add_doc is the compatibility wrapper and returns false for any non-OK checked status.
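One way to get the "no mutation on a non-OK capacity status" behavior is to check every bound before writing anything. The Python model below sketches that check-then-commit pattern. It is an assumption about structure, not a description of the Janus source: add_doc_model, the dict-based index layout, the order of the checks, and the string status values are all hypothetical stand-ins.

```python
MAX_TERMS = 128
MAX_POSTINGS_PER_TERM = 64
MAX_DOCS = 128


def add_doc_model(index: dict, doc_id: int, terms: list) -> str:
    """Validate all capacity bounds first; mutate the index only if every check passes."""
    postings = index["postings"]  # term -> list of doc ids
    docs = index["docs"]          # doc id -> token count

    # Check phase: nothing below this comment writes to the index.
    if len(docs) >= MAX_DOCS:
        return "INDEX_DOC_CAPACITY_EXCEEDED"
    unique = sorted(set(terms))
    new_terms = [t for t in unique if t not in postings]
    if len(postings) + len(new_terms) > MAX_TERMS:
        return "INDEX_TERM_CAPACITY_EXCEEDED"
    for t in unique:
        if t in postings and len(postings[t]) >= MAX_POSTINGS_PER_TERM:
            return "INDEX_POSTING_CAPACITY_EXCEEDED"

    # Commit phase: every write is now known to fit.
    docs[doc_id] = len(terms)
    for t in unique:
        postings.setdefault(t, []).append(doc_id)
    return "INDEX_OK"
```

A rejected call leaves the index exactly as it was, so a caller can safely retry with a shorter document or route the document to a second index.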

query_term

pub func query_term_status(
    idx:       *const Index,
    term:      []const u8,
    out_buf:   []u32,
    out_count: *u32,
    out_total: *u32,
) -> u32

pub func query_term(
    idx:       *const Index,
    term:      []const u8,
    out_buf:   []u32,
    out_count: *u32,
) -> bool

Looks up one normalized term and copies matching document IDs into out_buf. query_term_status writes the copied count to out_count and the full posting-list size to out_total, and returns INDEX_RESULT_CAPACITY_EXCEEDED when the caller's buffer is too small. query_term is the compatibility wrapper; it returns false only when the term was never indexed, so it cannot signal a too-small out_buf.
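The relationship between out_count, out_total, and the returned status can be illustrated with a small Python model. The function name query_term_model and the string status values are hypothetical, and the model assumes the term has already been normalized by the caller.

```python
def query_term_model(postings: dict, term: str, out_buf: list) -> tuple:
    """Fill out_buf in place and return (status, out_count, out_total)."""
    matches = postings.get(term)
    if matches is None:
        return ("INDEX_MISSING_TERM", 0, 0)

    out_total = len(matches)                  # full posting-list size
    out_count = min(out_total, len(out_buf))  # how many ids actually fit
    out_buf[:out_count] = matches[:out_count]

    if out_count < out_total:
        return ("INDEX_RESULT_CAPACITY_EXCEEDED", out_count, out_total)
    return ("INDEX_OK", out_count, out_total)
```

When the status reports truncation, out_total tells the caller how large a buffer a retry would need. A buffer of MAX_POSTINGS_PER_TERM entries, as in the Quick Example, can never be too small.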

bm25_score

pub func bm25_score(idx: *const Index, doc_id: u32, term: []const u8) -> u32

Computes a fixed-point BM25-class score for a single (doc_id, term) pair. The return value is the score multiplied by 1000, so a return of 1500 represents 1.5. A missing document, missing term, or zero term frequency returns 0.
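To make the fixed-point convention concrete, the Python sketch below computes a textbook Okapi BM25 score and scales it by 1000. The module's exact formula and constants are not stated in this document, so k1 = 1.2, b = 0.75, the IDF variant, and the name bm25_score_model are all assumptions chosen only for illustration.

```python
import math

K1 = 1.2      # assumed saturation constant, not taken from the module
B = 0.75      # assumed length-normalization constant, not taken from the module
SCALE = 1000  # fixed-point scale described above


def bm25_score_model(docs: dict, doc_id: int, term: str) -> int:
    """docs maps doc_id -> list of normalized terms; returns score * 1000 as an int."""
    if doc_id not in docs:
        return 0  # missing document
    tokens = docs[doc_id]
    tf = tokens.count(term)
    if tf == 0:
        return 0  # missing term or zero term frequency

    n_docs = len(docs)
    df = sum(1 for t in docs.values() if term in t)       # documents containing term
    avg_len = sum(len(t) for t in docs.values()) / n_docs  # mean document length
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf + K1 * (1 - B + B * len(tokens) / avg_len)
    return int(idf * tf * (K1 + 1) / norm * SCALE)
```

Dividing the returned integer by 1000 recovers the approximate real-valued score. Under this model a term shared by every document scores lower than a term unique to one document, which is the ranking behavior a BM25-class scorer is expected to show.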

bm25_stats

pub func bm25_stats(idx: *const Index) -> Bm25Stats

Returns corpus-level counters used by the scorer and by smoke tests: doc_count and total_token_len.

Operational Boundary

std.text.index is the algorithmic core. It is useful for small in-memory indexes, compiler smoke tests, and future code-graph query substrates. It is not yet the persistent ASTDB search layer. Reserved key encoding, storage integration, and reopen behavior belong in a later store-backed facade.

Verification

The focused smoke target is:

./scripts/zb test-text-index

The smoke builds std/text/index_smoke.jan, adds two documents, verifies shared and unique term lookups, checks corpus stats, and checks that bm25_score returns non-zero for indexed hits and zero for misses.

The bounded-profile status smoke is:

./scripts/zb test-text-index-overflow

It verifies that token overflow, term truncation, term-capacity overflow, posting-list overflow, document-table overflow, and query-result truncation each produce an explicit INDEX_* status.