Build a Small Full-Text Index

This tutorial walks through the shipped std.text.index surface. You will build a small in-memory index, query a shared term, and compare BM25-class scores.

Time: 10 minutes
Level: Intermediate
Prerequisites: use imports, slices, arrays, and basic u32 counters.

Create the Index

Import the module with an alias and create the caller-owned index value.

use std.text.index as idx

pub func main() -> i32 do
    var index = idx.make_index()
    return 0
end

make_index returns the whole Index value. There is no allocator parameter because the current implementation is bounded and stores its arrays inside the struct.

Add Documents

Each document gets a caller-assigned u32 ID.

const doc0_text: []u8 = "sovereign identity mesh"
const doc1_text: []u8 = "mesh network protocol"

var doc0_id: u32 = 0
var doc1_id: u32 = 0

if idx.add_doc(&index, 0, doc0_text, &doc0_id) == false do
    return 1
end

if idx.add_doc(&index, 1, doc1_text, &doc1_id) == false do
    return 2
end

add_doc lowercases ASCII letters, splits on whitespace and common punctuation, records the document token count, and appends one posting per distinct term in the document.
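The normalization steps above can be modeled outside the module. The following Python sketch is not the std.text.index implementation: the PUNCTUATION set and the helper names tokenize and distinct_terms are hypothetical, chosen only to illustrate lowercasing ASCII letters, splitting on whitespace and punctuation, counting tokens, and keeping one entry per distinct term.

```python
# Hypothetical model of the normalization described above; not the module's code.
PUNCTUATION = set(".,;:!?()[]{}\"'-/")  # assumed set; the real one may differ


def tokenize(text: str) -> list[str]:
    """Split on whitespace and punctuation, lowercasing ASCII letters only."""
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        if ch.isspace() or ch in PUNCTUATION:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            if "A" <= ch <= "Z":
                ch = chr(ord(ch) + 32)  # ASCII-only lowercase
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def distinct_terms(tokens: list[str]) -> list[str]:
    """Keep the first occurrence of each term, preserving order."""
    seen: list[str] = []
    for token in tokens:
        if token not in seen:
            seen.append(token)
    return seen


tokens = tokenize("Mesh network, mesh PROTOCOL!")
print(tokens)                  # ['mesh', 'network', 'mesh', 'protocol']
print(len(tokens))             # 4 (the document token count)
print(distinct_terms(tokens))  # ['mesh', 'network', 'protocol']
```

In this model the document contributes a token count of 4 but only three postings, because the repeated term is recorded once.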

Query a Term

query_term writes document IDs into a caller-provided buffer.

var hits: [idx.MAX_POSTINGS_PER_TERM]u32 = .undefined
var hit_count: u32 = 0

if idx.query_term(&index, "mesh", hits[0..], &hit_count) == false do
    return 3
end

if hit_count != 2 do
    return 4
end

The query term is normalized the same way indexed terms are normalized. The result order is the posting-list order, so code that cares about ranking should score the hits explicitly.
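To see why result order should not be read as ranking, here is a toy inverted index in Python. It is a sketch under assumptions, not the module's data structure: the postings dictionary and both function names are hypothetical, normalization is reduced to lowercasing and whitespace splitting, and posting-list order is assumed to be insertion order.

```python
# Toy model: each term maps to a list of document IDs in insertion order.
postings: dict[str, list[int]] = {}


def add_doc(doc_id: int, text: str) -> None:
    """Append one posting per distinct term in the document."""
    seen: set[str] = set()
    for term in text.lower().split():
        if term not in seen:
            seen.add(term)
            postings.setdefault(term, []).append(doc_id)


def query_term(term: str) -> list[int]:
    """Normalize the query like indexed terms, then return the posting list."""
    return list(postings.get(term.lower(), []))


add_doc(0, "sovereign identity mesh")
add_doc(1, "mesh network protocol")

print(query_term("mesh"))    # [0, 1]
print(query_term("MESH"))    # [0, 1]
print(query_term("ledger"))  # []
```

The hits come back in the order the documents were added, which says nothing about relevance.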

Score Matches

Use bm25_score when term frequency and document length should affect ranking.

let score0 = idx.bm25_score(&index, 0, "mesh")
let score1 = idx.bm25_score(&index, 1, "mesh")

if score0 == 0 do
    return 5
end
if score1 == 0 do
    return 6
end

Scores are fixed-point integers scaled by 1000. The exact value is an implementation detail; the stable contract is that a missing document, a missing term, or a term frequency of zero returns 0.
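For intuition about a score scaled by 1000, the Python sketch below computes a textbook BM25 term score and converts it to a fixed-point integer. Everything specific here is an assumption rather than the module's behavior: the constants K1 = 1.2 and B = 0.75, the IDF variant, the use of floating point before rounding, and the function name bm25_fixed.

```python
import math

SCALE = 1000  # fixed-point scale mentioned in the text
K1 = 1.2      # assumed; a common default, not confirmed for std.text.index
B = 0.75      # assumed; a common default, not confirmed for std.text.index


def bm25_fixed(tf: int, doc_len: int, avg_doc_len: float,
               doc_count: int, doc_freq: int) -> int:
    """Return a BM25 term score as an integer scaled by SCALE."""
    # Mirror the stated contract: nothing to score means 0.
    if tf == 0 or doc_count == 0 or doc_freq == 0 or avg_doc_len == 0:
        return 0
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = tf + K1 * (1.0 - B + B * doc_len / avg_doc_len)
    return int(round(SCALE * idf * tf * (K1 + 1.0) / norm))


# "mesh" appears once in each of two 3-token documents.
print(bm25_fixed(tf=1, doc_len=3, avg_doc_len=3.0, doc_count=2, doc_freq=2))  # 182
# Zero term frequency scores 0.
print(bm25_fixed(tf=0, doc_len=3, avg_doc_len=3.0, doc_count=2, doc_freq=2))  # 0
```

Under these assumptions a term that appears in every document still scores above zero, which is consistent with the tutorial checking only that each score is nonzero rather than comparing against an exact value.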

Inspect Corpus Stats

const stats = idx.bm25_stats(&index)
if stats.doc_count != 2 do
    return 7
end

total_token_len is the sum of the token counts of all indexed documents. Divided by doc_count, it gives the average document length used by BM25 normalization.
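As a worked instance with the two documents from this tutorial, each of which has three tokens, the arithmetic looks like the Python sketch below. The variable names mirror the fields mentioned in the text, but storing the average as an integer scaled by 1000 is an assumption made for illustration.

```python
# Token counts for "sovereign identity mesh" and "mesh network protocol".
doc_token_counts = [3, 3]

doc_count = len(doc_token_counts)
total_token_len = sum(doc_token_counts)

# Assumed representation: average length as an integer scaled by 1000.
avg_len_scaled = total_token_len * 1000 // doc_count if doc_count else 0

print(doc_count, total_token_len, avg_len_scaled)  # 2 6 3000
```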

Complete Program

use std.text.index as idx

pub func main() -> i32 do
    var index = idx.make_index()

    const doc0_text: []u8 = "sovereign identity mesh"
    const doc1_text: []u8 = "mesh network protocol"

    var doc0_id: u32 = 0
    var doc1_id: u32 = 0

    if idx.add_doc(&index, 0, doc0_text, &doc0_id) == false do
        return 1
    end
    if idx.add_doc(&index, 1, doc1_text, &doc1_id) == false do
        return 2
    end

    var hits: [idx.MAX_POSTINGS_PER_TERM]u32 = .undefined
    var hit_count: u32 = 0
    if idx.query_term(&index, "mesh", hits[0..], &hit_count) == false do
        return 3
    end
    if hit_count != 2 do
        return 4
    end

    let score0 = idx.bm25_score(&index, 0, "mesh")
    let score1 = idx.bm25_score(&index, 1, "mesh")
    if score0 == 0 do
        return 5
    end
    if score1 == 0 do
        return 6
    end

    const stats = idx.bm25_stats(&index)
    if stats.doc_count != 2 do
        return 7
    end

    return 0
end

Run the repository smoke test for the canonical version:

cd janus
./scripts/zb test-text-index

Boundary

Do not treat this module as persistent search. The current index is an in-memory core with fixed capacities. Store-backed keying, reopen behavior, and ASTDB query integration belong in a later facade.