
:compute — The Foundry

“Hardware, meet your match.”

:compute is where Janus meets the metal. Everything from :core — plus first-class support for tensors, device streams, and hardware acceleration. Run local LLMs, do scientific computing, process video in real-time. All with the same language you use everywhere else.


let weights := tensor<f32, [4096, 4096]>.load("model.bin")
let input := tensor<f32, [1, 4096]>.on(.vram)
let result := matmul(input, weights)
    .quantize(.qvl)
    .on(.npu)
print("Inference done in ${result.latency_ms}ms")
  • Shape inference — the type tracks tensor dimensions at compile time
  • Device targeting — .on(.cpu), .on(.gpu), .on(.npu)
  • Quantization — QVL, INT8, FP16, BF16 support
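A sketch of what compile-time shape checking buys you, assuming matmul requires the inner dimensions to agree (the `.zeros()` constructor follows the particle example below; the exact diagnostic wording will differ):

let a := tensor<f32, [1, 4096]>.zeros()
let b := tensor<f32, [2048, 512]>.zeros()
# Inner dimensions (4096 vs 2048) don't match —
# this is rejected at compile time, not discovered at runtime
let bad := matmul(a, b)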
func process_batch(data: tensor<f32, [32, 1024]>.on(.vram)) do
    # This runs on the GPU
    let result := data
        |> layer_norm()
        |> attention()
        |> feed_forward()
    # Copy back to CPU for output
    return result.on(.cpu)
end
  • .on(.sram) — Fast on-chip SRAM (embedded)
  • .on(.dram) — Main system memory
  • .on(.vram) — GPU/accelerator memory
  • .on(.shared) — Unified memory (when available)
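A minimal sketch of moving a tensor between memory spaces with `.on(...)`, under the assumption that each call yields the tensor in the target space (a copy when needed, a no-op when it is already there):

let host := tensor<f32, [1024]>.zeros()   # starts in main memory
let dev := host.on(.vram)                 # copy to GPU memory
let back := (dev * 2.0).on(.dram)         # compute on device, copy result back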
# Async GPU operations
let stream := DeviceStream.on(.gpu)
stream.launch(kernel_1024, blocks: 64, threads: 256)
stream.synchronize()
let output := tensor<f32, [1024]>.on(.gpu)
# Extract the compute graph before lowering
let graph := extract_jir(my_inference_func)
let optimized := graph.optimize(.fusion, .constant_fold)
  • Optimize before hitting hardware
  • Fuse operations for maximum throughput
  • Constant fold at compile time

Excluded              Available In
Actors and grains     :cluster
Supervision trees     :cluster
Effects system        :sovereign
Raw pointers          :sovereign

Perfect for:

  • AI inference (local LLMs, image classification, voice)
  • Scientific computing (physics, chemistry, climate models)
  • Signal processing and DSP
  • GPU compute shaders
  • Real-time video/image processing
  • Matrix operations at scale

The rule: If you’re doing the same operation on thousands of data points, :compute is your friend.
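A minimal sketch of that rule in practice — one elementwise expression over a large tensor, dispatched to the accelerator (the constructors and `.on(...)` calls follow the examples in this page; the fusion behavior is an assumption):

# Same operation across a million points: a good fit for :compute
let samples := tensor<f32, [1_000_000]>.random()
let scaled := (samples * 0.5 + 1.0).on(.gpu)   # ideally one fused elementwise kernel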


func main() do
    let model := LLModel.load("llama-7b-q4.bin")
        .quantize(.q4_0)
        .on(.npu)
    let prompt := "Write a haiku about sovereignty"
    let tokens := model.tokenize(prompt)
    let result := model.generate(tokens, max_tokens: 100)
        .temperature(0.7)
        .on(.npu)
    print(model.decode(result.tokens))
end
func process_images(input_dir: String, output_dir: String) do
    let images := glob("${input_dir}/*.png")
    let batch := images
        |> load_batch(32)
        |> normalize(0.0, 255.0)
        .on(.gpu)
    let features := batch
        |> resize(224, 224)
        |> apply_model(resnet50)
        .on(.gpu)
    let embeddings := features
        |> flatten()
        .on(.cpu)
    save_embeddings(embeddings, "${output_dir}/features.npy")
end
func simulate_particles(count: usize) tensor<f32, [count, 3]> do
    let dt := 0.01  # integration time step
    let positions := tensor<f32, [count, 3]>.random_uniform(-10.0, 10.0)
    let velocities := tensor<f32, [count, 3]>.zeros()
    for step in 0..1000 do
        # Compute forces
        let forces := compute_forces(positions)
        # Update velocities
        velocities = velocities + forces * dt
        # Update positions
        positions = positions + velocities * dt
        # Boundary conditions
        positions = clamp(positions, -10.0, 10.0)
    end
    return positions
end
func main() do
    let a := tensor<f32, [1024, 2048]>.random()
    let b := tensor<f32, [2048, 512]>.random()
    # Matrix multiplication on GPU
    let c := matmul(a, b).on(.gpu)
    # Element-wise operations
    let d := c * 2.0 + 1.0
    # Reduction
    let sum := d.sum()
    print("Sum: ${sum}")
end

vs. Python (NumPy/PyTorch):

  • Compile-time shape checking — catch dimension mismatches before running
  • Zero-copy where possible — no unnecessary data movement
  • Single deployment — no Python runtime, no CUDA dependencies
  • Single language — same Janus for data loading, processing, serving

vs. CUDA/C++:

  • Productivity — write kernels in high-level Janus
  • Safety — memory spaces prevent illegal accesses
  • Portability — same code targets CPU, GPU, NPU

vs. Julia:

  • Stability — ahead-of-time compilation gives predictable performance, no type-instability cliffs
  • Deployment — a single static binary, no JIT warmup
  • Ecosystem — same package manager as everything else


Make the metal sing.