
:compute — The Foundry

“Hardware, meet your match.”

:compute is where Janus meets the metal. Everything from :core — plus first-class support for tensors, device streams, and hardware acceleration. Run local LLMs, do scientific computing, process video in real-time. All with the same language you use everywhere else.


let weights := tensor<f32, [4096, 4096]>.load("model.bin")
let input := tensor<f32, [1, 4096]>.on(.vram)
let result := matmul(input, weights)
    .quantize(.qvl)
    .on(.npu)
print("Inference done in ${result.latency_ms}ms")
  • Shape inference — the type tracks tensor dimensions at compile time
  • Device targeting — .on(.cpu), .on(.gpu), .on(.npu)
  • Quantization — QVL, INT8, FP16, BF16 support
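A sketch of what compile-time shape checking buys you, assuming matmul requires the inner dimensions to agree (the `.zeros()` constructor follows the particle example below; the exact diagnostic wording will differ):

let a := tensor<f32, [1, 4096]>.zeros()
let b := tensor<f32, [2048, 512]>.zeros()
# Inner dimensions (4096 vs 2048) don't match —
# this is rejected at compile time, not discovered at runtime
let bad := matmul(a, b)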
func process_batch(data: tensor<f32, [32, 1024]>.on(.vram)) do
    # This runs on the GPU
    let result := data
        |> layer_norm()
        |> attention()
        |> feed_forward()
    # Copy back to CPU for output
    return result.on(.cpu)
end
  • .on(.sram) — Fast on-chip SRAM (embedded)
  • .on(.dram) — Main system memory
  • .on(.vram) — GPU/accelerator memory
  • .on(.shared) — Unified memory (when available)
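A minimal sketch of moving a tensor between memory spaces with `.on(...)`, under the assumption that each call yields the tensor in the target space (a copy when needed, a no-op when it is already there):

let host := tensor<f32, [1024]>.zeros()   # starts in main memory
let dev := host.on(.vram)                 # copy to GPU memory
let back := (dev * 2.0).on(.dram)         # compute on device, copy result back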
# Async GPU operations
let stream := DeviceStream.on(.gpu)
stream.launch(kernel_1024, blocks: 64, threads: 256)
stream.synchronize()
let output := tensor<f32, [1024]>.on(.gpu)
# Extract the compute graph before lowering
let graph := extract_jir(my_inference_func)
let optimized := graph.optimize(.fusion, .constant_fold)
  • Optimize before hitting hardware
  • Fuse operations for maximum throughput
  • Constant fold at compile time

Excluded              Available In
Actors and grains     :cluster
Supervision trees     :cluster
Effects system        :sovereign
Raw pointers          :sovereign

Perfect for:

  • AI inference (local LLMs, image classification, voice)
  • Scientific computing (physics, chemistry, climate models)
  • Signal processing and DSP
  • GPU compute shaders
  • Real-time video/image processing
  • Matrix operations at scale

The rule: If you’re doing the same operation on thousands of data points, :compute is your friend.
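A minimal sketch of that rule in practice — one elementwise expression over a large tensor, dispatched to the accelerator (the constructors and `.on(...)` calls follow the examples in this page; the fusion behavior is an assumption):

# Same operation across a million points: a good fit for :compute
let samples := tensor<f32, [1_000_000]>.random()
let scaled := (samples * 0.5 + 1.0).on(.gpu)   # ideally one fused elementwise kernel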


func main() do
    let model := LLModel.load("llama-7b-q4.bin")
        .quantize(.q4_0)
        .on(.npu)
    let prompt := "Write a haiku about sovereignty"
    let tokens := model.tokenize(prompt)
    let result := model.generate(tokens, max_tokens: 100)
        .temperature(0.7)
        .on(.npu)
    print(model.decode(result.tokens))
end
func process_images(input_dir: String, output_dir: String) do
    let images := glob("${input_dir}/*.png")
    let batch := images
        |> load_batch(32)
        |> normalize(0.0, 255.0)
        .on(.gpu)
    let features := batch
        |> resize(224, 224)
        |> apply_model(resnet50)
        .on(.gpu)
    let embeddings := features
        |> flatten()
        .on(.cpu)
    save_embeddings(embeddings, "${output_dir}/features.npy")
end
func simulate_particles(count: usize) tensor<f32, [count, 3]> do
    let dt := 0.01  # integration time step
    let positions := tensor<f32, [count, 3]>.random_uniform(-10.0, 10.0)
    let velocities := tensor<f32, [count, 3]>.zeros()
    for step in 0..1000 do
        # Compute forces
        let forces := compute_forces(positions)
        # Update velocities
        velocities = velocities + forces * dt
        # Update positions
        positions = positions + velocities * dt
        # Boundary conditions
        positions = clamp(positions, -10.0, 10.0)
    end
    return positions
end
func main() do
    let a := tensor<f32, [1024, 2048]>.random()
    let b := tensor<f32, [2048, 512]>.random()
    # Matrix multiplication on GPU
    let c := matmul(a, b).on(.gpu)
    # Element-wise operations
    let d := c * 2.0 + 1.0
    # Reduction
    let sum := d.sum()
    print("Sum: ${sum}")
end

vs. Python (NumPy/PyTorch):

  • Compile-time shape checking — catch dimension mismatches before running
  • Zero-copy where possible — no unnecessary data movement
  • Single deployment — no Python runtime, no CUDA dependencies
  • Single language — same Janus for data loading, processing, serving

vs. CUDA/C++:

  • Productivity — write kernels in high-level Janus
  • Safety — memory spaces prevent illegal accesses
  • Portability — same code targets CPU, GPU, NPU

vs. Julia:

  • Stability — ahead-of-time compilation gives predictable performance, no type-instability cliffs
  • Deployment — a single static binary, no JIT warmup
  • Ecosystem — same package manager as everything else


Make the metal sing.