Architecture

Orchard's inference stack is built from the bottom up for Apple Silicon.

The Stack

┌─────────────────────────────────────────────────┐
│  Client Libraries (orchard-py, orchard-rs)       │
│  OpenAI-compatible API, streaming, tool calling  │
├─────────────────────────────────────────────────┤
│  PIE (Proxy Inference Engine)                    │
│  C++23 — core execution, batching, scheduling    │
├─────────────────────────────────────────────────┤
│  PSE (Proxy State Engine)                        │
│  Structured generation — grammars, JSON schemas  │
├─────────────────────────────────────────────────┤
│  PAL (Proxy Attention Lab)                       │
│  Custom Metal kernels — paged attention, RoPE    │
├─────────────────────────────────────────────────┤
│  Carbon (MLX fork)                               │
│  Array operations, JIT Metal compilation         │
└─────────────────────────────────────────────────┘

PIE — Proxy Inference Engine

The core inference server, built from scratch for Apple Silicon.

Key Capabilities

  • Continuous batching — dynamically packs prefill and decode sequences into batches each step, so multiple agents can share one model
  • Paged KV cache — fixed-size pages (16 tokens) managed by a page allocator, with support for multi-tier layouts for architectures like Gemma 3
  • Prompt caching — reuses KV cache pages across requests with shared prefixes
  • Streaming — token-by-token streaming with structured events
  • Multimodal — vision model support (image preprocessing, embedding)
  • Tool calling — native function calling with structured output
  • Carbon JIT — Metal kernels embedded as string literals at build time, JIT-compiled at runtime
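The paged KV cache and prompt caching bullets above can be sketched together: pages come from a free list, and a request whose prefix matches an existing sequence shares pages by reference count instead of recomputing them. This is a minimal illustrative sketch, not PIE's actual API; the names `PageAllocator` and `pages_needed` are assumptions, and only the 16-token page size comes from the text.

```python
PAGE_SIZE = 16  # tokens per KV cache page, per the description above

class PageAllocator:
    """Toy fixed-size page allocator with refcounted prefix sharing."""

    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.refcount = {}  # physical page id -> number of sequences using it

    def alloc(self):
        page = self.free.pop()
        self.refcount[page] = 1
        return page

    def share(self, page):
        # Prompt caching: a request with a shared prefix bumps the
        # refcount on the prefix's pages instead of recomputing them.
        self.refcount[page] += 1

    def release(self, page):
        self.refcount[page] -= 1
        if self.refcount[page] == 0:
            del self.refcount[page]
            self.free.append(page)  # page returns to the free list

def pages_needed(num_tokens):
    # Ceiling division: a 17-token prompt occupies two 16-token pages.
    return -(-num_tokens // PAGE_SIZE)
```

The refcount is what makes prefix reuse safe: a page is recycled only after every sequence referencing it has finished.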

Execution Flow

Request (IPC)
  → Tokenize
  → Schedule (ARScheduler)
  → Batch pack (prefill + decode sequences)
  → Forward pass (model layers + Metal compute)
  → Logit processing (repetition, bias, temperature, top_k, top_p, min_p)
  → Sampling
  → PSE constraint check
  → Stream token back
  → Repeat until done
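The tail of the flow above (logit processing, the PSE constraint check, then sampling) can be sketched as a pipeline. This is a toy stand-in with greedy sampling and only two logit processors; the real engine runs these stages over packed GPU batches, and the function names here are illustrative, not PIE's.

```python
import math

def apply_temperature(logits, temperature=0.7):
    # Temperature rescales logits; < 1.0 sharpens the distribution.
    return [l / temperature for l in logits]

def top_k(logits, k=2):
    # Keep the k highest logits; mask the rest to -inf.
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [l if l >= cutoff else -math.inf for l in logits]

def decode_step(logits, pse_mask):
    logits = apply_temperature(logits)
    logits = top_k(logits)
    # PSE constraint check: tokens the state machine forbids are
    # masked out before sampling, so they can never be emitted.
    logits = [l if ok else -math.inf for l, ok in zip(logits, pse_mask)]
    # Greedy sampling stands in for the real sampler here.
    return max(range(len(logits)), key=lambda i: logits[i])
```

Note the ordering: constraints are applied after the other logit processors, so masking always wins.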

PSE — Proxy State Engine

The structured generation engine. PSE augments language models at runtime, steering token generation to produce valid output without sacrificing creative capability.

What PSE Does

  • JSON schema enforcement — guarantees model output conforms to a JSON schema
  • Grammar constraints — arbitrary context-free grammars
  • Tool call formatting — ensures function calls have valid names and arguments
  • State tracking — tracks where the model is in a behavioral graph (thinking states, answering states, coordinate generation)

How It Works

PSE builds a hierarchical state machine from the constraint specification. At each token generation step, it produces a bitmask of valid next tokens. The bitmask is applied to the logits before sampling — invalid tokens get negative infinity, valid tokens pass through unchanged.

This is alignment through architecture, not fine-tuning. The model is structurally incapable of producing invalid output for the constrained portions.
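The bitmask mechanism can be shown with a deliberately tiny state machine. PSE derives its states hierarchically from schemas and grammars; the flat two-state `TRANSITIONS` table below is a hand-built assumption for illustration only.

```python
import math

VOCAB = ['{', '}', '"key"', ':', '1']

# state -> set of tokens valid in that state (hand-written here;
# PSE builds this from the constraint specification)
TRANSITIONS = {
    'start': {'{'},
    'object': {'"key"', '}'},
}

def valid_mask(state):
    # One boolean per vocabulary entry: the bitmask for this step.
    allowed = TRANSITIONS[state]
    return [tok in allowed for tok in VOCAB]

def apply_mask(logits, mask):
    # Invalid tokens get -inf; valid tokens pass through unchanged.
    return [l if ok else -math.inf for l, ok in zip(logits, mask)]
```

Even if the model's raw logits prefer an invalid token, the masked argmax can only land on a valid one, which is the "structurally incapable" guarantee described above.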

PAL — Proxy Attention Lab

Custom Metal GPU kernels for attention computation. PAL implements paged attention directly on Apple Silicon's GPU, bypassing the generic compute paths.

Key kernels:

  • Paged attention — efficient attention over non-contiguous KV cache pages
  • RoPE — rotary position embedding computation
  • Memory management — GPU-side page table operations
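The core idea of paged attention is that keys and values for one sequence live in non-contiguous physical pages, reached through a page table. A minimal CPU-side sketch, with tiny 2-token pages for readability (PAL's pages hold 16 tokens), under the assumption that the kernel conceptually gathers by logical position; the real kernel does this inside Metal threadgroups rather than materializing a contiguous key list.

```python
import math

PAGE_SIZE = 2  # tiny pages for illustration only

def gather_keys(page_table, kv_pool, seq_len):
    # page_table maps logical block index -> physical page id.
    keys = []
    for logical in range(seq_len):
        page = page_table[logical // PAGE_SIZE]
        keys.append(kv_pool[page][logical % PAGE_SIZE])
    return keys

def attention_scores(query, keys):
    # Scaled dot-product scores, then a numerically stable softmax.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The indirection through `page_table` is exactly what lets the allocator hand out pages in any order while attention still sees tokens in logical sequence order.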

Carbon

A private fork of Apple's MLX framework. Carbon extends MLX with:

  • Multi-stream concurrency — multiple compute streams for overlapping operations
  • compile_with_stream — JIT compilation scoped to specific compute streams
  • Epoch-based buffer safety — prevents use-after-free in concurrent execution

Carbon is the foundation — all array operations, Metal compilation, and GPU scheduling flow through it.
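Epoch-based buffer safety can be illustrated with a toy reclaimer: a buffer retired while GPU work may still read it is not freed immediately, but deferred until the epoch advances past the point where that work could exist. This is a single-counter sketch under assumed semantics; Carbon's actual mechanism lives inside its Metal allocator and is not public API.

```python
class EpochAllocator:
    """Toy epoch-based deferred reclamation for GPU buffers."""

    def __init__(self):
        self.current_epoch = 0
        self.retired = []  # (epoch, buffer) pairs awaiting a safe free
        self.freed = []

    def retire(self, buffer):
        # In-flight GPU work from the current epoch may still read this
        # buffer, so defer the free instead of releasing immediately.
        self.retired.append((self.current_epoch, buffer))

    def advance_epoch(self):
        # Called once all GPU work submitted in prior epochs has completed.
        self.current_epoch += 1
        still_retired = []
        for epoch, buffer in self.retired:
            if epoch < self.current_epoch:
                self.freed.append(buffer)  # provably unreachable now
            else:
                still_retired.append((epoch, buffer))
        self.retired = still_retired
```

Deferring the free until an epoch boundary is what prevents use-after-free when multiple compute streams overlap on the same buffers.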