Large Language Models · Open weights

Open-weight LLMs, fine-tuned and deployed where you need them.

Llama 3.3, Qwen 3, DeepSeek R1, Mistral Large - picked, quantized, fine-tuned, and served on the infrastructure that fits your privacy and latency budget. Not API wrappers. Real deployments.

70B
models running on DGX Spark today
Sub-25ms
first-token on A100 (AWQ 4-bit)
60%
typical cost cut vs hosted GPT-4
100%
data stays in your VPC
The lineup

Models we deploy in production

The right model depends on the workload, not the marketing. We benchmark per project on your data before recommending.

Meta

Llama 3.3 70B

Meta

Frontier-quality open weights. Our default for general-purpose chat, agents, and RAG when 70B fits.

70BMultilingualTool use
Qwen

Qwen 3 (8B / 32B / 72B)

Alibaba

Excellent reasoning + tool use. Strong multilingual coverage including CJK. Best long-context support.

ReasoningTool use128k context
DeepSeek

DeepSeek R1 / V3

DeepSeek

Frontier reasoning model with chain-of-thought. Best open weights for math, code, complex agents.

ReasoningMoECost-efficient
Mistral

Mistral Large / Nemo

Mistral

European licence-friendly weights. Nemo (12B) is a sweet spot for edge deployment.

EU-friendlyFunction calling
Meta

Llama 3.2 (1B / 3B)

Meta

Tiny models for on-device deployment. Surprisingly capable for classification, routing, summarisation.

EdgeOn-deviceTiny
Azure

Phi-3.5 · Gemma 2

Microsoft / Google

Small models with strong instruction-following. Useful when latency and memory dominate over peak quality.

SmallDistilled
Quantization

How to make 70B fit

Quantization is the difference between needing a datacenter and fitting on a workstation. Every format trades memory against quality - here's our reference table for a 70B model.

FormatBitsVRAM (70B)Quality lossBest for
FP1616~140 GBNone (baseline)Datacenter, full fidelity
FP88~70 GBNegligibleH100, fast + high quality
AWQ4~40 GBVery lowProduction serving (vLLM)
GPTQ4~40 GBLowWorkstation, RTX 4090
GGUF Q4_K_M4-ish~42 GBLowllama.cpp, on-device, CPU+GPU
GGUF Q2_K2-ish~24 GBNoticeableAggressive memory savings
Serving stack

The runtime matters as much as the model

A 70B model on the wrong serving stack can be 5× slower than on the right one. We pick the runtime to match the workload.

vLLM

Most LLM serving

Our default for high-throughput inference. PagedAttention, continuous batching, OpenAI-compatible API. Production-grade.

TensorRT-LLM

Latency-critical, Nvidia-only

Nvidia's compiled inference engine. Best raw latency and throughput on H100/H200, at the cost of build complexity.

SGLang

Agents, structured output

Strong on structured generation, agent loops, and complex prompt workflows. Excellent constrained decoding.

llama.cpp

Edge, on-device, CPU

CPU + GPU hybrid runtime, GGUF format. The only serious choice for desktop, edge, and CPU-only deployments.

Use cases we ship

What customers actually build

Private RAG

Internal docs Q&A over a VPC-bound model.

Agents

Tool-using workflows with deterministic structured output.

Domain fine-tunes

LoRA / QLoRA on your data for a vertical model.

Edge / on-device

Llama 3.2 1B/3B and Phi-3.5 on Jetson and laptops.

Pick the right LLM for your use case.

Tell us your latency, memory, and accuracy targets. We'll come back with two or three models that fit - and the deployment plan.

Talk to us about LLMs
See case studies