Models we deploy in production
The right model depends on the workload, not the marketing. We benchmark per project on your data before recommending.
Llama 3.3 70B
MetaFrontier-quality open weights. Our default for general-purpose chat, agents, and RAG when 70B fits.
Qwen 3 (8B / 32B / 72B)
AlibabaExcellent reasoning + tool use. Strong multilingual coverage including CJK. Best long-context support.
DeepSeek R1 / V3
DeepSeekFrontier reasoning model with chain-of-thought. Best open weights for math, code, complex agents.
Mistral Large / Nemo
MistralEuropean licence-friendly weights. Nemo (12B) is a sweet spot for edge deployment.
Llama 3.2 (1B / 3B)
MetaTiny models for on-device deployment. Surprisingly capable for classification, routing, summarisation.
Phi-3.5 · Gemma 2
Microsoft / GoogleSmall models with strong instruction-following. Useful when latency and memory dominate over peak quality.
How to make 70B fit
Quantization is the difference between needing a datacenter and fitting on a workstation. Every format trades memory against quality - here's our reference table for a 70B model.
| Format | Bits | VRAM (70B) | Quality loss | Best for |
|---|---|---|---|---|
| FP16 | 16 | ~140 GB | None (baseline) | Datacenter, full fidelity |
| FP8 | 8 | ~70 GB | Negligible | H100, fast + high quality |
| AWQ | 4 | ~40 GB | Very low | Production serving (vLLM) |
| GPTQ | 4 | ~40 GB | Low | Workstation, RTX 4090 |
| GGUF Q4_K_M | 4-ish | ~42 GB | Low | llama.cpp, on-device, CPU+GPU |
| GGUF Q2_K | 2-ish | ~24 GB | Noticeable | Aggressive memory savings |
The runtime matters as much as the model
A 70B model on the wrong serving stack can be 5× slower than on the right one. We pick the runtime to match the workload.
vLLM
Most LLM servingOur default for high-throughput inference. PagedAttention, continuous batching, OpenAI-compatible API. Production-grade.
TensorRT-LLM
Latency-critical, Nvidia-onlyNvidia's compiled inference engine. Best raw latency and throughput on H100/H200, at the cost of build complexity.
SGLang
Agents, structured outputStrong on structured generation, agent loops, and complex prompt workflows. Excellent constrained decoding.
llama.cpp
Edge, on-device, CPUCPU + GPU hybrid runtime, GGUF format. The only serious choice for desktop, edge, and CPU-only deployments.
What customers actually build
Private RAG
Internal docs Q&A over a VPC-bound model.
Agents
Tool-using workflows with deterministic structured output.
Domain fine-tunes
LoRA / QLoRA on your data for a vertical model.
Edge / on-device
Llama 3.2 1B/3B and Phi-3.5 on Jetson and laptops.