AI Development & Enterprise AI Solutions

The lineup

Models we deploy in production

The right model depends on the workload, not the marketing. We benchmark per project on your data before recommending.

Llama 3.3 70B

Qwen 3 (8B / 32B / 72B)

Alibaba

Excellent reasoning + tool use. Strong multilingual coverage including CJK. Best long-context support.

ReasoningTool use128k context

DeepSeek R1 / V3

DeepSeek

Frontier reasoning model with chain-of-thought. Best open weights for math, code, complex agents.

ReasoningMoECost-efficient

Mistral Large / Nemo

Mistral

European licence-friendly weights. Nemo (12B) is a sweet spot for edge deployment.

EU-friendlyFunction calling

Llama 3.2 (1B / 3B)

Phi-3.5 · Gemma 2

Microsoft / Google

Small models with strong instruction-following. Useful when latency and memory dominate over peak quality.

SmallDistilled

Quantization

How to make 70B fit

Quantization is the difference between needing a datacenter and fitting on a workstation. Every format trades memory against quality - here's our reference table for a 70B model.

Format	Bits	VRAM (70B)	Quality loss	Best for
FP16	16	~140 GB	None (baseline)	Datacenter, full fidelity
FP8	8	~70 GB	Negligible	H100, fast + high quality
AWQ	4	~40 GB	Very low	Production serving (vLLM)
GPTQ	4	~40 GB	Low	Workstation, RTX 4090
GGUF Q4_K_M	4-ish	~42 GB	Low	llama.cpp, on-device, CPU+GPU
GGUF Q2_K	2-ish	~24 GB	Noticeable	Aggressive memory savings

Serving stack

The runtime matters as much as the model

A 70B model on the wrong serving stack can be 5× slower than on the right one. We pick the runtime to match the workload.

vLLM

Most LLM serving

Our default for high-throughput inference. PagedAttention, continuous batching, OpenAI-compatible API. Production-grade.

TensorRT-LLM

Latency-critical, Nvidia-only

Nvidia's compiled inference engine. Best raw latency and throughput on H100/H200, at the cost of build complexity.

SGLang

Agents, structured output

Strong on structured generation, agent loops, and complex prompt workflows. Excellent constrained decoding.

llama.cpp

Edge, on-device, CPU

CPU + GPU hybrid runtime, GGUF format. The only serious choice for desktop, edge, and CPU-only deployments.

Use cases we ship

What customers actually build

Private RAG

Internal docs Q&A over a VPC-bound model.

Agents

Tool-using workflows with deterministic structured output.

Domain fine-tunes

LoRA / QLoRA on your data for a vertical model.

Edge / on-device

Llama 3.2 1B/3B and Phi-3.5 on Jetson and laptops.

Proof

A case study from the rotation

Legal Tech & Compliance

Domain-adapted 7B LLM cuts inference costs by 60% vs hosted GPT-4

A legal services portal needed contract-analysis AI inside a private VPC. We fine-tuned Llama-3-8B on 12M lines of contract data and shipped it on their own A100 cluster.

Read the full case study

Open-weight LLMs, fine-tuned and deployed where you need them.