Frontier models, routed
Reach Claude from Anthropic, GPT from OpenAI, Grok from xAI, and high-throughput open-model inference from Cerebras, all behind one interface. Each call goes to the model that wins on quality, latency, or cost for that task.
Agent Inference
Agent Inference is the model layer beneath HQ. Every agent reaches frontier models from Anthropic, OpenAI, xAI, and Cerebras through one interface, routed to whichever fits the task, alongside the open models we serve on our own GPUs. The cheap, constant, and sensitive work, embeddings and reranking included, stays on hardware we run, never shipped to someone else's API.
Why Agent Inference
Pick a single model vendor and you inherit their outages, their price changes, their rate limits, and their roadmap. The best model for deep reasoning, for fast triage, for cheap bulk work, and for a sovereign deployment is rarely the same model, and the ranking shifts every few months.
There is a quieter problem underneath. Most agent platforms ship your data to a third-party API just to embed or rerank it, which are the highest-volume calls they make. Your most sensitive text leaves your provider for a step you never see.
Agent Inference answers both: a provider-agnostic layer that routes every call to the model that fits, and serves the high-volume and data-sensitive paths on hardware we run ourselves.
Capabilities
Reach Claude from Anthropic, GPT from OpenAI, Grok from xAI, and high-throughput open-model inference from Cerebras, all behind one interface. Each call goes to the model that wins on quality, latency, or cost for that task.
We run open-weight models, gpt-oss-120b among them, on our own GPUs. The cheap, high-volume, and data-sensitive paths never have to leave our infrastructure to get answered.
Give each agent the model that fits its job: a frontier model for hard reasoning, a fast small one for triage. Run a fleet where every member is on a different model at once.
The vectors behind memory and search come from an embedding model we host, so the text being indexed is never sent to an outside API to be turned into numbers.
A cross-encoder reranking model, also self-hosted, does the final relevance pass on every memory and corpus lookup, sharpening recall with no third-party round-trip.
Large non-interactive jobs run through a batch lane at up to roughly half the cost, for work that can wait a while rather than answer in the moment.
If a provider errors or goes down, the call retries against a healthy one inside the same request, so a vendor outage does not become your outage.
Each agent runs on a vendor's official agent loop, the Claude Agent SDK, the OpenAI Agents SDK, or the open OpenCode runtime, never a hand-rolled wrapper. The runtime owns tool dispatch, MCP wiring, and session state; the model layer just hands it the model it asked for.
The model and the runtime are two separate dials: any model can run under any of those runtimes, so each choice is made on its own merits and never forced by the other.
Built different
The model layer is a clean trait surface: a new provider or a new self-hosted endpoint plugs in behind the same interface. Adding or swapping a model is a config change, not a rewrite, and no agent is wired to a vendor.
Embeddings and reranking are the calls an agent platform makes most. Running them on our own models, on our own GPUs, is the difference between a sovereignty story and a marketing line.
Frontier models for the hard, rare reasoning; owned open models for the cheap, constant, and sensitive work. The split is a deliberate strategy, not a degraded fallback.
Because models route behind one layer, the platform can follow the frontier as it moves to this month's best model, not the one we happened to integrate first, with nothing to change in the agents on top.
The model layer sits inside qOS next to search, cache, and storage. It shares the same hardware, the same isolation, and the same audit trail as everything else an agent touches.
By the numbers
Model names and versions track the frontier; the layer underneath is provider-agnostic and stays put.
Agent-first
Picking a model is not buried in platform settings; it is part of defining an agent. Give a research agent a frontier reasoning model and a triage agent a fast small one, and run them side by side in the same workspace. Underneath, the same layer quietly powers the parts an agent never asks for by name: the embeddings that index its memory, the reranker that sharpens its recall, and the overnight jobs that distill what it learned.
See HQ running in your own Slack or Teams, on the operating system we built for agents.