Agent Inference

Every frontier model, plus the ones we run ourselves.

Agent Inference is the model layer beneath HQ. Every agent reaches frontier models from Anthropic, OpenAI, xAI, and Cerebras through one interface, routed to whichever fits the task, alongside the open models we serve on our own GPUs. The cheap, constant, and sensitive work, embeddings and reranking included, stays on hardware we run, never shipped to someone else's API.

4 providers: Anthropic, OpenAI, xAI, Cerebras

Self-hosted: open models on GPUs we run

One layer: swap models without touching agents

Why Agent Inference

Betting the platform on one model is betting wrong.

Pick a single model vendor and you inherit their outages, their price changes, their rate limits, and their roadmap. The best model for deep reasoning, for fast triage, for cheap bulk work, and for a sovereign deployment is rarely the same model, and the ranking shifts every few months.

There is a quieter problem underneath. Most agent platforms ship your data to a third-party API just to embed or rerank it, and those are the highest-volume calls they make. Your most sensitive text leaves the platform for a step you never see.

Agent Inference answers both: a provider-agnostic layer that routes every call to the model that fits, and serves the high-volume and data-sensitive paths on hardware we run ourselves.

Capabilities

What it can do.

Frontier models, routed

Reach Claude from Anthropic, GPT from OpenAI, Grok from xAI, and high-throughput open-model inference from Cerebras, all behind one interface. Each call goes to the model that wins on quality, latency, or cost for that task.
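As a rough illustration of what routing on quality, latency, or cost can look like, here is a minimal sketch. The model labels, the scores, and the `route` function are hypothetical and invented for this example; they are not the platform's actual catalog or routing policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str             # hypothetical label, not a real model ID
    quality: float        # higher is better (made-up score)
    latency_ms: float     # lower is better (made-up figure)
    cost_per_mtok: float  # lower is better (made-up price per million tokens)


# Illustrative catalog only; none of these numbers come from the platform.
CATALOG = [
    ModelProfile("frontier-reasoner", quality=0.95, latency_ms=2500, cost_per_mtok=15.0),
    ModelProfile("fast-small", quality=0.70, latency_ms=300, cost_per_mtok=0.5),
    ModelProfile("self-hosted-open", quality=0.80, latency_ms=800, cost_per_mtok=0.2),
]


def route(priority: str, catalog=CATALOG) -> ModelProfile:
    """Pick the model that wins on the single dimension the task cares about."""
    if priority == "quality":
        return max(catalog, key=lambda m: m.quality)
    if priority == "latency":
        return min(catalog, key=lambda m: m.latency_ms)
    if priority == "cost":
        return min(catalog, key=lambda m: m.cost_per_mtok)
    raise ValueError(f"unknown priority: {priority!r}")
```

A real router would likely weigh several dimensions at once; picking on one keeps the idea visible.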

Open models we serve ourselves

We run open-weight models, gpt-oss-120b among them, on our own GPUs. The cheap, high-volume, and data-sensitive paths never have to leave our infrastructure to get answered.

A model per agent

Give each agent the model that fits its job: a frontier model for hard reasoning, a fast small one for triage. Run a fleet where every member is on a different model at once.

Embeddings, in-house

The vectors behind memory and search come from an embedding model we host, so the text being indexed is never sent to an outside API to be turned into numbers.

Reranking, in-house

A cross-encoder reranking model, also self-hosted, does the final relevance pass on every memory and corpus lookup, sharpening recall with no third-party round-trip.
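The embed-then-rerank flow can be sketched in two stages. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a hosted embedding model, and `pair_score` (shared word pairs) is a stand-in for a cross-encoder that scores the query and document together. Neither reflects the actual models in use.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """Stage 1: cheap vector similarity over the whole corpus, keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def pair_score(query: str, doc: str) -> int:
    """Stage 2 stand-in for a cross-encoder: count adjacent word pairs shared
    by query and document, so word order matters, unlike stage 1."""
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))
    return len(bigrams(query) & bigrams(doc))


def search(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retrieve k candidates, then rerank only those candidates."""
    candidates = retrieve(query, corpus, k)
    return sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)


CORPUS = [
    "reset the billing password",
    "billing password reset steps for admins",
    "gpu capacity planning notes",
]
```

With the query "billing password reset", stage 1 ranks the shorter document first, and the rerank pass promotes the one that matches the phrase in order. Because both stages run in-process here, nothing in the sketch calls out to a third party, which is the property the page describes.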

A batch path at fleet scale

Large non-interactive jobs run through a batch lane that costs up to roughly 50% less, for work that can wait a while rather than answer in the moment.
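To make the batch trade-off concrete, here is a small sketch of lane selection and the cost arithmetic. The 50% discount is taken as the best case from the figure above; the 24-hour turnaround, the price, and the function names are assumptions made for illustration only.

```python
BATCH_DISCOUNT = 0.5                 # best case implied by "roughly half"
BATCH_TURNAROUND_S = 24 * 60 * 60    # hypothetical: batch results within a day


def pick_lane(can_wait_s: float) -> str:
    """Use the batch lane only when the job can wait out the batch turnaround."""
    return "batch" if can_wait_s >= BATCH_TURNAROUND_S else "interactive"


def estimate_cost(tokens: int, price_per_mtok: float, lane: str) -> float:
    """Cost in the same currency as price_per_mtok (price per million tokens)."""
    base = tokens / 1_000_000 * price_per_mtok
    return base * (1 - BATCH_DISCOUNT) if lane == "batch" else base
```

Under these assumptions, a 2-million-token overnight job priced at 10.0 per million tokens costs 20.0 on the interactive lane and 10.0 on the batch lane.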

Automatic failover

If a provider errors or goes down, the call retries against a healthy one inside the same request, so a vendor outage does not become your outage.
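In-request failover can be sketched as a loop over an ordered list of providers. `ProviderError`, `complete_with_failover`, and the two fake providers are hypothetical names for this example; how the platform detects an unhealthy provider is not described here.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised when a provider call fails or the provider is unavailable."""


Provider = tuple[str, Callable[[str], str]]  # (label, callable taking a prompt)


def complete_with_failover(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order; return (provider_label, completion) from the
    first one that succeeds, all within the same caller-facing request."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            failures.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(failures))


# Fake providers for demonstration only.
def down(prompt: str) -> str:
    raise ProviderError("503 from upstream")


def healthy(prompt: str) -> str:
    return f"echo: {prompt}"
```

Calling `complete_with_failover("hi", [("primary", down), ("secondary", healthy)])` returns the secondary provider's answer, and the caller never sees the primary's error.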

Three agent runtimes, your pick

Each agent runs on an official agent loop: the Claude Agent SDK, the OpenAI Agents SDK, or the open OpenCode runtime, never a hand-rolled wrapper. The runtime owns tool dispatch, MCP wiring, and session state; the model layer just hands it the model it asked for.

Independent of the runtime

The model and the runtime are two separate dials: any model can run under any of those runtimes, so each choice is made on its own merits and never forced by the other.
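One way to picture the two dials is an agent definition where model and runtime are validated separately, so every pairing is allowed. The identifiers below are illustrative labels for the models and runtimes named on this page, not real configuration keys.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical labels for the three runtimes and a few of the models mentioned.
RUNTIMES = ("claude-agent-sdk", "openai-agents-sdk", "opencode")
MODELS = ("claude", "gpt", "grok", "gpt-oss-120b")


@dataclass(frozen=True)
class AgentSpec:
    name: str
    model: str
    runtime: str

    def __post_init__(self):
        # Each dial is checked on its own; no rule ties a model to a runtime.
        if self.model not in MODELS:
            raise ValueError(f"unknown model: {self.model!r}")
        if self.runtime not in RUNTIMES:
            raise ValueError(f"unknown runtime: {self.runtime!r}")


def all_pairings() -> list[AgentSpec]:
    """Every model under every runtime is a valid agent definition."""
    return [AgentSpec(f"{m}-on-{r}", m, r) for m, r in product(MODELS, RUNTIMES)]
```

Four models under three runtimes gives twelve valid combinations in this sketch, with no pairing ruled out.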

Built different

Why it is not just a key pasted into a SaaS.

Provider-agnostic by architecture

The model layer is a clean trait surface: a new provider or a new self-hosted endpoint plugs in behind the same interface. Adding or swapping a model is a config change, not a rewrite, and no agent is wired to a vendor.

We own the high-volume path

Embeddings and reranking are the calls an agent platform makes most. Running them on our own models, on our own GPUs, is the difference between a sovereignty story and a marketing line.

Two tiers on purpose

Frontier models for the hard, rare reasoning; owned open models for the cheap, constant, and sensitive work. The split is a deliberate strategy, not a degraded fallback.

No lock-in, by design

Because models route behind one layer, the platform can follow the frontier as it moves, to this month's best model rather than the one we happened to integrate first, with nothing to change in the agents on top.

Inference is part of the OS

The model layer sits inside qOS next to search, cache, and storage. It shares the same hardware, the same isolation, and the same audit trail as everything else an agent touches.

By the numbers

The model layer, by the numbers.

Model names and versions track the frontier; the layer underneath is provider-agnostic and stays put.

4 frontier providers, one interface

2 tiers: routed frontier plus owned open models

In-house: embeddings and reranking on our GPUs

~50% cheaper on the batch path

Agent-first

The model is a choice the agent's owner makes.

Picking a model is not buried in platform settings; it is part of defining an agent. Give a research agent a frontier reasoning model and a triage agent a fast small one, and run them side by side in the same workspace. Underneath, the same layer quietly powers the parts an agent never asks for by name: the embeddings that index its memory, the reranker that sharpens its recall, and the overnight jobs that distill what it learned.

Per-agent model selection, so a fleet can run each member on the model that fits its job.
The model behind memory and search is ours, so indexing and recall happen without shipping text out.
Consolidation and distillation run on the cheaper batch lane, not the interactive one.

Put Agent Inference to work.

See HQ running in your own Slack or Teams, on the operating system we built for agents.
