Direct-route inference · online (last-known)

Direct-Route
AI Inference

Every request runs on a dedicated, single-tenant instance— not a shared, queue-based pool — so your first token isn't stuck behind other tenants' traffic. The result: consistent, low TTFT and zero data retention, every call.

Try on OpenRouter View Performance Dashboard

TTFT (last-known)

64ms

Throughput (last-known)

62 tok/s

Uptime 30d

99.98%

Data Retained

0 bytes

Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8

Why direct-route

Built for developers who can't afford a slow first token.

Four architectural guarantees that separate a dedicated node from queue-based routing.

Single-Instance Direct-Route

Requests route straight to a dedicated, single-tenant instance, bypassing the global scheduling queues of queue-based routing. Execution starts immediately — ultra-low, consistent TTFT on every call.

Zero middleware delay

Zero Data Retention + KV Isolation

Prompts and completions live entirely in GPU memory and are destroyed the instant a request completes. Page-level KV cache isolation makes cross-session context pollution physically impossible.

Memory-only · no disk write

Horizontal Scaling Fleet

The gateway scales horizontally. As volume grows, dedicated hardware instances are provisioned into the routing mesh under one unified endpoint — seamless capacity, no cold pools.

Unified endpoint

Guaranteed Performance Integrity

Models run at advertised precision at all times — no dynamic quantization downgrades, no context truncation under load. Plus direct-to-architect support, 24/7, bypassing tier-1 ticket queues.

No degradation · direct support

Zero Data Retention

Your data's entire lifecycle. It never touches a disk.

Every request flows through a memory-only pipeline and is destroyed the instant it completes. There is no logging stage to opt out of — persistence simply doesn't exist in the path.

User Request

Prompt + params

Routing Layer

Request ingress

Secure Edge

TLS / HTTPS

Inference Gateway

Direct-route proxy

In-Memory GPU

Paged KV, no disk

Immediate Destruction

0 bytes persisted

No disk writes Page-level KV isolation No cross-session pollution

TTFT Consistency

A flat line is the whole point.

On queue-based routing, first-token latency rises and falls with how deep the shared queue runs at that moment. A dedicated direct-route sidesteps the queue entirely, holding a flat line request after request.

Queue-based routingGWMM Single-Instance Direct-Route

Time to First Token · ms

Direct-route typical

64 ms

Direct-route jitter

±6 ms

Queued peak *

1,240 ms

Consistency gain *

19.4×

* shared pool metrics are illustrative of average queue-based routing degradation under load

Guest Playground

Feel the speed. No signup.

Query a live model and watch tokens render the instant they arrive — no typewriter easing, no buffering. What you see is the raw stream.

direct-route node

Ask anything. You're talking to a dedicated direct-route node — responses stream raw, at the speed the GPU produces them.

5 of 5 free guest queries remainingGet unlimited on OpenRouter

Scaling Fleet Roadmap

One node today. A routing mesh tomorrow.

Capacity expands horizontally under a single unified endpoint. New dedicated hardware joins the mesh as demand grows — your integration never changes.

Node-01

Single-tenant

Active / Online

Node-02

Reserved

Scaling Target

Node-03

Reserved

Scaling Target

Node-04

Reserved

Scaling Target

Node-05

Reserved

Scaling Target

Node-06

Reserved

Scaling Target

Node-07

Reserved

Scaling Target

Node-08

Reserved

Scaling Target

Online · full precision Capacity scaling on demand Single unified endpoint

Models & Pricing

Open weights, native precision, honest pricing.

Pay only for tokens, reconciled from metadata — never from your request content. Every model routes through the same dedicated endpoint, with more joining the catalog as we scale.

Live now

Gemma 4 12B IT

Flagship · FP8

Input / 1M tokens$0.05

Output / 1M tokens$0.15

Native FP8 — served at advertised precision, no downgrade under load

Dedicated single-tenant direct-route

~64ms typical TTFT

Page-level KV isolation

Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8

Route on OpenRouter

Coming Soon / Roadmap

Gemma 4 26B IT

A4B MoE · High-capacity

Input / 1M tokens$0.05

Output / 1M tokens$0.15

Mixture-of-Experts, deeper reasoning

Same direct-route guarantees

Advertised precision under load

Direct-to-architect support

Measured Jun 2026 · single RTX 5090 node · Gemma 4 26B MoE FP8

Route on OpenRouter

Start routing

Route your next request directly.

One dedicated, single-tenant node — not a shared queue — with zero data retained. Point OpenRouter at GWMM and feel the difference on the first token.

Try on OpenRouter View live status