Direct-route inference · online (last-known)

Direct-Route
AI Inference

Every request runs on a dedicated, single-tenant instance— not a shared, queue-based pool — so your first token isn't stuck behind other tenants' traffic. The result: consistent, low TTFT and zero data retention, every call.

TTFT (last-known)
64ms
Throughput (last-known)
62 tok/s
Uptime 30d
99.98%
Data Retained
0 bytes
Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8
Why direct-route

Built for developers who can't afford a slow first token.

Four architectural guarantees that separate a dedicated node from queue-based routing.

Single-Instance Direct-Route

Requests route straight to a dedicated, single-tenant instance, bypassing the global scheduling queues of queue-based routing. Execution starts immediately — ultra-low, consistent TTFT on every call.

Zero middleware delay

Zero Data Retention + KV Isolation

Prompts and completions live entirely in GPU memory and are destroyed the instant a request completes. Page-level KV cache isolation makes cross-session context pollution physically impossible.

Memory-only · no disk write

Horizontal Scaling Fleet

The gateway scales horizontally. As volume grows, dedicated hardware instances are provisioned into the routing mesh under one unified endpoint — seamless capacity, no cold pools.

Unified endpoint

Guaranteed Performance Integrity

Models run at advertised precision at all times — no dynamic quantization downgrades, no context truncation under load. Plus direct-to-architect support, 24/7, bypassing tier-1 ticket queues.

No degradation · direct support
Zero Data Retention

Your data's entire lifecycle. It never touches a disk.

Every request flows through a memory-only pipeline and is destroyed the instant it completes. There is no logging stage to opt out of — persistence simply doesn't exist in the path.

User Request

Prompt + params

Routing Layer

Request ingress

Secure Edge

TLS / HTTPS

Inference Gateway

Direct-route proxy

In-Memory GPU

Paged KV, no disk

Immediate Destruction

0 bytes persisted

No disk writes Page-level KV isolation No cross-session pollution
TTFT Consistency

A flat line is the whole point.

On queue-based routing, first-token latency rises and falls with how deep the shared queue runs at that moment. A dedicated direct-route sidesteps the queue entirely, holding a flat line request after request.

Queue-based routingGWMM Single-Instance Direct-Route
Time to First Token · ms
03006009001200
Direct-route typical
64 ms
Direct-route jitter
±6 ms
Queued peak *
1,240 ms
Consistency gain *
19.4×
* shared pool metrics are illustrative of average queue-based routing degradation under load
Guest Playground

Feel the speed. No signup.

Query a live model and watch tokens render the instant they arrive — no typewriter easing, no buffering. What you see is the raw stream.

direct-route node
GW
Ask anything. You're talking to a dedicated direct-route node — responses stream raw, at the speed the GPU produces them.
5 of 5 free guest queries remainingGet unlimited on OpenRouter
Scaling Fleet Roadmap

One node today. A routing mesh tomorrow.

Capacity expands horizontally under a single unified endpoint. New dedicated hardware joins the mesh as demand grows — your integration never changes.

Node-01
Single-tenant
Active / Online
Node-02
Reserved
Scaling Target
Node-03
Reserved
Scaling Target
Node-04
Reserved
Scaling Target
Node-05
Reserved
Scaling Target
Node-06
Reserved
Scaling Target
Node-07
Reserved
Scaling Target
Node-08
Reserved
Scaling Target
Online · full precision Capacity scaling on demand Single unified endpoint
Models & Pricing

Open weights, native precision, honest pricing.

Pay only for tokens, reconciled from metadata — never from your request content. Every model routes through the same dedicated endpoint, with more joining the catalog as we scale.

Live now

Gemma 4 12B IT

Flagship · FP8
Input / 1M tokens$0.05
Output / 1M tokens$0.15
Native FP8 — served at advertised precision, no downgrade under load
Dedicated single-tenant direct-route
~64ms typical TTFT
Page-level KV isolation
Measured Jun 2026 · single RTX 5090 node · Gemma 4 12B FP8
Route on OpenRouter
Coming Soon / Roadmap

Gemma 4 26B IT

A4B MoE · High-capacity
Input / 1M tokens$0.05
Output / 1M tokens$0.15
Mixture-of-Experts, deeper reasoning
Same direct-route guarantees
Advertised precision under load
Direct-to-architect support
Measured Jun 2026 · single RTX 5090 node · Gemma 4 26B MoE FP8
Route on OpenRouter
Start routing

Route your next request directly.

One dedicated, single-tenant node — not a shared queue — with zero data retained. Point OpenRouter at GWMM and feel the difference on the first token.