How to Run Low-Latency AI Inference on the Cerebras CS-3 API
The Cerebras inference API runs Llama 3.3 70B at 2,314 tokens per second, but only if you configure streaming, model choice, and TCP warming correctly.
AnIntent Editorial
The Cerebras inference API is the only public endpoint where a 70B-parameter model returns the first token in under 200 milliseconds and then streams output at speeds independent benchmarkers have measured above 2,300 tokens per second. This tutorial walks through standing up a working low-latency client, picking the right model for your workload, and avoiding the one configuration mistake that silently doubles your time-to-first-token. It is written for developers moving an existing OpenAI-compatible app onto Cerebras, not for ML researchers fine-tuning weights.
The speed gap is real and measurable. According to Artificial Analysis benchmarks discussed by Cerebras SVP Hagay Lupesko, Cerebras delivers 2,314 tokens per second on Llama 3.3 70B versus roughly 32 tokens per second on Amazon Bedrock with the same model, with time-to-first-token around 170 milliseconds. That order-of-magnitude difference is the entire reason this API exists.
Why the CS-3 Hits Speeds GPUs Cannot
The architectural story matters because it explains which workloads will actually feel faster and which will not. Cerebras' official chip page lists the WSE-3 at 46,225 mm² with 4 trillion transistors, 900,000 AI-optimized cores, and 125 petaflops of AI compute, which the company says is 19 times more transistors and 28 times more compute than the NVIDIA B200. Black Scarab's WSE-3 guide notes the wafer is fabricated on TSMC's 5nm process.
The number that actually decides inference latency is memory bandwidth. HeyGoTrade's wafer-versus-GPU breakdown puts WSE-3 memory bandwidth at 21 PB/s, roughly 2,625 times the NVIDIA B200's figure. Cerebras' chip documentation adds that the 44 GB of on-chip SRAM sits next to the compute cores, eliminating the repeated off-chip DRAM reads that dominate per-token latency on conventional GPU stacks.
That is the bottleneck Cerebras targets. As Black Scarab explains, model weights on a GPU have to be pulled from external DRAM across buses and interconnects for every token generated, which is the memory wall. The CS-3 keeps the weights resident on the wafer, so generation becomes a compute exercise rather than a bus contention exercise.
There is a catch the marketing pages bury. For models that exceed a single wafer's 44 GB SRAM, Cerebras implements pipeline parallelism across multiple wafers, splitting layers and flowing generation sequentially. Steady-state token throughput holds up. Time-to-first-token can creep upward. If you are building a voice agent where first-token latency matters more than sustained throughput, prefer a model that fits on a single wafer over a larger one running pipelined.
What You Actually Get From the Cerebras Inference API
The API surface is intentionally boring, which is the point. Cerebras' Python SDK on GitHub ships an OpenAI-shaped client, and the Promptfoo provider docs confirm the chat completions endpoint is a drop-in replacement for OpenAI's, supporting structured outputs with JSON schemas and tool calling.
Model availability shifts month to month. The Cerebras model catalog lists production models including Llama variants and gpt-oss-120b alongside preview models intended for evaluation only, and it notes that llama3.1-8b and qwen-3-235b-a22b-instruct-2507 will be deprecated on May 27, 2026. Treat preview models as throwaway. They get pulled without long lead times.
On quality, the Cerebras model catalog states that public endpoints serve unpruned, original models, with selective weight-only quantization used during storage while sensitive layers stay at full precision with on-the-fly dequantization. That matters if you have ever been burned by a fast endpoint that turned out to be a 4-bit quant masquerading as the full model.
Get a Key, Send Your First Request
The five steps below cover everything required for a working low-latency client against the Cerebras inference API:
- Create an account at cloud.cerebras.ai and generate an API key from the dashboard.
- Export it as CEREBRAS_API_KEY in your shell environment.
- Install the SDK with pip install cerebras-cloud-sdk (Python 3.9 or newer is required, per the SDK README).
- Pick a production model from the catalog. For latency-bound work, start with llama-3.3-70b.
- Construct the client once at process start and reuse it for every request.
A minimal streaming call in Python looks like this:
import os
from cerebras.cloud.sdk import Cerebras

# Build the client once and reuse it (see the TCP-warming note below).
client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Summarise the memory wall in 40 words."}],
    stream=True,
)

# Print tokens as they arrive; delta.content is None on some chunks, hence the "or".
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
That is the entire surface area for a basic call. If you already have an OpenAI client wired into your codebase, the Cerebras provider documentation on Cloudflare and liteLLM's Cerebras page both confirm you can also point an OpenAI SDK at the Cerebras base URL and change nothing else.
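As a rough sketch of that drop-in path, the snippet below points the standard OpenAI Python client at Cerebras. The base URL is an assumption drawn from the compatibility docs cited above, so confirm it against your dashboard before relying on it.

import os
from openai import OpenAI

# Assumption: https://api.cerebras.ai/v1 is the OpenAI-compatible base URL.
client = OpenAI(
    api_key=os.environ["CEREBRAS_API_KEY"],
    base_url="https://api.cerebras.ai/v1",
)

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")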
The One Configuration That Silently Doubles Your Latency
Reconstructing the SDK client on every request is the single most common Cerebras CS-3 tutorial mistake. The official Python SDK warms TCP connections by sending a few requests to /v1/tcp_warming at construction to reduce time-to-first-token, and the README explicitly warns that repeatedly reconstructing the SDK instance leads to poor performance. The fix is one line of architecture: build the client at module scope or in your framework's startup hook, then inject it.
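A minimal sketch of that fix, assuming a FastAPI service (the framework choice is illustrative, not something the SDK requires): the client is built once at import time and every request reuses it.

import os

from cerebras.cloud.sdk import Cerebras
from fastapi import FastAPI

# Built once at import time; the TCP warmup cost is paid here, not per request.
client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])
app = FastAPI()

@app.post("/summarise")
def summarise(prompt: str):
    # Every request reuses the warmed connections instead of rebuilding the client.
    result = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[{"role": "user", "content": prompt}],
    )
    return {"text": result.choices[0].message.content}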
If you are running in a serverless environment where cold starts kill you, set warm_tcp_connection=False in the constructor to skip the warmup probes, then handle pooling at a higher layer with a long-lived gateway. The warmup is helpful only when the client survives long enough to amortise its cost.
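In that serverless case the constructor call is the only change, sketched here under the same assumptions as above:

import os
from cerebras.cloud.sdk import Cerebras

# Skip the warmup probes; a short-lived worker never amortises their cost.
client = Cerebras(
    api_key=os.environ["CEREBRAS_API_KEY"],
    warm_tcp_connection=False,
)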
The second latency trap is forgetting to stream. A blocking chat.completions.create call without stream=True waits for the entire generation before returning, which throws away the throughput advantage that justifies using Cerebras in the first place. Always stream when you are user-facing.
Picking the Right Model for the Workload
The model catalog has expanded well beyond Llama. The Cerebras inference examples repository notes that the API now serves OpenAI's GPT-OSS, Meta's Llama family, and Alibaba's Qwen models, with structured outputs and tool calling supported across the production tier.
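A hedged sketch of a structured-output call is below. The response_format payload follows the OpenAI json_schema convention the compatibility docs describe; treat the exact field names as an assumption and verify them against the current API reference.

import json
import os

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

# JSON schema the model's output must conform to.
schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,
}

completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Classify the sentiment: 'First-token latency dropped by half.'"}],
    # Assumption: OpenAI-style json_schema response_format is accepted as-is.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "sentiment_label", "strict": True, "schema": schema},
    },
)
print(json.loads(completion.choices[0].message.content))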
For a Cerebras token throughput benchmark to anchor decisions against, the Cerebras Llama 4 launch announcement cites Artificial Analysis measuring over 2,600 output tokens per second on Llama 4 Scout with an end-to-end response time of 0.5 seconds. On reasoning models, the Cerebras DeepSeek R1 announcement clocks DeepSeek-R1-Distill-Llama-70B at more than 1,500 tokens per second.
For a Cerebras vs NVIDIA inference speed comparison on reasoning workloads specifically, HeyGoTrade relays a Cerebras claim that on a 1,024-input, 4,096-output-token reasoning workload the CS-3 runs 21 times faster than a DGX B200 system, with each CS-3 drawing 23 kW and landing roughly 32 percent below B200 total cost of ownership once capex and energy opex are combined. Cerebras' own broader claim from its chip page is up to 15 times faster inference and lower infrastructure cost than GPU clouds, with the standard disclaimer that performance varies by workload.
A practical workload-to-model map, turned into a lookup table in the sketch after this list:
- Voice agents and chat UIs: Llama 3.3 70B for the best balance of quality and per-user latency.
- Frontier reasoning with chain-of-thought: a DeepSeek R1 distill or gpt-oss-120b, accepting slightly higher time-to-first-token in exchange for output quality.
- Long-context document work: a 128K-context Llama variant. The Cerebras Llama 405B blog post reports 240 ms time-to-first-token at 128K context, which is the figure to beat from any GPU API.
- Cheap classification at scale: stick with an 8B model while it remains in the production tier, and watch the deprecation notices.
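Expressed as code, the map is just a lookup table with a sensible default. The model IDs mirror the examples above; the reasoning and long-context rows are assumptions to replace with whatever the live catalog lists.

# Workload-to-model lookup. IDs for the reasoning and long-context rows are
# illustrative assumptions; check the production catalog before shipping.
WORKLOAD_MODEL = {
    "voice_agent": "llama-3.3-70b",
    "chat_ui": "llama-3.3-70b",
    "frontier_reasoning": "gpt-oss-120b",
    "long_context_documents": "llama-3.3-70b",  # swap for a 128K-context variant from the catalog
    "bulk_classification": "llama3.1-8b",       # production today, deprecation already announced
}

def pick_model(workload: str) -> str:
    # Unmapped workloads fall back to the general-purpose default.
    return WORKLOAD_MODEL.get(workload, "llama-3.3-70b")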
When You Should Not Use Cerebras
No wafer scale engine API guide is complete without acknowledging the trade-offs. Black Scarab is direct that the CS-3 is not edge hardware in the traditional sense and does not sit inside a camera, drone, or portable device; its value is as centralised reasoning infrastructure. If your workload runs on-device or in an air-gapped environment without external connectivity, this API is not your tool.
The ecosystem story is also a real cost. HeyGoTrade notes Nvidia delivered $215.9 billion in fiscal 2026 revenue against Cerebras' $510 million in 2025, roughly 423 times the revenue, anchored by the CUDA software moat and about 90 percent of AI accelerator market share. Custom kernels, niche fine-tuning tooling, and most third-party MLOps integrations still assume CUDA first.
And the bear case has teeth. On the day of the Cerebras IPO covered by CNBC, Davidson analysts described the product as "niche-y" and warned the wafer is still in early stages of maturity and less flexible than existing AI chip systems. The same report notes the shares priced at $185, closed their first day at $331.07 for a roughly $95 billion market cap, and fell 10 percent on day two. If you are betting a production stack on a single-vendor architecture, model fallback paths to an OpenAI-compatible second provider into your client layer from day one. Some teams find more durable guidance in our notes on AI infrastructure architecture and on resilient developer tooling choices.
Measure Your Own Latency Before You Trust Anyone's Number
Marketing numbers are point measurements. Real production latency depends on your prompt length, output length, region, network path, and concurrency. A worthwhile benchmark script records four things on every request: time-to-first-token, total wall-clock, tokens per second across the streamed output, and the prompt and completion token counts returned in the response.
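A sketch of such a script is below, assuming the streamed response's final chunk may carry a usage object with token counts; if yours does not, the fallback is a rough count of content chunks.

import os
import time

from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])

def benchmark(prompt: str, model: str = "llama-3.3-70b") -> dict:
    start = time.perf_counter()
    first_token_at = None
    chunks_seen = 0
    usage = None

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # time-to-first-token marker
            chunks_seen += 1
        # Assumption: a late chunk may carry a usage object; keep it if present.
        usage = getattr(chunk, "usage", None) or usage

    wall_clock = time.perf_counter() - start
    ttft = first_token_at - start if first_token_at is not None else None
    completion_tokens = usage.completion_tokens if usage else chunks_seen  # rough fallback
    decode_window = wall_clock - ttft if ttft is not None else wall_clock
    return {
        "ttft_s": ttft,
        "wall_clock_s": wall_clock,
        "tokens_per_s": completion_tokens / decode_window if decode_window > 0 else None,
        "prompt_tokens": usage.prompt_tokens if usage else None,
        "completion_tokens": completion_tokens,
    }

print(benchmark("Summarise the memory wall in 40 words."))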
Run it against your actual prompts, not synthetic ones, at the concurrency you expect in production. The Cerebras 405B benchmark blog sets a useful reference point: 969 output tokens per second on Llama 3.1 405B with 240 ms time-to-first-token and 128K context. If your numbers are within a factor of two of that on the matching model, your client is configured correctly. If they are an order of magnitude off, revisit TCP warming, streaming, and whether you are accidentally hitting a preview endpoint.
The next step is wiring a circuit breaker that falls back to an OpenAI-compatible secondary provider when Cerebras returns a 429 or 5xx. That is what turns a fast endpoint into a production dependency rather than a demo.
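The simplest form of that fallback is per-request rather than a stateful breaker, and might look like the sketch below. The secondary provider, its environment variables, the fallback model name, and the APIStatusError exception type are assumptions to verify against the SDKs you actually deploy.

import os

from cerebras.cloud.sdk import APIStatusError, Cerebras
from openai import OpenAI

primary = Cerebras(api_key=os.environ["CEREBRAS_API_KEY"])
# Assumption: any OpenAI-compatible secondary provider, configured via env vars.
secondary = OpenAI(
    api_key=os.environ["FALLBACK_API_KEY"],
    base_url=os.environ["FALLBACK_BASE_URL"],
)

def complete_with_fallback(messages, model="llama-3.3-70b", fallback_model="fallback-model-id"):
    try:
        return primary.chat.completions.create(model=model, messages=messages)
    except APIStatusError as err:
        # 429s and 5xx responses trip the fallback; anything else is a real bug.
        if err.status_code == 429 or err.status_code >= 500:
            return secondary.chat.completions.create(model=fallback_model, messages=messages)
        raise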
Frequently Asked Questions
Is the Cerebras inference API compatible with the OpenAI SDK?
Yes. Cerebras offers an OpenAI-compatible chat completions endpoint that works as a drop-in replacement for OpenAI's API, supporting structured outputs with JSON schemas and tool calling. You can either use the official cerebras-cloud-sdk Python library or point an existing OpenAI client at the Cerebras base URL.
How much does Cerebras inference cost compared to GPU clouds?
At launch, Cerebras priced Llama 3.1 8B at 10 cents per million tokens and Llama 3.1 70B at 60 cents per million tokens on its Developer Tier, with Llama 3.1 405B at $6 per million input tokens and $12 per million output tokens. HeyGoTrade reports CS-3 total cost of ownership lands roughly 32 percent below an NVIDIA DGX B200 system when capex and energy opex are combined.
Can Cerebras run models larger than 44 GB of weights?
Yes. For models that exceed a single wafer's 44 GB on-chip SRAM, Cerebras uses pipeline parallelism, splitting layers across multiple wafers with generation flowing sequentially. Steady-state token throughput is maintained, though time-to-first-token may increase slightly compared to a model that fits on a single wafer.
Which large customers actually use Cerebras inference in production?
OpenAI signed a multi-year agreement valued above $20 billion in January 2026 for dedicated low-latency inference on Cerebras, and Meta uses Cerebras for select Llama inference workloads at production scale. Cerebras also powers Mistral's Le Chat assistant and Perplexity's Sonar model.
Why did Cerebras stock drop after its IPO if the technology is faster than GPUs?
Cerebras priced its Nasdaq debut at $185 on May 14, 2026, closed day one at $331.07 for roughly a $95 billion market cap, then fell 10 percent on day two. Davidson analysts called the product niche and noted the wafer is still in early stages of maturity and less flexible than existing AI chip systems, while Nvidia continues to hold approximately 90 percent of AI accelerator market share.