
Cerebras


Ultra-fast inference on custom wafer-scale hardware with OpenAI-compatible API

Cerebras provides AI inference powered by its custom Wafer-Scale Engine processors, delivering output speeds up to 15x faster than GPU-based alternatives. The platform offers cloud, dedicated, and on-premise deployment options and supports open-source models including Llama, Qwen, and others. The API is OpenAI-compatible, and the platform is SOC 2 and HIPAA certified.

Pricing: Per token usage

Hosting Cloud
Pricing Freemium, free tier available
HQ 🇺🇸 United States
Founded 2015

What is Cerebras Inference?

Cerebras is an AI chip company that builds custom Wafer-Scale Engine (WSE) processors designed specifically for AI workloads. The inference API service runs on this proprietary hardware, which trades GPU generality for dense on-chip memory and high sustained output speed on supported models.

Supported Models

The API currently supports a focused set of models, with published output speeds on the Cerebras pricing page: GPT-OSS 120B (~3,000 tokens/s), Llama 3.1 8B (~2,200 tokens/s), Qwen 3 235B Instruct (~1,400 tokens/s), and Z.ai GLM 4.7 (~1,000 tokens/s). Additional model families are available through dedicated endpoints for enterprise customers. All models are accessed through an OpenAI-compatible API, so switching from OpenAI or another compatible provider usually means changing a base URL and API key.

Free Tier and Pricing

Cerebras offers a free tier with access to all supported models and daily rate limits, useful for prototyping without a credit card. The Developer tier starts at $10 with 10x higher rate limits than the free tier. Enterprise pricing is custom and includes dedicated queue priority, fine-tuning, and support guarantees. Per-token pricing at the time of writing: GPT-OSS 120B at $0.35/$0.75 per 1M (input/output), Llama 3.1 8B at $0.10/$0.10, Qwen 3 235B at $0.60/$1.20, Z.ai GLM 4.7 at $2.25/$2.75.
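A back-of-envelope cost estimate from the per-token prices quoted above (all prices per 1M tokens; check the live pricing page before relying on them, and note the model keys below are informal labels, not official API ids):

```python
# (input $/1M tokens, output $/1M tokens), from the prices quoted above
PRICES = {
    "gpt-oss-120b": (0.35, 0.75),
    "llama-3.1-8b": (0.10, 0.10),
    "qwen-3-235b":  (0.60, 1.20),
    "glm-4.7":      (2.25, 2.75),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost for a job with the given token counts."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# e.g. 2M input + 1M output tokens on GPT-OSS 120B:
print(round(estimate_cost("gpt-oss-120b", 2_000_000, 1_000_000), 2))  # 1.45
```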

Deployment Options

Three deployment tiers are available: Cloud (API key access to shared infrastructure), Dedicated (custom models on private cloud), and On-Premise (full control over hardware and data). The platform holds SOC 2 and HIPAA certifications, which matters for healthcare and regulated industry use cases.

Who Should Use Cerebras

Cerebras is a good fit for throughput-bound workloads where generated tokens per second is the primary constraint: batch generation, long-form output, reasoning chains, and agent loops that produce many tokens before returning to the user. For workloads where first-token latency matters more than sustained throughput, the comparison worth running is against Groq's LPU stack. For broader model catalogs and lower per-token cost on current open-source frontier models (Kimi K2, Qwen3.5, DeepSeek V3.2, GLM-5, MiniMax-M2), GPU-based providers such as DeepInfra, Together.ai, and Fireworks are worth benchmarking alongside.
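For sizing a throughput-bound job, the published sustained output speeds translate directly into wall-clock estimates. A rough sketch using the tokens-per-second figures quoted above (real speeds vary with load and prompt shape; model keys are informal labels):

```python
# Published sustained output speeds in tokens/s, from the figures above
SPEEDS = {
    "gpt-oss-120b": 3000,
    "llama-3.1-8b": 2200,
    "qwen-3-235b":  1400,
    "glm-4.7":      1000,
}

def generation_seconds(model: str, output_tokens: int) -> float:
    """Ideal time to emit output_tokens at the published sustained rate."""
    return output_tokens / SPEEDS[model]

# A 50k-token batch job on Llama 3.1 8B:
print(round(generation_seconds("llama-3.1-8b", 50_000), 1))  # 22.7
```

Estimates like this only bound generation time; first-token latency and queueing are separate and are where the Groq comparison mentioned above becomes relevant.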
