Cerebras
Ultra-fast inference on custom wafer-scale hardware with OpenAI-compatible API
Cerebras provides AI inference powered by its custom Wafer-Scale Engine processors, delivering output speeds up to 15x faster than GPU-based alternatives. The platform offers cloud, dedicated, and on-premise deployment options with support for open-source models including Llama, Qwen, and others. The API is OpenAI-compatible, and the platform is SOC 2 and HIPAA certified.
Pricing: Per token usage
What is Cerebras Inference?
Cerebras is an AI chip company that builds custom wafer-scale processors (the Wafer-Scale Engine, or WSE) designed specifically for AI workloads. The inference API service runs on this proprietary hardware, which trades GPU generality for dense on-chip memory and high sustained output speed on supported models.
Supported Models
The API currently supports a focused set of models, with published output speeds on the Cerebras pricing page: GPT-OSS 120B (~3,000 tokens/s), Llama 3.1 8B (~2,200 tokens/s), Qwen 3 235B Instruct (~1,400 tokens/s), and Z.ai GLM 4.7 (~1,000 tokens/s). Additional model families are available through dedicated endpoints for enterprise customers. All models are accessed through an OpenAI-compatible API, so switching from OpenAI or another compatible provider usually means changing a base URL and API key.
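Because the API follows the OpenAI chat-completions shape, a request is an ordinary authenticated POST. A minimal sketch using only the Python standard library; the base URL and model id here are assumptions, so check the Cerebras docs for the current values:

```python
import json
import os
import urllib.request

# Assumed endpoint, mirroring the OpenAI API layout; verify in the Cerebras docs.
BASE_URL = "https://api.cerebras.ai/v1"

def build_chat_request(model, messages, api_key):
    """Build an OpenAI-style chat-completions request for the Cerebras API."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# "llama3.1-8b" is an illustrative model id, not confirmed from the source.
req = build_chat_request(
    "llama3.1-8b",
    [{"role": "user", "content": "Hello"}],
    os.environ.get("CEREBRAS_API_KEY", "demo-key"),
)
# To actually send it: urllib.request.urlopen(req)
```

Switching an existing OpenAI-SDK integration works the same way: point the client's base URL at the Cerebras endpoint and swap the API key.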
Free Tier and Pricing
Cerebras offers a free tier with access to all supported models and daily rate limits, useful for prototyping without a credit card. The Developer tier starts at $10 with 10x higher rate limits than the free tier. Enterprise pricing is custom and includes dedicated queue priority, fine-tuning, and support guarantees. Per-token pricing at the time of writing: GPT-OSS 120B at $0.35/$0.75 per 1M (input/output), Llama 3.1 8B at $0.10/$0.10, Qwen 3 235B at $0.60/$1.20, Z.ai GLM 4.7 at $2.25/$2.75.
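Per-token billing makes request costs easy to estimate. A small sketch using the prices listed above; the model keys are illustrative, not official API identifiers:

```python
# (input_price, output_price) in USD per 1M tokens, as listed above.
PRICES = {
    "gpt-oss-120b": (0.35, 0.75),
    "llama3.1-8b": (0.10, 0.10),
    "qwen-3-235b": (0.60, 1.20),
    "zai-glm-4.7": (2.25, 2.75),
}

def request_cost(model, input_tokens, output_tokens):
    """Estimate the USD cost of one request from its token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# 10k input + 2k output tokens on GPT-OSS 120B -> $0.005
cost = request_cost("gpt-oss-120b", 10_000, 2_000)
```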
Deployment Options
Three deployment tiers are available: Cloud (API key access to shared infrastructure), Dedicated (custom models on private cloud), and On-Premise (full control over hardware and data). The platform holds SOC 2 and HIPAA certifications, which matters for healthcare and regulated industry use cases.
Who Should Use Cerebras
Cerebras is a good fit for throughput-bound workloads where generated tokens per second is the primary constraint: batch generation, long-form output, reasoning chains, and agent loops that produce many tokens before returning to the user. For workloads where first-token latency matters more than sustained throughput, the comparison worth running is against Groq's LPU stack. For broader model catalogs and lower per-token cost on current open-source frontier models (Kimi K2, Qwen3.5, DeepSeek V3.2, GLM-5, MiniMax-M2), GPU-based providers like DeepInfra, Together.ai, and Fireworks are worth benchmarking alongside it.
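When running that comparison, it helps to measure first-token latency and sustained throughput separately, since a provider can win one and lose the other. A sketch of both metrics computed from the per-token arrival timestamps of a streamed response; the timestamps below are fabricated for illustration:

```python
def stream_metrics(start, token_times):
    """Given the request start time and per-token arrival timestamps
    from a streaming response, return (ttft, tokens_per_second).

    ttft: time to first token, the latency metric.
    tokens_per_second: sustained generation rate after the first token,
    the throughput metric.
    """
    ttft = token_times[0] - start
    gen_time = token_times[-1] - token_times[0]
    tps = (len(token_times) - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, tps

# Fabricated example: first token after 200 ms, then 1 token per ms.
times = [0.2 + i * 0.001 for i in range(1001)]
ttft, tps = stream_metrics(0.0, times)  # ttft ~0.2 s, tps ~1000 tok/s
```

In a real benchmark, the timestamps would come from recording `time.monotonic()` as each streamed chunk arrives.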
Cerebras Alternatives
Explore 61 products in the Inference APIs category.
EUrouter
European AI gateway that routes to 100+ models with EU data residency
AKI.IO
European AI API for open-source models on EU infrastructure
Jina AI
Search APIs for embeddings, reranking, and web-to-markdown conversion