🤖 Inference APIs
Inference APIs and runtimes give applications access to AI models, especially LLMs, for text generation and processing. They form the foundation of many AI products and make advanced models easy to integrate, without requiring developers to build or operate specialized infrastructure.
About Inference APIs
Inference API providers give developers access to large language models without managing GPU infrastructure. These services expose model endpoints via REST or gRPC APIs, handling scaling, load balancing, and hardware optimization behind the scenes. Most support popular open-source models like Llama, Mistral, and Mixtral alongside proprietary offerings.
The competitive landscape has shifted toward speed and cost efficiency. Providers differentiate on time-to-first-token latency, throughput (tokens per second), supported model catalog, and pricing models (per-token, per-request, or reserved capacity). Some specialize in specific hardware like custom ASICs or high-memory GPUs for long-context workloads.
For teams building production applications, the choice of inference provider impacts user experience directly. Factors like geographic availability, uptime SLAs, streaming support, function calling capabilities, and batch inference pricing all matter when selecting a provider.
Frequently Asked Questions
What is an LLM inference API?
An LLM inference API is a hosted service that runs large language model predictions on your behalf. You send prompts via HTTP requests and receive generated text back. The provider handles GPU allocation, model loading, scaling, and optimization, so you can focus on building your application.
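A minimal sketch of what such a request looks like, assuming an OpenAI-compatible chat completions endpoint; the base URL, API key, and model name below are placeholders, not a specific provider's values:

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama-3-8b-instruct",
                       base_url="https://api.example.com/v1"):
    """Build an OpenAI-style chat completion request (URL/model are placeholders)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder credential
        },
    )
    return req, payload

req, payload = build_chat_request("Summarize this article in one sentence.")
# With a real endpoint and key, send it and read the generated text:
# resp = urllib.request.urlopen(req)
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The provider does everything past the HTTP boundary: queuing, GPU scheduling, and decoding.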
How do I choose between inference API providers?
Consider latency requirements, supported models, pricing structure, geographic availability, and features like streaming, function calling, and batch processing. Run benchmarks with your actual workload, as performance varies significantly by model size and prompt length.
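One way to compare providers is to time each request in your benchmark loop and summarize the samples; this sketch shows only the aggregation step (nearest-rank percentiles), with illustrative latency numbers rather than real measurements:

```python
import math
import statistics

def summarize_latencies(samples_ms):
    """Nearest-rank percentile summary of per-request latencies in milliseconds."""
    ordered = sorted(samples_ms)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return {
        "mean": statistics.mean(ordered),
        "p50": statistics.median(ordered),
        "p95": p95,
    }

# In a real benchmark you would wrap each API call with time.monotonic()
# and collect the elapsed times; here we summarize recorded samples:
print(summarize_latencies([820, 640, 710, 1450, 690, 980, 730, 660]))
```

Comparing p95 rather than the mean matters because tail latency is what users notice under load.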
Is it cheaper to self-host or use an inference API?
For most teams, inference APIs are more cost-effective until you reach consistent high-volume usage. Self-hosting requires GPU procurement, ops expertise, and handling idle capacity. APIs let you pay per token and scale elastically. The break-even point depends on your traffic patterns and model choices.
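The break-even arithmetic is straightforward. A sketch with illustrative numbers (the $2,500/month GPU bill and $0.50 per million tokens are assumptions, not real quotes):

```python
def monthly_api_cost(tokens_per_month, price_per_million=0.50):
    """API spend at a per-token price (illustrative pricing)."""
    return tokens_per_month / 1_000_000 * price_per_million

def breakeven_tokens(self_host_monthly=2500.0, price_per_million=0.50):
    """Token volume at which per-token API spend equals a fixed GPU bill."""
    return self_host_monthly / price_per_million * 1_000_000

# With a $2,500/month GPU server and $0.50 per million tokens,
# the break-even point is 5 billion tokens per month:
print(breakeven_tokens())  # 5000000000.0
```

Below that volume the fixed GPU bill exceeds what you would pay per token; above it, self-hosting starts to win, ignoring ops staffing and idle capacity, which usually push the real break-even higher.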
What is time-to-first-token and why does it matter?
Time-to-first-token (TTFT) measures the delay between sending a request and receiving the first token of the response. Lower TTFT creates a more responsive user experience, especially in chat interfaces. It is influenced by model size, hardware, request queue depth, and whether the provider uses speculative decoding or other optimization techniques.
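Measuring TTFT yourself is simple with a streaming response: record the time between sending the request and receiving the first streamed token. A minimal sketch, using a stand-in generator where a real streaming API response would go:

```python
import time

def measure_ttft(stream):
    """Return (seconds until first token, first token) for any token iterable."""
    start = time.monotonic()
    first = next(iter(stream))  # blocks until the first token arrives
    return time.monotonic() - start, first

# Stand-in for a real streaming API response (hypothetical delays):
def fake_stream():
    time.sleep(0.05)  # simulated queue wait + prompt prefill
    yield "Hello"
    yield ", world"

ttft, token = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, first token: {token!r}")
```

With a real provider, the same function works on the token iterator returned by the streaming API; the prefill phase (processing the full prompt before any token is emitted) is usually the dominant component of TTFT for long prompts.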