Optimized inference you actually own. Try Modal Auto Endpoints

Modal Inference

The fastest way to scale inference

Whether you run low-latency LLM inference or async batch workloads, Modal lets you serve, scale, and optimize inference globally.

Get Started

Read the docs

LLM Inference

“Modal lets us move fast while keeping full control over our models and serving stack. The flexibility meant we could train high-accuracy models and hit the real-time performance our product demands.”

Decagon, Voice AI team

Edge Inference

“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”

Brian Ichter, Co-founder

World Models

“Modal's infrastructure gave us the performance and reliability we need to ship this in every global region, at production scale.”

Kamil Sindi, CTO of Runway

Code-first inference

Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.

Engineered for inference.

Run any model

Serve open source or custom models with Python. Easily keep ML dependencies and GPU requirements in sync with application code.

Real-time serving

Optimized and highly tunable infrastructure for low-latency serving and routing.

Provider A: 290ms
Modal (baseline): 290ms
Provider B: 250ms
Modal (+custom spec): 190ms

Elastic scale

Instantly scale to 1000+ GPUs during traffic spikes, then back down to 0 when idle. No commitments, no waits.
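To make the scale-up and scale-to-zero behavior concrete, here is a minimal sketch of how an autoscaler might pick a replica count from current load. It is a generic illustration, not Modal's scheduler: the function name `desired_replicas` and the parameters `per_replica_concurrency` and `max_replicas` are assumptions made for this example.

```python
import math


def desired_replicas(in_flight: int, per_replica_concurrency: int, max_replicas: int) -> int:
    """Return how many GPU replicas to run for the current number of in-flight requests.

    Hypothetical policy: enough replicas to cover the load, capped at max_replicas,
    and zero replicas when there is no traffic at all.
    """
    if per_replica_concurrency <= 0:
        raise ValueError("per_replica_concurrency must be positive")
    if in_flight <= 0:
        return 0  # scale to zero when idle
    needed = math.ceil(in_flight / per_replica_concurrency)
    return min(needed, max_replicas)


if __name__ == "__main__":
    for load in (0, 5, 800, 50_000):
        print(load, desired_replicas(load, per_replica_concurrency=8, max_replicas=1000))
    # 0 0
    # 5 1
    # 800 100
    # 50000 1000
```

A production autoscaler would also smooth over short bursts and account for container start time; this sketch only shows the target calculation.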

Deploy optimized inference in seconds.

Browse all models

View docs

Kimi K2.6 NVFP4

nvidia/Kimi-K2.6-NVFP4

Large model good for coding, agents, and tool use.

modal endpoint create kimi-k2-6-nvfp4 --model nvidia/Kimi-K2.6-NVFP4

Qwen3.6 35B A3B

Qwen/Qwen3.6-35B-A3B

Model tuned for fast chat, reasoning, and extraction.

modal endpoint create qwen3-6-35b-a3b --model Qwen/Qwen3.6-35B-A3B

Gemma 4 E4B IT

google/gemma-4-E4B-it

Compact instruction model for lightweight workloads.

modal endpoint create gemma-4-e4b-it --model google/gemma-4-E4B-it

Open or custom models, deployed with open inference engines and SOTA optimization.

Select from our full catalog of models, or bring your own weights from Hugging Face or a Volume.

GLM 5.2 FP8

DeepSeek V4 Pro

Gemma 4 26B A4B IT

Gemma 4 31B IT

NVIDIA Nemotron 3 Super 120B A12B NVFP4

GPT-OSS 120B

Qwen3.5 397B A17B FP8

Qwen3.6 27B

Infrastructure optimized for every deployment pattern

Real-time

Sub-10ms latency overhead from anywhere, thanks to globally distributed compute. Out-of-the-box support for token streaming, WebRTC, and WebSockets.

Dynamically batched

Add one line of code to accumulate requests and process them in dynamically sized batches.
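For readers who want to see what dynamic batching does under the hood, here is a minimal, self-contained sketch using only `asyncio`. It is not Modal's API or implementation: the `DynamicBatcher` class, its `max_batch_size` and `wait_ms` parameters, and the `fake_model` function are all hypothetical names chosen for this illustration. Requests accumulate until the batch is full or a short wait window expires, then the whole batch goes to the model in one call.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional


class DynamicBatcher:
    """Hypothetical helper: groups individual requests into batches for one model call."""

    def __init__(self, fn: Callable[[List[Any]], Awaitable[List[Any]]],
                 max_batch_size: int = 4, wait_ms: int = 10) -> None:
        self._fn = fn
        self._max = max_batch_size
        self._wait = wait_ms / 1000
        self._queue: Optional[asyncio.Queue] = None
        self._worker: Optional[asyncio.Task] = None

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its individual result."""
        if self._queue is None:
            # Create the queue and worker lazily so they bind to the running loop.
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((item, fut))
        return await fut

    async def _run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives, then open a wait window.
            batch = [await self._queue.get()]
            deadline = loop.time() + self._wait
            while len(batch) < self._max:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                results = await self._fn([item for item, _ in batch])
                for (_, fut), result in zip(batch, results):
                    fut.set_result(result)
            except Exception as exc:  # propagate a failed batch to every caller
                for _, fut in batch:
                    if not fut.done():
                        fut.set_exception(exc)


batch_sizes: List[int] = []  # records how many requests each model call received


async def fake_model(batch: List[int]) -> List[int]:
    """Stand-in for a real model: doubles each input."""
    batch_sizes.append(len(batch))
    return [x * 2 for x in batch]


async def main() -> List[int]:
    batcher = DynamicBatcher(fake_model, max_batch_size=4, wait_ms=20)
    return await asyncio.gather(*(batcher.submit(i) for i in range(10)))


if __name__ == "__main__":
    print(asyncio.run(main()))
    print(batch_sizes)
```

The two knobs trade latency for throughput: a larger batch size or a longer wait window packs more work into each model call, but the first request in a batch waits longer.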

Offline batched

Run inference on millions of inputs in record time. Modal scales instantly to thousands of GPUs.
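The offline pattern boils down to splitting a large input set into chunks and fanning those chunks out across many workers. The sketch below shows that shape with a local thread pool standing in for remote GPU workers. It is an illustration under stated assumptions, not Modal's API: `chunked`, `run_inference`, and `process_all` are hypothetical names, and `run_inference` is a placeholder for a real model call.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def run_inference(chunk: List[str]) -> List[int]:
    """Placeholder for a model call: returns the length of each input string."""
    return [len(text) for text in chunk]


def process_all(inputs: Iterable[str], chunk_size: int = 1000, workers: int = 8) -> List[int]:
    """Fan chunks out to a worker pool and collect results in input order."""
    results: List[int] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the same order the chunks were submitted.
        for chunk_result in pool.map(run_inference, chunked(inputs, chunk_size)):
            results.extend(chunk_result)
    return results


if __name__ == "__main__":
    print(process_all(["a", "bb", "ccc", "dddd", "eeeee"], chunk_size=2, workers=2))
    # [1, 2, 3, 4, 5]
```

Chunk size is the main tuning knob: larger chunks amortize per-call overhead, while smaller chunks spread work more evenly across workers.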

Get clear insight into production deployments

Rich dashboard interface

Track the overall health and resource usage of deployed models.

Detailed logging

Debug fast by zooming into metrics, logs, and live statuses of specific inference calls.

Built with Modal

All examples

Transcribe speech in batches with Whisper

Turn audio bytes into text at scale

Edit images with Flux Kontext

Transform images with SotA diffusion models

Serverless WebRTC

Stream YOLO detections on webcam footage in real time

Transcribe speech with Kyutai STT

Stream transcripts at the speed of speech

RAG Chat with PDFs

Use ColBERT-style, multimodal embeddings with a Vision-Language Model to answer questions about documents

Fold proteins with Chai-1

Predict molecular structures from sequences with SotA open source models

Document OCR job queue

Use Modal as an infinitely scalable job queue that can service async tasks from a web app

Deploy a TTS API with Chatterbox

Serve text-to-speech with Chatterbox to generate natural audio from text

Generate videos with Mochi

Use Mochi to generate short AI-powered videos from prompts

Embed documents with TEI

Generate text embeddings at scale with Hugging Face's Text Embeddings Inference (TEI)

Your end-to-end ML lifecycle in one place

Seamlessly integrate data pre-processing, training, and serving.

Ship your first app in minutes.

Get Started

$30 / month free compute
