Optimized inference you actually own. Try Modal Auto Endpoints
Modal Inference

The fastest way to scale inference

Whether you run low-latency LLM inference or async batch workloads, Modal lets you serve, scale, and optimize inference globally.
LLM Inference

“Modal lets us move fast while keeping full control over our models and serving stack. The flexibility meant we could train high-accuracy models and hit the real-time performance our product demands.”

Decagon, Voice AI team
Edge Inference

“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”

Brian Ichter, Co-founder
World Models

“Modal's infrastructure gave us the performance and reliability we need to ship this in every global region, at production scale.”

Kamil Sindi, CTO of Runway

Code-first inference

Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.


Engineered for inference.

Run any model

Serve open source or custom models with Python. Easily keep ML dependencies and GPU requirements in sync with application code.


Real-time serving

Optimized and highly tunable infrastructure for low-latency serving and routing.

Provider A: 290ms
Modal (baseline): 290ms
Provider B: 250ms
Modal (+custom spec): 190ms
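
To put the figures above in relative terms, here is a small Python sketch that computes how far each entry sits below the slowest one. It assumes the values are latencies in milliseconds (the comparison does not name the metric); the dictionary and the `reduction_vs` helper are illustrative names, not part of any Modal API.

```python
# Figures copied from the comparison above; assumed to be latencies in ms.
latencies_ms = {
    "Provider A": 290,
    "Modal (baseline)": 290,
    "Provider B": 250,
    "Modal (+custom spec)": 190,
}


def reduction_vs(reference_ms: float, value_ms: float) -> float:
    """Return the percentage by which value_ms is lower than reference_ms."""
    return (reference_ms - value_ms) / reference_ms * 100


slowest = max(latencies_ms.values())
for name, ms in latencies_ms.items():
    print(f"{name}: {ms}ms ({reduction_vs(slowest, ms):.1f}% below the slowest)")
```

Under that assumption, the tuned Modal configuration comes out roughly 34% below the 290ms entries and 24% below Provider B.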

Elastic scale

Instantly scale to 1000+ GPUs during traffic spikes, then back down to 0 when idle. No commitments, no waits.
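
As a rough illustration of what scaling up under load and back to zero means, the following Python sketch computes a replica count from current demand. It is a toy rule under stated assumptions, not Modal's scheduling logic: `desired_replicas`, `per_replica`, and `max_replicas` are hypothetical names chosen for this example.

```python
import math


def desired_replicas(in_flight: int, per_replica: int, max_replicas: int) -> int:
    """Toy scaling rule: enough replicas for current load, zero when idle.

    Hypothetical helper for illustration; not Modal's scheduling logic.
    """
    if per_replica <= 0:
        raise ValueError("per_replica must be positive")
    if in_flight <= 0:
        return 0  # scale to zero when there is no traffic
    # Round up so partial load still gets a replica, then apply the cap.
    return min(max_replicas, math.ceil(in_flight / per_replica))


# Idle, light load, and a spike that hits the cap.
for load in (0, 5, 40_000):
    print(load, "->", desired_replicas(load, per_replica=8, max_replicas=1000))
```

The point of the sketch is the shape of the behavior: no traffic means no replicas, and a spike is served up to a configured ceiling.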

Deploy optimized inference in seconds.

Kimi K2.6 NVFP4

nvidia/Kimi-K2.6-NVFP4

A large model suited to coding, agents, and tool use.

modal endpoint create kimi-k2-6-nvfp4 --model nvidia/Kimi-K2.6-NVFP4
Deploy this model

Qwen3.6 35B A3B

Qwen/Qwen3.6-35B-A3B

Model tuned for fast chat, reasoning, and extraction.

modal endpoint create qwen3-6-35b-a3b --model Qwen/Qwen3.6-35B-A3B
Deploy this model

Gemma 4 E4B IT

google/gemma-4-E4B-it

Compact instruction model for lightweight workloads.

modal endpoint create gemma-4-e4b-it --model google/gemma-4-E4B-it
Deploy this model
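
The three commands above share a naming pattern: the endpoint name looks like the model ID with the organization prefix dropped, lowercased, and with dots turned into hyphens. The Python sketch below reproduces that pattern for scripting. The rule is inferred from these three examples only, so treat it as an assumption rather than a documented convention; `endpoint_name` and `create_command` are hypothetical helpers.

```python
import re
import shlex


def endpoint_name(model_id: str) -> str:
    """Derive an endpoint name from a model ID (pattern inferred from the examples)."""
    repo = model_id.split("/")[-1].lower()
    # Collapse any run of characters outside [a-z0-9] into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", repo).strip("-")


def create_command(model_id: str) -> str:
    """Build the CLI command string shown on this page for a given model ID."""
    return f"modal endpoint create {endpoint_name(model_id)} --model {shlex.quote(model_id)}"


for model in (
    "nvidia/Kimi-K2.6-NVFP4",
    "Qwen/Qwen3.6-35B-A3B",
    "google/gemma-4-E4B-it",
):
    print(create_command(model))
```

The helper only builds the command string; it does not run the CLI or check that the model exists.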

Infrastructure optimized for every deployment pattern




Get clear insight into production deployments



Built with Modal

Ship your first app in minutes.

Get Started

$30 / month free compute