Optimized inference you actually own. Try Modal Auto Endpoints

Inference you actually own

Your models, Modal's scale.

“Our ML engineers want to use Modal for everything. Modal helped reduce our VLM document parsing latency by 3x and allowed us to scale throughput to >100,000 pages per minute.”

Raunak Chowdhuri, Founder

“Modal powers both our reinforcement learning infrastructure and production inference. Millions of sandboxes on one end, real-time serving on the other.”

Scott Wu, CEO

“Modal makes it unbelievably quick to deploy our models onto scalable infrastructure. We’ve been able to move faster on our last few model launches, including Olmo and Tülu, thanks to the platform.”

Michael Schmitz, Engineering

Deploy optimized inference in seconds.

Kimi K2.6 NVFP4

nvidia/Kimi-K2.6-NVFP4

Large model good for coding, agents, and tool use.

modal endpoint create kimi-k2-6-nvfp4 --model nvidia/Kimi-K2.6-NVFP4

Qwen3.6 35B A3B

Qwen/Qwen3.6-35B-A3B

Model tuned for fast chat, reasoning, and extraction.

modal endpoint create qwen3-6-35b-a3b --model Qwen/Qwen3.6-35B-A3B

Gemma 4 E4B IT

google/gemma-4-E4B-it

Compact instruction model for lightweight workloads.

modal endpoint create gemma-4-e4b-it --model google/gemma-4-E4B-it

Built for performance

Provider A: 290ms
Modal (baseline): 290ms
Provider B: 250ms
Modal (+custom spec): 190ms

4x faster with custom speculator models
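For readers unfamiliar with speculator models, the following is a minimal sketch of greedy speculative decoding, the general technique such models are used for. It is not Modal's implementation: `target`, `draft`, and `speculative_decode` are hypothetical toy stand-ins, and a real system verifies all drafted tokens in one batched forward pass of the large model rather than in a Python loop.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def target(seq: List[int]) -> int:
    """Toy stand-in for the large, slow model (hypothetical)."""
    return (seq[-1] * 7 + 3) % 101


def draft(seq: List[int]) -> int:
    """Toy stand-in for the small speculator: agrees with the target most of the time."""
    guess = target(seq)
    return guess if len(seq) % 4 else (guess + 1) % 101


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (tokens, verification rounds). Output matches greedy decoding with `target`."""
    tokens = list(prompt)
    goal = len(prompt) + max_new
    rounds = 0
    while len(tokens) < goal:
        # 1. The speculator cheaply proposes k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the proposal (a single batched pass in a real system).
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if expected != tok:
                accepted.append(expected)  # first mismatch: keep the target's token, stop
                break
            accepted.append(tok)
        else:
            accepted.append(target(tokens + accepted))  # all accepted: one bonus token
        tokens.extend(accepted)
    return tokens[:goal], rounds


baseline = greedy_decode(target, [1], 20)
speculated, rounds = speculative_decode(target, draft, [1], 20)
print(speculated == baseline, rounds)
```

The output is identical to decoding with the large model alone; the saving comes from needing fewer sequential passes through it. How large that saving is depends on how often the speculator agrees with the target, which is why a speculator tuned to a specific workload can help.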


Engineered for low-latency, high-throughput inference


Optimized for your workload

View the LLM Almanac

Autoscale to thousands of GPUs without reservations


Modal’s Rust-based container stack spins up GPUs in < 1s.


Modal autoscales up and down for maximum cost efficiency.
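As a rough illustration of what scaling up and down means, here is a minimal sketch of a target-tracking rule that sizes a replica pool from the number of in-flight requests. The function name, parameters, and default limits are assumptions made for this example; they do not describe Modal's actual autoscaler or its configuration options.

```python
import math


def desired_replicas(
    in_flight: int,
    target_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 1000,
) -> int:
    """Hypothetical rule: enough replicas to keep each at or under its target load."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(in_flight / target_per_replica)
    # Clamp to the configured bounds; min_replicas=0 allows scaling to zero when idle.
    return max(min_replicas, min(max_replicas, needed))


print(desired_replicas(in_flight=17, target_per_replica=8))
```

With no traffic the rule returns the minimum, so idle capacity is released; under a burst it grows until it reaches the maximum.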


Modal’s proprietary cloud capacity orchestrator guarantees high GPU availability.

Fast and affordable

Unbeatable cost for batch inference

Save 50%+ on high-throughput, short-context tasks compared to API providers.

Sub-10ms network latency for online inference

A global GPU fleet runs close to your users, wherever they are, with support for inference optimizations like prefill disaggregation and prefix-aware routing.
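To make prefix-aware routing concrete, here is a minimal sketch of the idea: requests that share a prompt prefix (for example, the same system prompt) go to the same replica, so that replica can reuse the work it has already cached for the prefix. The function name, the replica names, and the 64-character prefix length are assumptions for illustration. Production routers typically also weigh replica load and actual cache contents, and this is not a description of Modal's router.

```python
import hashlib
from typing import List


def route_by_prefix(prompt: str, replicas: List[str], prefix_chars: int = 64) -> str:
    """Pick a replica from a stable hash of the prompt's leading characters."""
    if not replicas:
        raise ValueError("at least one replica is required")
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    # The first 8 bytes of the digest give a stable integer across processes.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-a", "replica-b", "replica-c"]  # hypothetical replica names
system_prompt = "You are a careful assistant that extracts fields from invoices. " * 2
first = route_by_prefix(system_prompt + "Invoice #1 ...", replicas)
second = route_by_prefix(system_prompt + "Invoice #2 ...", replicas)
print(first == second)
```

Both requests land on the same replica because their first 64 characters are identical, even though the rest of each prompt differs.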

Everything you need for production-grade deployments




Ship your first app in minutes.

Get Started

$30 / month free compute