Optimized inference you actually own. Try Modal Auto Endpoints

Inference you actually own

Your models, Modal's scale.

“Our ML engineers want to use Modal for everything. Modal helped reduce our VLM document parsing latency by 3x and allowed us to scale throughput to >100,000 pages per minute.”

Raunak Chowdhuri, Founder

“Modal powers both our reinforcement learning infrastructure and production inference. Millions of sandboxes on one end, real-time serving on the other.”

Scott Wu, CEO

“Modal makes it unbelievably quick to deploy our models onto scalable infrastructure. We’ve been able to move faster on our last few model launches, including Olmo and Tülu, thanks to the platform.”

Michael Schmitz, Engineering

Deploy optimized inference in seconds.

Kimi K2.6 NVFP4

nvidia/Kimi-K2.6-NVFP4

Large model good for coding, agents, and tool use.

modal endpoint create kimi-k2-6-nvfp4 --model nvidia/Kimi-K2.6-NVFP4

Qwen3.6 35B A3B

Qwen/Qwen3.6-35B-A3B

Model tuned for fast chat, reasoning, and extraction.

modal endpoint create qwen3-6-35b-a3b --model Qwen/Qwen3.6-35B-A3B

Gemma 4 E4B IT

google/gemma-4-E4B-it

Compact instruction model for lightweight workloads.

modal endpoint create gemma-4-e4b-it --model google/gemma-4-E4B-it
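The three commands above share one shape: an endpoint name followed by `--model` and the Hugging Face model ID. The sketch below rebuilds them in Python. The naming rule (take the repository name, lowercase it, replace dots with hyphens) is only inferred from these three examples and is not a documented requirement; `endpoint_name` and `create_command` are hypothetical helpers, not part of any Modal SDK.

```python
def endpoint_name(model_id: str) -> str:
    """Derive an endpoint name from a Hugging Face model ID.

    Assumption: the rule is inferred from the three commands on this page
    (repository name, lowercased, dots replaced by hyphens).
    """
    repo = model_id.split("/", 1)[-1]  # drop the "org/" prefix
    return repo.lower().replace(".", "-")


def create_command(model_id: str) -> str:
    """Build the CLI command string exactly as it is shown on this page."""
    return f"modal endpoint create {endpoint_name(model_id)} --model {model_id}"


MODELS = [
    "nvidia/Kimi-K2.6-NVFP4",
    "Qwen/Qwen3.6-35B-A3B",
    "google/gemma-4-E4B-it",
]

if __name__ == "__main__":
    for model_id in MODELS:
        print(create_command(model_id))
```

Any name you choose for the endpoint should work equally well here; the derived name simply mirrors the examples.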

Open or custom models, deployed with open inference engines and SOTA optimization.

Select from our full catalog of models, or bring your own weights from Hugging Face or a Volume.

GLM 5.2 FP8

DeepSeek V4 Pro

Gemma 4 26B A4B IT

Gemma 4 31B IT

NVIDIA Nemotron 3 Super 120B A12B NVFP4

GPT-OSS 120B

Qwen3.5 397B A17B FP8

Qwen3.6 27B

Built for performance

View the LLM Almanac

4x faster with custom speculator models

Engineered for low-latency, high-throughput inference

Optimized for your workload

Provider A: 290ms
Modal (baseline): 290ms
Provider B: 250ms
Modal (+custom spec): 190ms
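"Custom speculator" most likely refers to speculative decoding, where a small draft model proposes several tokens and the large target model checks them. The toy sketch below shows the greedy variant of that idea under stated assumptions: both models are plain Python functions that return the next token, and all names (`speculative_decode`, `greedy_decode`, `k`) are hypothetical. It is not Modal's implementation.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its next token


def greedy_decode(target: NextToken, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Reference: plain greedy decoding, one target call per new token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> List[Token]:
    """Greedy speculative decoding: same output as greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real engine does
        #    this in one batched forward pass; here it is a simple loop.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, stop here
                break
        else:
            # Every draft token matched, so the same target pass also
            # yields one extra token for free.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))

        out.extend(accepted)
    return out
```

The speedup comes from step 2: when the draft agrees with the target often, each expensive target pass yields several tokens instead of one, and the output is unchanged because every token is still the target's own choice.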

Autoscale to thousands of GPUs without reservations

Modal’s Rust-based container stack spins up GPUs in < 1s.

Modal autoscales up and down for max cost efficiency.

Modal’s proprietary cloud capacity orchestrator guarantees high GPU availability.
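To make "autoscales up and down" concrete, here is a minimal sketch of one common policy: size the fleet so each replica handles a target number of in-flight requests, and allow scaling to zero when idle. The policy, the function name `desired_replicas`, and every parameter are assumptions for illustration; this page does not describe Modal's actual autoscaling algorithm.

```python
import math


def desired_replicas(
    in_flight: int,
    target_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 1000,
) -> int:
    """Return how many replicas a target-concurrency policy would ask for.

    Hypothetical policy: ceil(in_flight / target_per_replica), clamped to
    the range [min_replicas, max_replicas].
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(in_flight / target_per_replica)
    return max(min_replicas, min(max_replicas, needed))
```

With `min_replicas=0`, an idle endpoint asks for no GPUs at all, which is where fast container start-up matters: the first request after an idle period has to wait for a replica to come up.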

Fast and affordable

Unbeatable cost for batch inference

Save 50%+ on high-throughput, short-context tasks compared to API providers.

Sub-10ms network latency for online inference

A global GPU fleet runs close to your users, wherever they are, with support for inference optimizations such as prefill disaggregation and prefix-aware routing.
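Prefix-aware routing sends requests that share a prompt prefix (for example, the same system prompt) to the same replica, so that replica can reuse cached work for the shared part. A minimal sketch of the idea, assuming a fixed number of replicas and character-level prefixes; `pick_replica` and `prefix_chars` are hypothetical names, and production routers typically work on tokens and also weigh replica load.

```python
import hashlib


def pick_replica(prompt: str, num_replicas: int, prefix_chars: int = 256) -> int:
    """Map a prompt to a replica index based only on its leading characters.

    Prompts that agree on their first `prefix_chars` characters always land
    on the same replica, regardless of what follows.
    """
    if num_replicas <= 0:
        raise ValueError("num_replicas must be positive")
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas
```

Usage: `pick_replica(system_prompt + user_message, num_replicas=8)` returns the same index for every user message as long as the shared system prompt is at least `prefix_chars` long.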

Everything you need for production-grade deployments

Volumes

Load LLM weights quickly from any region.

Observability

Intuitive dashboards help you navigate the health of your deployments.

Enterprise-grade security

SOC 2 and HIPAA compliance, zero data retention, and more.

Ship your first app in minutes.

$30 / month free compute
