Kimi K2.6 NVFP4
nvidia/Kimi-K2.6-NVFP4
Large model good for coding, agents, and tool use.
modal endpoint create kimi-k2-6-nvfp4 --model nvidia/Kimi-K2.6-NVFP4
“Our ML engineers want to use Modal for everything. Modal helped reduce our VLM document parsing latency by 3x and allowed us to scale throughput to >100,000 pages per minute.”
“Modal powers both our reinforcement learning infrastructure and production inference. Millions of sandboxes on one end, real-time serving on the other.”
nvidia/Kimi-K2.6-NVFP4
Large model good for coding, agents, and tool use.
modal endpoint create kimi-k2-6-nvfp4 --model nvidia/Kimi-K2.6-NVFP4
Qwen/Qwen3.6-35B-A3B
Model tuned for fast chat, reasoning, and extraction.
modal endpoint create qwen3-6-35b-a3b --model Qwen/Qwen3.6-35B-A3B
google/gemma-4-E4B-it
Compact instruction model for lightweight workloads.
modal endpoint create gemma-4-e4b-it --model google/gemma-4-E4B-it
Select from our full catalog of models, or bring your own weights from Hugging Face or a Volume.
4x faster with custom speculator models
Engineered for low-latency, high-throughput inference
Optimized for your workload
Modal’s Rust-based container stack spins up GPUs in < 1s.
Modal autoscales up and down for max cost efficiency.
Modal’s proprietary cloud capacity orchestrator guarantees high GPU availability.
Unbeatable cost for batch inference
Save 50%+ on high-throughput, short-context tasks compared to API providers.

Sub-10ms network latency for online inference
Global GPU fleet runs close to your users, wherever they are. Support for inference optimizations like prefill disaggregation and prefix-aware routing.
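As a rough illustration of what prefix-aware routing means, the sketch below sends requests that share the same leading prompt text to the same replica, so that replica can reuse work it has already cached for that prefix. This is not Modal's implementation: the function name route_by_prefix, the replica names, and the 256-character prefix window are all hypothetical, and a production router would more likely match on token prefixes and weigh replica load as well.

```python
import hashlib


def route_by_prefix(prompt: str, replicas: list[str], prefix_chars: int = 256) -> str:
    """Pick a replica from a hash of the prompt's leading characters.

    Hypothetical helper for illustration only; not part of any Modal API.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    # Only the leading characters take part in the routing decision, so
    # prompts that share a long system prompt hash to the same value.
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest onto the replica list.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


# Two requests with the same long system prompt but different questions.
replicas = ["replica-a", "replica-b", "replica-c"]  # hypothetical names
system_prompt = "You are a helpful assistant. " * 20
first = route_by_prefix(system_prompt + "Summarize this invoice.", replicas)
second = route_by_prefix(system_prompt + "Translate this email.", replicas)
```

Both requests resolve to the same replica because their first 256 characters are identical. Plain modulo hashing reshuffles most prefixes whenever the replica count changes, so a consistent-hashing scheme is the usual choice when replicas scale up and down.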
Volumes
Load LLM weights quickly from any region.
Observability
Intuitive dashboards help you navigate the health of your deployments.
Enterprise-grade security
SOC 2 and HIPAA compliance, zero data retention, and more.