Kimi K2.6 NVFP4
nvidia/Kimi-K2.6-NVFP4
Large model good for coding, agents, and tool use.
modal endpoint create kimi-k2-6-nvfp4 --model nvidia/Kimi-K2.6-NVFP4
“Modal lets us move fast while keeping full control over our models and serving stack. The flexibility meant we could train high-accuracy models and hit the real-time performance our product demands.”
“We use Modal to run edge inference with <10ms overhead and batch jobs at large scale. Our team loves the platform for the power and flexibility it gives us.”
Stay in your application code. Modal handles scaling, serving, and infrastructure behind the scenes.
Serve open source or custom models with Python. Easily keep ML dependencies and GPU requirements in sync with application code.
Optimized and highly tunable infrastructure for low-latency serving and routing.
Instantly scale to 1000+ GPUs during traffic spikes, then back down to 0 when idle. No commitments, no waits.
Qwen/Qwen3.6-35B-A3B
Model tuned for fast chat, reasoning, and extraction.
modal endpoint create qwen3-6-35b-a3b --model Qwen/Qwen3.6-35B-A3B
google/gemma-4-E4B-it
Compact instruction model for lightweight workloads.
modal endpoint create gemma-4-e4b-it --model google/gemma-4-E4B-it
Select from our full catalog of models, or bring your own weights from Hugging Face or a Volume.
Real-time
Sub-10ms latency overhead from anywhere, backed by our globally distributed compute. Out-of-the-box support for token streaming, WebRTC, and WebSockets.
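To make token streaming concrete, here is a minimal, generic sketch of how a server might frame tokens as they are generated. It assumes Server-Sent Events framing and a `[DONE]` end marker, neither of which is stated above, and `stream_tokens` is a hypothetical helper rather than part of Modal's API.

```python
import json
from typing import Iterable, Iterator


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE-style frame per token, then a final end-of-stream frame.

    Assumption: the client expects `data: <json>` frames separated by a
    blank line and treats `[DONE]` as the end marker.
    """
    for tok in tokens:
        # Each frame is sent as soon as its token exists, so the client
        # can render partial output instead of waiting for the full reply.
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for frame in stream_tokens(["Hel", "lo"]):
        print(frame, end="")
```

Because the function is a generator, nothing is buffered: each frame can be written to the connection the moment the model emits a token.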
Dynamically batched
Add one line of code to accumulate requests and process them in dynamically-sized batches.
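The idea behind dynamic batching can be sketched with the Python standard library alone. This is an illustration of the technique, not Modal's implementation or API: `DynamicBatcher`, `max_batch_size`, `wait_ms`, and `fake_model` are hypothetical names chosen for the example. Requests accumulate until the batch is full or a short wait expires, and the whole batch is then processed in one call.

```python
import asyncio


class DynamicBatcher:
    """Group individual requests into batches for a list-in, list-out function."""

    def __init__(self, batch_fn, max_batch_size=4, wait_ms=50):
        self.batch_fn = batch_fn            # takes a list of inputs, returns a list of outputs
        self.max_batch_size = max_batch_size
        self.wait_s = wait_ms / 1000
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Worker loop: build a batch, run it, hand results back to callers."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]          # block until the first request arrives
            deadline = loop.time() + self.wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break                             # wait expired: run a partial batch
            outputs = self.batch_fn([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def main():
    sizes = []

    def fake_model(prompts):
        # Stand-in for a real batched model call.
        sizes.append(len(prompts))
        return [p.upper() for p in prompts]

    batcher = DynamicBatcher(fake_model, max_batch_size=4, wait_ms=50)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"req{i}") for i in range(6)))
    worker.cancel()
    return results, sizes


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The two knobs trade latency for throughput: a larger batch size uses the GPU more efficiently, while a shorter wait bounds how long any single request is held back.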
Offline batched
Run inference on millions of inputs in record time—Modal scales instantly to thousands of GPUs.
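The fan-out pattern behind offline batch inference can be sketched on a single machine. This stand-in uses a local thread pool where a platform would spread the chunks across many GPU workers. `run_inference`, `offline_batch`, and `chunk_size` are hypothetical names, and the word-count "model" is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def run_inference(batch):
    # Placeholder for a real model call: returns a word count per input.
    return [len(text.split()) for text in batch]


def offline_batch(inputs, chunk_size=1000, workers=8):
    """Split inputs into chunks, process chunks in parallel, keep input order."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns chunk results in submission order.
        for out in pool.map(run_inference, chunked(inputs, chunk_size)):
            results.extend(out)
    return results


if __name__ == "__main__":
    print(offline_batch(["a b", "c", "d e f"], chunk_size=2, workers=2))
```

Chunking keeps per-task overhead small relative to the work done, and preserving order means results line up with the original inputs.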
