Modal Serverless ML
Run Python on serverless GPUs with Modal — define containers, GPUs, schedules, and web endpoints in pure Python. Covers function decorators, volumes, secrets, async, batching, and cold-start tuning.
This skill helps you deploy Python workloads on Modal's serverless platform. It writes @app.function and @app.cls decorators with image, GPU, and secret specs; configures persistent volumes for model weights; builds web endpoints with FastAPI; implements batched inference with @modal.batched; schedules jobs by passing modal.Cron or modal.Period as the schedule argument of @app.function; reduces cold starts by keeping containers warm; and integrates with Modal Sandboxes for code execution. It also covers cost optimization patterns and when Modal is cheaper than RunPod, Replicate, or SageMaker.
When to use
Use when running LLM inference or fine-tuning on demand, building Python web APIs that need GPU access, running batch jobs that scale to zero, or replacing a Kubernetes GPU cluster with a simpler deployment.
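Whether scale-to-zero pays off depends mostly on how busy the GPU is. The sketch below works through that break-even arithmetic under simplifying assumptions: the hourly rates are made-up placeholders rather than any provider's real prices, and cold starts, idle timeouts, and storage costs are ignored. The function names are hypothetical.

```python
def breakeven_utilization(serverless_rate: float, always_on_rate: float) -> float:
    """Busy fraction above which an always-on GPU becomes cheaper.

    Serverless cost per wall-clock hour is busy_fraction * serverless_rate;
    an always-on instance costs always_on_rate regardless of load.
    """
    return always_on_rate / serverless_rate


def cheaper_option(busy_fraction: float, serverless_rate: float, always_on_rate: float) -> str:
    """Compare effective hourly cost at a given utilization."""
    serverless_cost = busy_fraction * serverless_rate
    return "serverless" if serverless_cost < always_on_rate else "always-on"


# Placeholder rates in dollars per GPU-hour (assumptions, not real prices).
SERVERLESS_RATE = 4.00
ALWAYS_ON_RATE = 2.50

threshold = breakeven_utilization(SERVERLESS_RATE, ALWAYS_ON_RATE)
low_traffic = cheaper_option(0.2, SERVERLESS_RATE, ALWAYS_ON_RATE)
high_traffic = cheaper_option(0.9, SERVERLESS_RATE, ALWAYS_ON_RATE)
```

With these placeholder rates, a workload that keeps the GPU busy 20% of the time favors serverless billing, while one that is busy 90% of the time favors an always-on instance.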
Examples
vLLM endpoint on Modal
Deploy a vLLM-served Llama model on Modal
Deploy a vLLM endpoint on Modal serving Llama 3 70B on A100 GPUs, with a FastAPI web endpoint, a persistent volume for weights, and an OpenAI-compatible API shape
Batch transcription job
Run Whisper transcription as a batch job
Build a Modal batch job that transcribes 10k audio files from S3 with Whisper Large, capped at 50 containers running in parallel, with results written back to S3