Modal Serverless ML

Run Python on serverless GPUs with Modal: define containers, GPUs, schedules, and web endpoints in pure Python. Covers function decorators, volumes, secrets, async, batching, and cold-start tuning.

This skill helps you deploy Python workloads on Modal's serverless platform. It writes @app.function and @app.cls decorators with image, GPU, and secret specs; configures persistent volumes for model weights; builds web endpoints with FastAPI; implements batched inference with @modal.batched; schedules jobs through the schedule argument of @app.function (modal.Cron or modal.Period); reduces cold starts by keeping containers warm; and integrates with Modal Sandboxes for code execution. It also covers cost optimization patterns and when Modal is cheaper than RunPod, Replicate, or SageMaker.
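As a rough illustration of the decorators described above, here is a minimal sketch of a Modal app definition. It assumes the `modal` package is installed and authenticated. The app name, volume name, secret name, and function bodies are hypothetical placeholders, not part of the skill itself. Note that schedules are passed through the `schedule` argument of `@app.function` rather than a separate decorator.

```python
import modal

app = modal.App("example-inference")  # hypothetical app name

# Container image: Debian slim plus the Python packages the function needs.
image = modal.Image.debian_slim().pip_install("torch")

# Named persistent volume for model weights (hypothetical name).
weights = modal.Volume.from_name("model-weights", create_if_missing=True)


@app.function(
    image=image,
    gpu="A100",
    volumes={"/weights": weights},  # mounted at /weights inside the container
    secrets=[modal.Secret.from_name("huggingface-secret")],  # hypothetical secret
)
def generate(prompt: str) -> str:
    # Placeholder body; a real function would load weights from /weights.
    return prompt.upper()


@app.function(schedule=modal.Cron("0 3 * * *"))  # daily at 03:00 UTC
def nightly_job() -> None:
    # Placeholder for a scheduled task such as refreshing cached weights.
    print("running nightly job")


@app.local_entrypoint()
def main() -> None:
    # Runs locally and calls the GPU function remotely.
    print(generate.remote("hello"))
```

Saved as a file, this would typically be started with `modal run` for a one-off execution or `modal deploy` so the schedule takes effect. Keeping containers warm is usually configured with a minimum-container setting on `@app.function`; the parameter name has changed between Modal releases, so check the version you have installed.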

Tags: modal, serverless, gpu, python, ml

When to use

Use when running LLM inference or fine-tuning on demand, building Python web APIs that need GPU access, running batch processing jobs that scale to zero, or replacing a Kubernetes GPU cluster with a simpler deployment.

Examples

vLLM endpoint on Modal

Deploy a vLLM-served Llama model on Modal

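Deploy a vLLM endpoint on Modal serving Llama 3 70B on A100 GPUs (the 16-bit weights alone are roughly 140 GB, so this needs more than one 80 GB A100 or a quantized checkpoint), with a FastAPI web endpoint, a persistent volume for weights, and an OpenAI-compatible API shape.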
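Before picking a GPU count for a prompt like this one, it helps to estimate how much memory the weights need. The sketch below is back-of-the-envelope arithmetic only. The helper names and the 90% usable-memory fraction are assumptions for illustration, and the 70e9 parameter count is the nominal figure for a 70B model.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus(num_params: float, bytes_per_param: float,
             gpu_memory_gb: float, usable_fraction: float = 0.9) -> int:
    """Smallest GPU count whose usable memory holds the weights.

    usable_fraction is an assumed headroom factor; the KV cache and
    activations need additional memory on top of the weights.
    """
    needed = weight_memory_gb(num_params, bytes_per_param)
    return math.ceil(needed / (gpu_memory_gb * usable_fraction))


PARAMS_70B = 70e9  # nominal parameter count

print(weight_memory_gb(PARAMS_70B, 2))   # 16-bit weights: 140.0 GB
print(min_gpus(PARAMS_70B, 2, 80))       # 80 GB A100s at 16-bit: 2
print(min_gpus(PARAMS_70B, 0.5, 80))     # 4-bit quantized: 1
```

Two 80 GB GPUs hold the 16-bit weights but leave little room for the KV cache, so a deployment serving long contexts or many concurrent requests may want more GPUs than this lower bound.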

Batch transcription job

Run Whisper transcription as a batch job

Build a Modal batch job that transcribes 10k audio files from S3 with Whisper Large, running up to 50 workers in parallel, and writes the results back to S3.
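The shape of such a job is a bounded fan-out: many inputs, a fixed cap on how many run at once. The sketch below shows that pattern locally with a thread pool as a stand-in. It is not Modal code; on Modal the same idea is usually expressed by mapping a function over the inputs with a cap on the number of containers. The `transcribe` stub and the S3 key names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe(key: str) -> tuple[str, str]:
    """Stub worker: a real one would download the audio, run Whisper,
    and upload the transcript."""
    return key, f"transcript of {key}"


def run_batch(keys: list[str], max_workers: int = 50) -> dict[str, str]:
    """Process every key with at most max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order.
        return dict(pool.map(transcribe, keys))


keys = [f"audio/{i:05d}.wav" for i in range(10_000)]  # hypothetical S3 keys
results = run_batch(keys)
print(len(results))  # 10000
```

Having each worker return a small record (key and transcript) rather than the audio itself keeps the data passed between the driver and the workers small, which matters more once the workers are remote containers.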
