BentoML Model Serving
Deploy ML models as APIs with BentoML. Generates service definitions, runner configurations, model packaging, adaptive batching, GPU inference pipelines, and BentoCloud deployment configs.
This skill helps you serve machine learning models in production with BentoML. It generates Service definitions with API endpoints, configures Runners for CPU/GPU inference, packages models into Bentos with their dependencies, implements adaptive batching to raise throughput, creates multi-model inference pipelines, and deploys to BentoCloud or Kubernetes. It covers serving PyTorch, TensorFlow, Hugging Face Transformers, and ONNX models.
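To make the batching idea concrete, here is a minimal, library-free sketch of request batching: wait for one request, then gather more until the batch is full or a latency budget runs out, and run the model once over the whole batch. It does not use BentoML's API. The names collect_batch, fake_model, and demo are hypothetical, and the fixed latency window is a simplifying assumption; an adaptive batcher would tune the window and batch size from observed traffic.

```python
import queue
import time
from typing import List, Sequence


def collect_batch(pending: "queue.Queue[str]", max_batch_size: int,
                  max_latency_ms: float) -> List[str]:
    """Block for the first request, then gather more until the batch is
    full or the latency budget is spent."""
    batch = [pending.get()]  # wait for at least one request
    deadline = time.monotonic() + max_latency_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # latency budget spent: ship what we have
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived within the window
    return batch


def fake_model(texts: Sequence[str]) -> List[int]:
    """Hypothetical stand-in for a model: one call over the whole batch."""
    return [len(t) for t in texts]


def demo() -> List[List[int]]:
    """Push five requests through the batcher with a batch size of two."""
    pending: "queue.Queue[str]" = queue.Queue()
    for text in ["a", "bb", "ccc", "dddd", "eeeee"]:
        pending.put(text)
    outputs = []
    while not pending.empty():
        batch = collect_batch(pending, max_batch_size=2, max_latency_ms=5)
        outputs.append(fake_model(batch))
    return outputs


if __name__ == "__main__":
    print(demo())
```

The two knobs trade throughput against latency: a larger batch size amortizes per-call overhead on the GPU, while a shorter latency window bounds how long the first request in a batch has to wait.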
When to use
Use when deploying ML models as APIs, optimizing inference throughput with batching, creating multi-model pipelines, or packaging models for production serving.
Examples
LLM serving
Deploy a Hugging Face model as an API
Create a BentoML service that serves a Hugging Face text-generation model with GPU inference, adaptive batching, streaming responses, and API key authentication
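As a rough illustration of two pieces of that prompt, the sketch below shows an API key check that runs before any generation work, followed by a generator that streams output one chunk at a time. It is plain Python rather than BentoML code. EXPECTED_API_KEY, is_authorized, fake_generate, and handle_request are hypothetical names, and fake_generate is a stand-in for a real text-generation model.

```python
import hmac
from typing import Iterator

# Hypothetical placeholder; a real service would load this from a secret store.
EXPECTED_API_KEY = "change-me"


def is_authorized(presented_key: str,
                  expected_key: str = EXPECTED_API_KEY) -> bool:
    """Compare keys in constant time to avoid leaking timing information."""
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())


def fake_generate(prompt: str, max_new_tokens: int = 3) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one chunk at a time."""
    for i in range(max_new_tokens):
        yield f"{prompt}-{i} "


def handle_request(prompt: str, api_key: str) -> Iterator[str]:
    """Reject bad keys up front, then return a stream of chunks."""
    if not is_authorized(api_key):
        raise PermissionError("invalid API key")
    return fake_generate(prompt)


if __name__ == "__main__":
    for chunk in handle_request("hello", "change-me"):
        print(chunk, end="", flush=True)
    print()
```

handle_request is a regular function that returns a generator, not a generator itself, so an invalid key fails immediately instead of on the first read of the stream.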
Multi-model pipeline
Chain multiple models in a serving pipeline
Build a BentoML pipeline that chains an OCR model with a classification model: extract text from images, then classify the document type, with a separate GPU runner for each model
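The chaining described in that prompt can be sketched without any serving framework: the OCR stage turns image bytes into text, and the classification stage maps that text to a document type. The sketch below uses hypothetical stand-ins (fake_ocr, fake_classifier, DocumentPipeline, DocumentResult) in place of real models, and is only meant to show the data flow between the two stages, not how BentoML wires them together.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DocumentResult:
    text: str
    doc_type: str


def fake_ocr(image_bytes: bytes) -> str:
    """Hypothetical stand-in for an OCR model: treats the bytes as text."""
    return image_bytes.decode("utf-8", errors="ignore")


def fake_classifier(text: str) -> str:
    """Hypothetical stand-in for a classifier: simple keyword rules."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "invoice"
    if "receipt" in lowered:
        return "receipt"
    return "other"


class DocumentPipeline:
    """Runs OCR first, then feeds the extracted text to the classifier."""

    def __init__(self, ocr: Callable[[bytes], str],
                 classifier: Callable[[str], str]) -> None:
        self.ocr = ocr
        self.classifier = classifier

    def run(self, image_bytes: bytes) -> DocumentResult:
        text = self.ocr(image_bytes)
        return DocumentResult(text=text, doc_type=self.classifier(text))


if __name__ == "__main__":
    pipeline = DocumentPipeline(fake_ocr, fake_classifier)
    print(pipeline.run(b"INVOICE #123"))
```

Passing each stage in as a callable keeps the stages independent, which mirrors the idea of giving each model its own runner so the two can be scaled and placed on hardware separately.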