
BentoML Model Serving

Deploy ML models as APIs with BentoML. Generates service definitions, runner configurations, model packaging, adaptive batching settings, GPU inference pipelines, and BentoCloud deployment configs.

This skill helps you serve machine learning models in production with BentoML. It generates Service definitions with API endpoints, configures Runners for CPU or GPU inference, packages models and their dependencies into Bentos, sets up adaptive batching to improve throughput, builds multi-model inference pipelines, and prepares deployments to BentoCloud or Kubernetes. It covers PyTorch, TensorFlow, Hugging Face Transformers, and ONNX models.
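To illustrate the idea behind batching for throughput, here is a minimal sketch in plain Python using only asyncio. It is not BentoML's implementation or API: it uses a simplified fixed-window strategy (flush when the batch is full or a short wait expires), whereas an adaptive batcher tunes those limits at runtime. The names `MicroBatcher`, `max_batch_size`, and `max_wait_s` are hypothetical and chosen only for this example.

```python
import asyncio


class MicroBatcher:
    """Hypothetical helper: groups concurrent requests into batches.

    Simplified fixed-window batching, not BentoML's adaptive algorithm.
    """

    def __init__(self, batch_fn, max_batch_size=4, max_wait_s=0.01):
        self.batch_fn = batch_fn            # called with a list of inputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []               # recorded for inspection only

    async def submit(self, item):
        # Each caller gets a future that resolves to its own result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Keep collecting until the batch is full or the wait expires.
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One call to the model function for the whole batch.
            outputs = self.batch_fn([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def main():
    # Stand-in for a model: doubles every input in the batch.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(main())
```

Each concurrent caller awaits `submit` and receives only its own result, while the stand-in model function sees whole batches. That grouping is what lets a GPU amortize per-call overhead across requests.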

bentoml ml-serving inference deployment gpu

When to use

Use when deploying ML models as APIs, optimizing inference throughput with batching, creating multi-model pipelines, or packaging models for production serving.

Examples

LLM serving

Deploy a Hugging Face model as an API

Create a BentoML service that serves a Hugging Face text-generation model with GPU inference, adaptive batching, streaming responses, and API key authentication

Multi-model pipeline

Chain multiple models in a serving pipeline

Build a BentoML pipeline that chains an OCR model with a classification model: extract text from images, then classify the document type, with a separate GPU runner for each model
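The shape of such a two-stage pipeline can be sketched in plain Python. This is only an illustration of the data flow, under the assumption that each stage is a callable; it does not use BentoML, and `DocumentPipeline`, `fake_ocr`, and `fake_classifier` are hypothetical stand-ins for real models that would each run on their own runner.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class DocumentPipeline:
    """Hypothetical two-stage pipeline: OCR first, then classification."""

    ocr: Callable[[bytes], str]        # image bytes -> extracted text
    classify: Callable[[str], str]     # extracted text -> document type

    def predict(self, image: bytes) -> Dict[str, str]:
        text = self.ocr(image)                 # stage 1: extract text
        document_type = self.classify(text)    # stage 2: classify the text
        return {"text": text, "document_type": document_type}


def fake_ocr(image: bytes) -> str:
    # Stand-in for an OCR model: treats the bytes as UTF-8 text.
    return image.decode("utf-8")


def fake_classifier(text: str) -> str:
    # Stand-in for a classifier: a single keyword rule.
    return "invoice" if "invoice" in text.lower() else "other"


pipeline = DocumentPipeline(ocr=fake_ocr, classify=fake_classifier)
result = pipeline.predict(b"Invoice: total due")
```

Keeping the two stages as separate callables mirrors the idea of separate runners: each stage can be scaled, batched, or placed on a different device without changing the pipeline's interface.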