vLLM Inference Server
Deploy high-throughput LLM inference with vLLM. Generates server configurations, model loading scripts, PagedAttention tuning, LoRA adapter serving, structured output schemas, and GPU memory optimization.
This skill helps you serve LLMs in production with vLLM's high-throughput inference engine. It configures OpenAI-compatible API servers, tunes PagedAttention for optimal GPU memory usage, sets up multi-LoRA adapter serving, implements structured output with guided decoding, configures tensor parallelism for multi-GPU, creates Docker deployments with CUDA, and benchmarks throughput. Covers Hugging Face model loading, quantization (AWQ/GPTQ), and speculative decoding.
When to use
Use when self-hosting LLMs for production inference, optimizing GPU memory with PagedAttention, serving multiple LoRA adapters, or deploying OpenAI-compatible API endpoints.
Examples
Production server
Deploy a model with optimized settings
Configure a vLLM server for Llama 3.1 70B with AWQ quantization, tensor parallelism across 2 GPUs, structured output, and Docker deployment with health checks
Multi-LoRA serving
Serve multiple fine-tuned adapters
Set up vLLM to serve a base model with 5 LoRA adapters for different use cases, with dynamic adapter loading and per-request adapter selection via the API