LLMOps¶
LLMOps adapts MLOps practices for LLM-specific workflows: prompt management, model serving, evaluation, monitoring, and cost control. The key difference from traditional MLOps is that iteration happens through prompt editing and RAG tuning, not model retraining.
Key Facts¶
- Primary iteration loop: edit prompt -> run evaluation suite -> compare metrics -> deploy if improved
- Biggest cost driver: token usage, especially output tokens
- LLM-as-Judge is the most scalable automated evaluation method
- Structured logging of every request is essential; disk is cheap
- Prompt changes need regression testing just like code changes
LLMOps vs Traditional MLOps¶
| Aspect | Traditional MLOps | LLMOps |
|---|---|---|
| Training | Custom model training | Prompt engineering or fine-tuning |
| Deployment | Model binary + inference server | API calls or self-hosted models |
| Versioning | Model weights + data | Prompts + configs + adapter weights |
| Evaluation | Fixed metrics (accuracy, F1) | LLM-as-judge, human eval, task-specific |
| Cost | Compute for training/inference | Token-based API pricing |
| Iteration | Retrain model | Edit prompt, adjust RAG |
Evaluation Pipeline¶
LLM-as-Judge Pattern¶
import json

def evaluate_response(question, response, reference, judge_llm):
    prompt = f"""Rate this response on a 1-5 scale:
Question: {question}
Response: {response}
Reference: {reference}
Rate on: accuracy, completeness, clarity.
Output JSON only: {{"accuracy": X, "completeness": X, "clarity": X}}"""
    # Parse the judge's JSON output so scores can be compared and aggregated
    return json.loads(judge_llm.invoke(prompt))
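Judge scores can then be aggregated across a test set. A minimal sketch, assuming the judge returns the JSON object above as a string (the helper and example data are illustrative):

```python
import json

def aggregate_judge_scores(raw_judgments):
    """Average each criterion across a set of judge outputs (JSON strings)."""
    criteria = ("accuracy", "completeness", "clarity")
    parsed = [json.loads(j) for j in raw_judgments]
    return {c: sum(p[c] for p in parsed) / len(parsed) for c in criteria}

# Two hypothetical judge outputs:
raw = [
    '{"accuracy": 5, "completeness": 4, "clarity": 5}',
    '{"accuracy": 3, "completeness": 4, "clarity": 4}',
]
# -> {"accuracy": 4.0, "completeness": 4.0, "clarity": 4.5}
```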
Key Metrics¶
| Metric | What It Measures |
|---|---|
| Faithfulness | Answer grounded in context? |
| Answer Relevancy | Addresses the question? |
| Context Precision | Retrieved docs relevant? |
| Context Recall | All relevant docs found? |
| Toxicity | Harmful content? |
| Latency | End-to-end response time |
| Cost | Token usage per query |
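The retrieval-side metrics reduce to set arithmetic once relevance labels exist (from a judge or human annotators). A minimal sketch with hypothetical document IDs:

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0
    return sum(1 for d in retrieved_ids if d in relevant_ids) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of relevant documents that were retrieved."""
    if not relevant_ids:
        return 0.0
    return sum(1 for d in relevant_ids if d in retrieved_ids) / len(relevant_ids)

retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = {"doc1", "doc3", "doc5"}
# precision: 2/4 = 0.5; recall: 2/3
```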
Evaluation Frameworks¶
- RAGAS: automated RAG evaluation
- DeepEval: comprehensive LLM testing
- LangSmith: integrated tracing + evaluation
- Promptfoo: prompt testing and comparison
- Arize Phoenix: LLM observability
Model Serving¶
Self-Hosted Options¶
- vLLM: high-throughput with PagedAttention
- TGI: HuggingFace inference server
- Ollama: simple local serving
- TensorRT-LLM: NVIDIA-optimized
- llama.cpp: CPU-optimized for GGUF models
Serving Optimizations¶
- KV-cache: cache key-value pairs for seen tokens
- Continuous batching: process multiple requests simultaneously
- Speculative decoding: small model drafts, large model verifies
- Quantization: reduce precision for faster inference
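Speculative decoding can be illustrated with toy stand-in models: the draft model proposes k tokens cheaply, and the target model keeps the longest agreeing prefix plus one corrected token. A simplified sketch (real implementations verify against token probabilities in one batched pass, not exact matches):

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One speculative-decoding step; models are callables: token list -> next token."""
    # Draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Target model verifies each drafted position.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_model(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # first mismatch: take target's token, stop
            break
    return accepted

# Toy models over a fixed sequence; the draft diverges after two tokens.
seq = ["the", "cat", "sat", "on", "the", "mat"]
def target(ctx): return seq[len(ctx)]
def draft(ctx): return seq[len(ctx)] if len(ctx) < 2 else "dog"

result = speculative_step(draft, target, [], k=4)
# accepts "the", "cat", then the target's correction "sat"
```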
Monitoring in Production¶
Key Metrics¶
- Response latency (p50, p95, p99)
- Token usage (input + output per request)
- Error rate (API failures, malformed responses)
- Hallucination rate (periodic human review)
- User satisfaction (thumbs up/down)
- Cost per conversation / per user
- Retrieval quality (for RAG)
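Latency percentiles are straightforward to compute from raw per-request measurements; a nearest-rank sketch:

```python
import math

def latency_percentiles(latencies_ms):
    """p50/p95/p99 via the nearest-rank method."""
    s = sorted(latencies_ms)
    def pct(p):
        return s[max(0, math.ceil(len(s) * p / 100) - 1)]
    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

# 100 samples from 100ms to 1090ms in 10ms steps:
latencies = list(range(100, 1100, 10))
# -> {"p50": 590, "p95": 1040, "p99": 1080}
```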
Structured Logging¶
{
"request_id": "uuid",
"timestamp": "2024-01-15T10:30:00Z",
"user_id": "user123",
"input_tokens": 500,
"output_tokens": 200,
"model": "gpt-4",
"latency_ms": 1200,
"tools_called": ["search_db", "calculate"],
"cost_usd": 0.015,
"feedback": null
}
Store full trace (intermediate steps, tool calls, LLM responses) for debugging.
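One JSON object per request, appended as a single line (JSONL), keeps logs grep-able and cheap to store. A minimal sketch using the fields above (the helper name is illustrative):

```python
import io, json, time, uuid

def log_request(fh, *, user_id, model, input_tokens, output_tokens,
                latency_ms, cost_usd, tools_called=(), feedback=None):
    """Append one request record as a single JSON line (JSONL)."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user_id": user_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "model": model,
        "latency_ms": latency_ms,
        "tools_called": list(tools_called),
        "cost_usd": cost_usd,
        "feedback": feedback,
    }
    fh.write(json.dumps(record) + "\n")
    return record

# Example: log to an in-memory buffer instead of a file
buf = io.StringIO()
rec = log_request(buf, user_id="user123", model="gpt-4", input_tokens=500,
                  output_tokens=200, latency_ms=1200, cost_usd=0.015,
                  tools_called=["search_db", "calculate"])
```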
Alerting¶
- Response latency exceeds threshold
- Error rate spikes
- Token usage anomaly (possible injection attack)
- Provider degradation/outage
- Cost exceeds daily/monthly budget
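The token-usage anomaly alert can start as a simple z-score check against recent history; the threshold here is an assumption to tune:

```python
import statistics

def token_usage_anomaly(history, current, threshold=3.0):
    """True if current is more than `threshold` std devs above the recent mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current > mean
    return (current - mean) / stdev > threshold

history = [500, 520, 480, 510, 495, 505]
# token_usage_anomaly(history, 512)  -> False
# token_usage_anomaly(history, 5000) -> True (possible injection: alert)
```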
Cost Optimization¶
Strategies¶
- Smallest effective model: GPT-4o-mini for simple tasks, GPT-4 only for complex
- Response caching: identical/similar queries return cached results
- Prompt compression: minimize system prompt token count
- Smart routing: easy questions to cheap models, hard to expensive
- Prompt caching: cache long, stable prompt prefixes (e.g. Anthropic charges ~25% premium on cache writes, then cache reads are ~10x cheaper)
- Batch processing: group similar requests
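Response caching for identical queries needs little more than a hash of (model, prompt). A minimal exact-match sketch (similar-query caching would need embeddings on top):

```python
import hashlib

class ResponseCache:
    """Exact-match cache keyed on a hash of (model, prompt)."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_call(self, model, prompt, call_llm):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call_llm(model, prompt)  # tokens billed only on a miss
        return self._store[key]

# Hypothetical LLM call that records how often it is actually invoked:
calls = []
def fake_llm(model, prompt):
    calls.append(prompt)
    return f"response to: {prompt}"

cache = ResponseCache()
cache.get_or_call("gpt-4", "What is LLMOps?", fake_llm)
cache.get_or_call("gpt-4", "What is LLMOps?", fake_llm)  # served from cache
```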
Cost Formula¶
Cost/request = (input_tokens * input_price + output_tokens * output_price)
GPT-4o example (assuming ~$2.50/M input, ~$10/M output): 1000 in + 500 out = ~$0.0075
At 10,000 requests/day: ~$75/day
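The arithmetic above as a helper; the per-million-token prices are illustrative, not current list prices:

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD of one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# The example above, assuming $2.50/M input and $10/M output:
cost = request_cost(1000, 500, 2.50, 10.00)  # ~0.0075
daily = cost * 10_000                        # ~75/day
```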
CI/CD for LLM Applications¶
- Code changes: standard CI/CD (tests, lint, deploy)
- Prompt changes: run evaluation suite, compare against baseline, deploy if improved
- Model changes: A/B test, monitor metrics, full rollout if stable
- RAG data changes: re-index, run retrieval quality tests, deploy new index
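The prompt-change gate can be sketched as a comparison against stored baseline evaluation scores: block any regression beyond a tolerance, deploy only if something improved (metric names and tolerance are assumptions):

```python
def should_deploy(baseline, candidate, tolerance=0.02):
    """Deploy a prompt change only if no metric regresses beyond tolerance
    and at least one metric improves."""
    improved = False
    for metric, base_score in baseline.items():
        new_score = candidate[metric]
        if new_score < base_score - tolerance:
            return False  # regression on this metric: block deploy
        if new_score > base_score:
            improved = True
    return improved

baseline = {"faithfulness": 0.90, "relevancy": 0.85}
# should_deploy(baseline, {"faithfulness": 0.92, "relevancy": 0.85}) -> True
# should_deploy(baseline, {"faithfulness": 0.95, "relevancy": 0.70}) -> False
```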
Gotchas¶
- LLM-as-Judge has biases (prefers verbose answers, favors certain styles) - calibrate with human labels
- Evaluation metrics can be gamed - use multiple complementary metrics
- Cost tracking must include ALL token sources (system prompts, function schemas, retries)
- Prompt regression testing is essential - a prompt improvement for one case can break another
- Production hallucination rate is hard to measure without periodic human review
- Model provider updates can change behavior without notice - pin model versions where possible
See Also¶
- llm api integration - API monitoring and cost tracking
- rag pipeline - RAG evaluation metrics
- langchain framework - LangSmith for tracing
- production patterns - Logging and evaluation patterns
- frontier models - Model selection for cost optimization