AI Observability: How to Monitor LLM Applications in Production
Monitoring AI applications is fundamentally different from traditional software. Here's how to build observability into your LLM system from day one.
You cannot operate what you cannot see. AI applications fail in ways traditional monitoring doesn't catch — hallucinations, prompt injections, quality degradation, cost spikes. You need AI-specific observability.
Why Traditional Monitoring Isn't Enough
Traditional monitoring checks: is the server up? Is latency acceptable? Is the error rate low?
AI applications can have perfect uptime, low latency, and zero errors — while silently producing wrong, harmful, or low-quality outputs. The failures live in the content, not the infrastructure.
What to Monitor in AI Applications
1. Input/Output Logging
Log every request and response. This is non-negotiable.
```python
from langfuse import Langfuse

langfuse = Langfuse()
trace = langfuse.trace(name="customer-support-query")
generation = trace.generation(
    name="llm-call",
    model="claude-sonnet-4-6",
    input=prompt,
    output=response,
    usage={"input": input_tokens, "output": output_tokens},
)
```
2. Token Usage and Cost
Track costs at every level — per request, per user, per feature.
```python
# Claude Sonnet pricing: $3 per million input tokens, $15 per million output tokens
cost = (input_tokens * 3 + output_tokens * 15) / 1_000_000
trace.score(name="cost_usd", value=cost)
```
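Per-request numbers only become actionable once you aggregate them. Here is a minimal sketch of per-user cost tracking with a budget check — the `CostTracker` class and the token counts are illustrative, not part of any SDK:

```python
from collections import defaultdict

# Claude Sonnet pricing per million tokens (illustrative constants)
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

class CostTracker:
    """Accumulates LLM spend per user so budgets can be enforced."""

    def __init__(self):
        self.spend = defaultdict(float)

    def record(self, user_id: str, input_tokens: int, output_tokens: int) -> float:
        # Convert token counts into dollars and add to the user's running total
        cost = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
        self.spend[user_id] += cost
        return cost

    def over_budget(self, user_id: str, limit_usd: float) -> bool:
        return self.spend[user_id] > limit_usd

tracker = CostTracker()
tracker.record("user-42", input_tokens=1200, output_tokens=800)
```

The same pattern extends to per-feature tracking: key the dictionary by feature name instead of (or in addition to) user ID.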
3. Latency
Track end-to-end latency and break it down by component: retrieval, the LLM call itself, and any post-processing. A slow response is a quality problem in the user's eyes, and the breakdown tells you which stage to fix.
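A minimal sketch of per-component timing using a context manager — the component names and the `timings` dict are illustrative:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def timed(component: str):
    """Record the wall-clock duration of a pipeline stage in seconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[component] = time.perf_counter() - start

with timed("retrieval"):
    time.sleep(0.01)  # stand-in for fetching context
with timed("llm_call"):
    time.sleep(0.02)  # stand-in for the model call

total_latency = sum(timings.values())
```

In production you would attach these timings to the trace (e.g. as Langfuse spans) rather than a module-level dict, so they show up alongside the request they belong to.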
4. Quality Scores
Use LLM-as-judge to evaluate output quality automatically:
```python
def evaluate_quality(question: str, answer: str) -> float:
    prompt = f"""
    Rate this answer from 0-10 for accuracy, relevance, and helpfulness.
    Question: {question}
    Answer: {answer}
    Return only the number.
    """
    score = float(llm.invoke(prompt).strip())
    return score
```
5. Hallucination Detection
Check if answers are grounded in provided context:
```python
def check_faithfulness(context: str, answer: str) -> bool:
    prompt = f"""
    Does this answer contain only information present in the context?
    Context: {context}
    Answer: {answer}
    Return YES or NO.
    """
    return llm.invoke(prompt).strip().upper() == "YES"
```
Observability Tools
LangSmith — best for LangChain applications, excellent tracing UI
Langfuse — open source, self-hostable, great for custom applications
Helicone — simple, OpenAI-focused, good for quick setup
Arize Phoenix — ML-focused, great for evaluations
Alerting
Set up alerts for:
Cost spikes: daily spend above budget
Quality drops: LLM-as-judge average falling below your baseline
Latency: p95 above your SLO
Hallucinations: failed faithfulness checks trending up
User feedback: a spike in thumbs-down ratings
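As a sketch, a quality-drop alert can be a simple threshold check over a rolling window of recent scores — the thresholds and the `record_score` helper here are illustrative:

```python
from collections import deque

WINDOW = 50            # number of recent requests to consider
MIN_SAMPLES = 10       # don't alert on too little data
QUALITY_FLOOR = 6.0    # alert if the rolling average drops below this

recent_scores: deque[float] = deque(maxlen=WINDOW)

def record_score(score: float) -> bool:
    """Record a quality score; return True if an alert should fire."""
    recent_scores.append(score)
    avg = sum(recent_scores) / len(recent_scores)
    # Only alert once the window has enough data to be meaningful
    return len(recent_scores) >= MIN_SAMPLES and avg < QUALITY_FLOOR
```

The same shape works for cost and latency alerts: accumulate a window, compare an aggregate against a threshold, and fire a notification when it is breached.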
Building a Feedback Loop
The most valuable observability comes from users: thumbs up/down, corrections, escalations. Feed this back into your evaluation dataset.
```python
# When the user provides feedback (thumbs up/down, a correction, an escalation)
def record_feedback(trace_id: str, rating: int, comment: str):
    langfuse.score(
        trace_id=trace_id,
        name="user_feedback",
        value=rating,
        comment=comment,
    )
```
AI observability is non-negotiable for production systems. Talk to us about building monitoring into your AI application.
Ready to implement AI in your business?
Book a free 30-minute strategy call — no commitment required.
