Observability Query Languages - PromQL, LogQL, TraceQL
Reference for the three main observability query languages - Prometheus PromQL for metrics, Grafana Loki LogQL for logs, and Grafana Tempo TraceQL for traces. Includes OpenTelemetry integration and SRE golden signals.
Key Facts
- Three pillars of observability: Metrics (Prometheus), Logs (Loki), Traces (Tempo/Jaeger)
- Golden signals (SRE): Latency, Traffic, Errors, Saturation
- PromQL operates on time series with labels; `rate()` calculates per-second rate over a window
- LogQL uses label selectors + string/regex filters + JSON parsing pipeline
- TraceQL queries distributed traces by duration, service name, span attributes
- OpenTelemetry SDK provides unified instrumentation across all three pillars
Patterns
PromQL (Prometheus)
# Request rate (per second, 1-min window)
sum(rate(http_server_requests_total[1m]))
# Rate of non-200 responses by method (status!="200" also counts other 2xx/3xx codes)
sum(rate(http_server_requests_total{status!="200"}[1m])) by (method)
# Error percentage
(
sum(rate(http_server_requests_total{status!="200"}[1m]))
/
sum(rate(http_server_requests_total[1m]))
) * 100
# Latency percentiles (histogram)
histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket[1m])) by (le, method))
histogram_quantile(0.90, sum(rate(http_server_request_duration_seconds_bucket[1m])) by (le, method))
histogram_quantile(0.95, sum(rate(http_server_request_duration_seconds_bucket[1m])) by (le, method))
histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket[1m])) by (le, method))
# Average latency (total time / total requests, aggregated per method before dividing)
sum(rate(http_server_request_duration_seconds_sum[1m])) by (method)
/
sum(rate(http_server_request_duration_seconds_count[1m])) by (method)
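Every query above is built on `rate()`, so it helps to see roughly what that function computes. The following Go sketch is a simplified approximation under stated assumptions: the `sample` type and `simpleRate` function are hypothetical names, and unlike Prometheus the sketch does not extrapolate to the window boundaries.

```go
package main

import "fmt"

// sample is one scraped counter value; the type and field names are illustrative.
type sample struct {
	ts    float64 // scrape time in seconds
	value float64 // counter value at that time
}

// simpleRate sums the increases between consecutive samples, treats any drop
// as a counter reset, and divides by the time span the samples cover.
func simpleRate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].value - samples[i-1].value
		if delta < 0 {
			// Counter reset: the process restarted from zero, so the new
			// value itself is the increase since the reset.
			delta = samples[i].value
		}
		increase += delta
	}
	elapsed := samples[len(samples)-1].ts - samples[0].ts
	if elapsed <= 0 {
		return 0
	}
	return increase / elapsed
}

func main() {
	// One-minute window scraped every 15s, with a reset between t=30 and t=45.
	window := []sample{{0, 100}, {15, 130}, {30, 160}, {45, 10}, {60, 40}}
	fmt.Printf("per-second rate: %.2f\n", simpleRate(window))
}
```

The reset handling is also why a gauge gives misleading results here: every ordinary decrease in a gauge would be interpreted as a restart from zero.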
LogQL (Grafana Loki)
# Basic label selector
{compose_service="app", level="info"}
# String filter operators
{..} |= "server" # contains
{..} != "debug" # not contains
{..} |~ "err.*500" # regex match
{..} !~ "health" # regex not match
# JSON log parsing pipeline
{app_name="my-app"} | json | method = "GET" and status != "200"
# Rate of error log lines
sum(rate({app_name="my-app"} | json | level = "error" [1m])) by (method)
# Latency quantile from logs
quantile_over_time(0.9,
{app_name="my-app"} | json | code=~"400|500"
| unwrap duration[1m]) by (method)
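Chained line filters act as a logical AND: a line is kept only if it passes every stage. The Go sketch below mirrors the four operators listed above using `strings.Contains` and the `regexp` package (Loki documents its regex syntax as Go RE2, and line-filter regexes are unanchored). The helper names `contains`, `notContains`, `matches`, `notMatches`, and `keep` are hypothetical and not part of Loki.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// lineFilter reports whether a log line passes one filter stage.
type lineFilter func(line string) bool

// contains mirrors |= "s".
func contains(s string) lineFilter {
	return func(line string) bool { return strings.Contains(line, s) }
}

// notContains mirrors != "s".
func notContains(s string) lineFilter {
	return func(line string) bool { return !strings.Contains(line, s) }
}

// matches mirrors |~ "pattern" (unanchored regex match).
func matches(pattern string) lineFilter {
	re := regexp.MustCompile(pattern)
	return re.MatchString
}

// notMatches mirrors !~ "pattern".
func notMatches(pattern string) lineFilter {
	re := regexp.MustCompile(pattern)
	return func(line string) bool { return !re.MatchString(line) }
}

// keep applies the filters in order; the line must pass all of them.
func keep(line string, filters ...lineFilter) bool {
	for _, f := range filters {
		if !f(line) {
			return false
		}
	}
	return true
}

func main() {
	// Equivalent in spirit to: {..} |= "server" != "debug" |~ "err.*500" !~ "health"
	filters := []lineFilter{
		contains("server"), notContains("debug"),
		matches("err.*500"), notMatches("health"),
	}
	for _, line := range []string{
		"server err: upstream returned 500",
		"server debug err 500",
		"server err 500 on /health",
	} {
		fmt.Printf("%v\t%s\n", keep(line, filters...), line)
	}
}
```

Only the first sample line passes all four stages; the other two are dropped by the `!=` and `!~` stages respectively.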
TraceQL (Grafana Tempo)
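# Find traces containing a span slower than 5s (duration is per span)
{duration > 5s}
# Find traces whose total duration exceeds 5s
{traceDuration > 5s}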
# By service and status
{resource.service.name="order-service" && span.http.response.status_code >= 400}
# Rate of spans
{resource.service.name="order-service"} | rate() by (span.http.response.status_code)
# Latency quantiles
{span.http.response.status_code = 200} | quantile_over_time(duration, .999, .99, .9)
OpenTelemetry in Go
// Trace context propagation through gRPC metadata.
// metadataCarrier implements propagation.TextMapCarrier (Get, Set, Keys).
type metadataCarrier struct { md metadata.MD }
func (m *metadataCarrier) Get(key string) string { ... }
func (m *metadataCarrier) Set(key, value string) { ... }
func (m *metadataCarrier) Keys() []string { ... }
// In a server interceptor (incomingMD is read from the request context):
incomingMD, _ := metadata.FromIncomingContext(ctx)
ctx = otel.GetTextMapPropagator().Extract(ctx, &metadataCarrier{md: incomingMD})
Instrument HTTP and gRPC with middleware/interceptors. Export to Jaeger, Tempo, or any OTLP collector.
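One possible shape for the three carrier methods is sketched below. To keep it runnable with the standard library alone, it makes two assumptions: a plain `map[string][]string` stands in for gRPC's `metadata.MD` (which has that underlying type), and the local `textMapCarrier` interface redeclares the three methods of OpenTelemetry's `propagation.TextMapCarrier`. Keys are lowercased because gRPC metadata keys are case-insensitive and stored in lowercase.

```go
package main

import (
	"fmt"
	"strings"
)

// textMapCarrier redeclares the method set of propagation.TextMapCarrier so
// the sketch compiles without the OpenTelemetry module (an assumption made
// for illustration only).
type textMapCarrier interface {
	Get(key string) string
	Set(key, value string)
	Keys() []string
}

// metadataCarrier adapts metadata-style headers to the carrier interface.
// The map stands in for grpc's metadata.MD.
type metadataCarrier struct{ md map[string][]string }

// Compile-time check that the adapter satisfies the interface.
var _ textMapCarrier = (*metadataCarrier)(nil)

// Get returns the first value stored under key, or "" if there is none.
func (m *metadataCarrier) Get(key string) string {
	if vals := m.md[strings.ToLower(key)]; len(vals) > 0 {
		return vals[0]
	}
	return ""
}

// Set replaces any existing values for key with a single value.
func (m *metadataCarrier) Set(key, value string) {
	m.md[strings.ToLower(key)] = []string{value}
}

// Keys lists every key currently held by the carrier.
func (m *metadataCarrier) Keys() []string {
	keys := make([]string, 0, len(m.md))
	for k := range m.md {
		keys = append(keys, k)
	}
	return keys
}

func main() {
	c := &metadataCarrier{md: map[string][]string{}}
	// Example W3C traceparent header value.
	c.Set("Traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(c.Get("traceparent"), c.Keys())
}
```

In a real service the struct field would hold `metadata.MD` and the carrier would be passed to the propagator's `Extract` (server side) or `Inject` (client side).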
Golden Signals
| Signal | What it measures | Example metric |
|---|---|---|
| Latency | How long requests take | histogram_quantile(0.99, ...) |
| Traffic | Demand (RPS) | sum(rate(http_requests_total[1m])) |
| Errors | Rate of failed requests | rate(http_requests_total{status!="200"}[1m]) |
| Saturation | How "full" the service is | CPU, memory, queue depth |
Structured Logging
Use log/slog (standard library since Go 1.21) or zap/zerolog. Always log with context: request ID, user ID, trace ID. Use JSON format in production and text format in development.
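A minimal sketch of this setup with `log/slog` follows. The function names (`newLogger`, `requestLogger`, `logOnce`) and the field keys (`request_id`, `user_id`, `trace_id`) are assumptions chosen for illustration; use whatever keys your LogQL queries expect.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log/slog"
	"os"
)

// newLogger returns a JSON logger for production and a text logger otherwise.
func newLogger(w io.Writer, production bool) *slog.Logger {
	if production {
		return slog.New(slog.NewJSONHandler(w, nil))
	}
	return slog.New(slog.NewTextHandler(w, nil))
}

// requestLogger attaches per-request identifiers once, so every record
// written through the returned logger carries them.
func requestLogger(base *slog.Logger, requestID, userID, traceID string) *slog.Logger {
	return base.With("request_id", requestID, "user_id", userID, "trace_id", traceID)
}

// logOnce writes a single JSON record to a buffer and decodes it again,
// which is a quick way to check that a log pipeline's JSON stage would be
// able to extract the fields.
func logOnce(requestID, userID, traceID, msg string) (map[string]any, error) {
	var buf bytes.Buffer
	requestLogger(newLogger(&buf, true), requestID, userID, traceID).Info(msg)
	fields := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	logger := requestLogger(newLogger(os.Stdout, true), "req-123", "user-42", "trace-abc")
	logger.Info("order created", "method", "POST", "status", 201)
}
```

Binding the identifiers with `With` at the start of a request avoids repeating them at every call site and keeps the field names consistent across log lines.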
Gotchas
- `rate()` in PromQL requires a counter (monotonically increasing); using it on a gauge produces nonsense
- `histogram_quantile` operates on histogram buckets - if bucket boundaries don't cover your data range, results are inaccurate
- LogQL `unwrap` only works on parsed numeric fields - ensure JSON parsing extracts the field correctly
- TraceQL `duration` is span-level; for total trace duration use root span or trace-level queries
- OpenTelemetry SDK adds overhead - tune sampling rate for high-throughput services
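The bucket-boundary gotcha is easier to see with the interpolation written out. The Go sketch below is a simplified approximation of the classic-histogram estimate: it assumes `0 < q <= 1`, cumulative buckets sorted by upper bound and ending in `+Inf`, and a lower bound of 0 for the first bucket. The `bucket` type, `quantile` function, and `exampleBuckets` data are hypothetical; the real Prometheus implementation handles more edge cases.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: the number of observations
// less than or equal to upperBound (the "le" label).
type bucket struct {
	upperBound float64
	count      float64
}

// exampleBuckets: 100 requests, 5 of them slower than the largest finite bound.
var exampleBuckets = []bucket{
	{0.1, 50}, {0.5, 90}, {1, 95}, {math.Inf(1), 100},
}

// quantile finds the bucket containing the target rank and interpolates
// linearly between that bucket's lower and upper bounds.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		if math.IsInf(b.upperBound, 1) {
			// The rank falls beyond the last finite boundary, so the best
			// available answer is that boundary, however slow requests were.
			return buckets[len(buckets)-2].upperBound
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].upperBound, buckets[i-1].count
		}
		return lower + (b.upperBound-lower)*(rank-below)/(b.count-below)
	}
	return math.NaN()
}

func main() {
	for _, q := range []float64{0.5, 0.9, 0.99} {
		fmt.Printf("p%.0f ~ %.3fs\n", q*100, quantile(q, exampleBuckets))
	}
}
```

With this data the 99th percentile is reported as 1s, the largest finite boundary, even if the slowest requests took far longer. Adding buckets above the expected tail is what restores accuracy.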
See Also
- go microservices - gRPC interceptors for telemetry injection
- kafka messaging - monitoring consumer lag with metrics