# Observability
The platform provides centralized logging, metrics collection, and distributed tracing. All three signals are opt-in per service and configured via the service catalog YAML.
## Logging
Logs from all pods are collected by a Fluentd DaemonSet and forwarded to Elasticsearch. You can search and visualize them in Kibana.
### Format requirements
Logs must be written to stdout in JSON format. The log aggregator uses the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `time` | ISO 8601 string | Yes | Timestamp |
| `level` | string | Yes | Log level; see below |
| `service` | string | Yes | Must match the service name |
| `message` | string | Yes | Human-readable message |
| `trace_id` | string | No | Trace ID for correlation with tracing |
| `request_id` | string | No | Incoming request ID |
Non-JSON output is still collected but will not be indexed correctly, making it harder to search.
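A conforming log entry looks like this (values illustrative):

```json
{
  "time": "2024-05-14T09:26:53Z",
  "level": "info",
  "service": "payments",
  "message": "payment processed",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```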
### Examples
Go:

```go
import "go.uber.org/zap"

logger, _ := zap.NewProduction()
logger.Info("payment processed",
    zap.String("service", "payments"),
    zap.String("order_id", orderID),
    zap.Duration("duration", elapsed),
)
```
Python:

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            # pass an explicit datefmt so the timestamp is ISO 8601
            "time": self.formatTime(record, datefmt="%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname.lower(),
            "service": "your-service",
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
```
Node.js:

```javascript
const pino = require("pino");

const logger = pino({ base: { service: "your-service" } });
logger.info({ order_id: orderId }, "payment processed");
```
### Log levels
Use log levels consistently:
| Level | When to use |
|---|---|
| `debug` | Detailed diagnostic information. Disabled in production by default. |
| `info` | Normal operation events: request received, job completed. |
| `warn` | Unexpected but recoverable situations: retry triggered, deprecated API used. |
| `error` | Operation failed. Always include the error message and enough context to investigate. |
Do not log sensitive data (passwords, tokens, card numbers, PII). The log aggregator has no automated redaction.
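Because there is no automated redaction, scrub sensitive fields before they reach the logger. A minimal sketch (the key list is hypothetical; use whatever applies to your service):

```python
# Illustrative redaction helper: mask values for known sensitive keys
# before passing structured fields to the logger.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}  # hypothetical list

def redact(fields: dict) -> dict:
    """Return a copy of `fields` with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }

print(redact({"user": "alice", "token": "abc123"}))
# -> {'user': 'alice', 'token': '[REDACTED]'}
```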
## Metrics
The platform uses Prometheus for metrics collection and Grafana for dashboards. Metrics are required for critical and standard tier services.
### Exposing metrics
Expose a /metrics endpoint in Prometheus exposition format, then enable scraping in your values.yaml:
```yaml
deployment:
  metrics:
    enabled: true   # the chart adds Prometheus annotations to the Service
    port: 8080
    path: /metrics  # default, can be omitted
```
The chart adds the necessary prometheus.io/* annotations to your Service resource automatically — you don’t need to manage them in manifests.
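For reference, the annotations the chart manages follow the common Prometheus scrape convention; this is illustrative and the chart's exact keys may differ:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```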
### Recommended metrics
In addition to the default Go/Python runtime metrics, instrument these signals for every service:
- `http_requests_total{method, path, status}`: request counter
- `http_request_duration_seconds{method, path}`: latency histogram
- `db_query_duration_seconds{query}`: database query latency
- `errors_total{type}`: error counter by type
Go:

```go
import "github.com/prometheus/client_golang/prometheus"

var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "path"},
)

func init() {
    prometheus.MustRegister(requestDuration)
}
```
Python:

```python
from prometheus_client import Histogram, start_http_server

request_duration = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "path"],
)

start_http_server(8080)  # exposes /metrics on port 8080
```
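A scrape of `/metrics` returns plain-text Prometheus exposition format; the histogram above produces series like these (values illustrative):

```text
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/health",le="0.1"} 42
http_request_duration_seconds_bucket{method="GET",path="/health",le="+Inf"} 45
http_request_duration_seconds_sum{method="GET",path="/health"} 1.93
http_request_duration_seconds_count{method="GET",path="/health"} 45
```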
### Grafana dashboards
When you register your service in the catalog with observability.dashboard set, that URL appears on your service’s catalog page. The platform team can help you create a starter dashboard — ask in #platform-observability.
A standard starter dashboard includes:
- Request rate (RPS)
- Error rate (4xx, 5xx)
- Latency percentiles (p50, p95, p99)
- CPU and memory usage
- Pod restarts
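Assuming the recommended metric names from this page, the request-oriented panels typically map to PromQL queries like these (illustrative):

```promql
sum(rate(http_requests_total[5m]))                         # request rate
sum(rate(http_requests_total{status=~"5.."}[5m]))          # 5xx error rate
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))  # p95 latency
```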
## Tracing
Distributed tracing helps you understand the flow of a request across multiple services. The platform uses Jaeger as the tracing backend and OpenTelemetry as the instrumentation standard.
Tracing is optional but strongly recommended for services that make downstream HTTP or gRPC calls.
### Setup
Step 1 — Install the SDK:
Go:

```shell
go get go.opentelemetry.io/otel \
    go.opentelemetry.io/otel/sdk \
    go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```

Python:

```shell
pip install opentelemetry-api \
    opentelemetry-sdk \
    opentelemetry-exporter-otlp
```
Step 2 — Initialize the tracer at startup:
Go:

```go
import (
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

exporter, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("otel-collector.platform.svc:4317"),
    otlptracegrpc.WithInsecure(),
)
if err != nil {
    log.Fatalf("failed to create OTLP exporter: %v", err)
}
provider := trace.NewTracerProvider(trace.WithBatcher(exporter))
otel.SetTracerProvider(provider)
tracer := otel.Tracer("your-service")
```
Python:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(
    endpoint="http://otel-collector.platform.svc:4317",
    insecure=True,  # matches WithInsecure() in the Go example
)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("your-service")
```
Step 3 — Propagate trace context in outgoing requests:
Go:

```go
import "go.opentelemetry.io/otel/propagation"

// Inject trace context into the outgoing HTTP request's headers.
// (HeaderCarrier wraps http.Header; a plain MapCarrier would not compile here.)
propagation.TraceContext{}.Inject(ctx, propagation.HeaderCarrier(req.Header))
resp, err := http.DefaultClient.Do(req)
```
Python:

```python
import httpx
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds traceparent / tracestate headers
response = httpx.get("http://auth.security.svc/verify", headers=headers)
```
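The header that `inject` adds follows the W3C Trace Context format: `{version}-{trace_id}-{span_id}-{flags}`. A minimal sketch of how such a value is assembled (for illustration only; the SDK does this for you):

```python
# Build a W3C traceparent header value from raw trace/span IDs.
def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"  # 01 = sampled
    # trace_id is 16 bytes (32 hex chars), span_id is 8 bytes (16 hex chars)
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"

print(build_traceparent(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
# -> 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```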
### Finding traces
Open Jaeger, select your service name from the dropdown, and search by time range or trace ID. Trace IDs are included in log entries as trace_id when you configure log–trace correlation.
### Enabling trace–log correlation
Include the current trace ID in your log entries so logs and traces link up in Kibana and Jaeger.
Go:

```go
import "go.opentelemetry.io/otel/trace"

spanCtx := trace.SpanFromContext(ctx).SpanContext()
logger.Info("processing payment",
    zap.String("trace_id", spanCtx.TraceID().String()),
    zap.String("span_id", spanCtx.SpanID().String()),
)
```
Python:

```python
from opentelemetry import trace

current_span = trace.get_current_span()
trace_id = format(current_span.get_span_context().trace_id, "032x")
logger.info("processing payment", extra={"trace_id": trace_id})
```
## Alerting
Alerts are defined in Prometheus alerting rules and routed to on-call via PagerDuty.
The platform provides a set of default alert rules for all services based on the following thresholds:
| Alert | Condition | Severity |
|---|---|---|
| High error rate | | warning |
| Very high error rate | | critical |
| High latency | | warning |
| Pod restarts | | warning |
| Pod not ready | No ready pods for 2 min | critical |
To customize alert thresholds for your service, add an `alerts.yaml` file to your `deploy/` directory. Ask in #platform-observability for the schema.
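For reference, thresholds of this kind are expressed in standard Prometheus alerting-rule form. The rule below is illustrative only; the expression and values are not the platform defaults, and `alerts.yaml` has its own schema:

```yaml
groups:
  - name: your-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 5% for 5 minutes"
```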