Observability#

The platform provides centralized logging, metrics collection, and distributed tracing. All three signals are configured per service via the service catalog YAML; logging and tracing are opt-in, while metrics are required for critical and standard tier services.


Logging#

Logs from all pods are collected by a Fluentd DaemonSet and forwarded to Elasticsearch. You can search and visualize them in Kibana.

Format requirements#

Logs must be written to stdout in JSON format. The log aggregator uses the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| time | ISO 8601 string | Yes | Timestamp |
| level | string | Yes | One of debug, info, warn, error |
| service | string | Yes | Must match the service name in the catalog |
| message | string | Yes | Human-readable message |
| trace_id | string | No | Trace ID for correlation with tracing |
| request_id | string | No | Incoming request ID |

Non-JSON output is still collected but will not be indexed correctly, making it harder to search.
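
A conforming log line looks like this (the field values are illustrative):

```json
{"time": "2024-05-14T09:31:22Z", "level": "info", "service": "payments", "message": "payment processed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_id": "req-8f2a"}
```

Extra fields beyond the required set are allowed and will be indexed alongside the standard ones.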

Examples#

Go (zap):

```go
import "go.uber.org/zap"

logger, _ := zap.NewProduction()
defer logger.Sync() // flush buffered log entries on shutdown
logger.Info("payment processed",
    zap.String("service", "payments"),
    zap.String("order_id", orderID),
    zap.Duration("duration", elapsed),
)
```

Python (stdlib logging):

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),  # ISO 8601
            "level": record.levelname.lower(),
            "service": "your-service",
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
```

Node.js (pino):

```javascript
const pino = require("pino");
const logger = pino({ base: { service: "your-service" } });
logger.info({ order_id: orderId }, "payment processed");
```

Log levels#

Use log levels consistently:

| Level | When to use |
| --- | --- |
| debug | Detailed diagnostic information. Disabled in production by default. |
| info | Normal operation events: request received, job completed. |
| warn | Unexpected but recoverable situations: retry triggered, deprecated API used. |
| error | Operation failed. Always include the error message and enough context to investigate. |

Do not log sensitive data (passwords, tokens, card numbers, PII). The log aggregator has no automated redaction.
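
Because nothing is redacted downstream, scrub sensitive fields in your own code before they reach stdout. A minimal sketch of client-side redaction (the key names in SENSITIVE_KEYS are illustrative, not a platform-defined list):

```python
import json

# Illustrative deny-list; extend it for your service's own fields.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

def redact(payload: dict) -> dict:
    """Return a copy of payload with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }

record = {"user": "alice", "password": "hunter2", "message": "login ok"}
print(json.dumps(redact(record)))
```

Apply this in a log formatter or filter so it runs on every entry, not just the ones you remember to scrub.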


Metrics#

The platform uses Prometheus for metrics collection and Grafana for dashboards. Metrics are required for critical and standard tier services.

Exposing metrics#

Expose a /metrics endpoint in Prometheus exposition format, then enable scraping in your values.yaml:

```yaml
deployment:
  metrics:
    enabled: true   # the chart adds Prometheus annotations to the Service
    port: 8080
    path: /metrics  # default, can be omitted
```

The chart adds the necessary prometheus.io/* annotations to your Service resource automatically — you don’t need to manage them in manifests.
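
In practice you would use an official Prometheus client library (prometheus_client for Python, client_golang for Go) rather than formatting output by hand, but as a sketch of what the /metrics endpoint must return, here is the text exposition format for a counter, rendered with only the standard library (the metric name and labels are illustrative):

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, e.g. ({"code": "200"}, 42).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}" if labels else f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    [({"code": "200"}, 1027), ({"code": "500"}, 3)],
)
print(body)
```

Each sample is one line of `name{labels} value`; the `# HELP` and `# TYPE` comment lines are part of the format, not optional decoration.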

Grafana dashboards#

When you register your service in the catalog with observability.dashboard set, that URL appears on your service’s catalog page. The platform team can help you create a starter dashboard — ask in #platform-observability.

A standard starter dashboard includes:

  • Request rate (RPS)

  • Error rate (4xx, 5xx)

  • Latency percentiles (p50, p95, p99)

  • CPU and memory usage

  • Pod restarts
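
If you build the dashboard yourself, the first three panels map to PromQL queries along these lines, assuming your service exports the common `http_requests_total` and `http_request_duration_seconds_bucket` metrics (metric and label names here are assumptions; substitute whatever your service actually exposes):

```promql
# Request rate (RPS) over the last 5 minutes
sum(rate(http_requests_total{service="your-service"}[5m]))

# Error rate: share of 5xx responses
sum(rate(http_requests_total{service="your-service", code=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="your-service"}[5m]))

# p99 latency from a histogram
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="your-service"}[5m])))
```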


Tracing#

Distributed tracing helps you understand the flow of a request across multiple services. The platform uses Jaeger as the tracing backend and OpenTelemetry as the instrumentation standard.

Tracing is optional but strongly recommended for services that make downstream HTTP or gRPC calls.

Setup#

Step 1 — Install the SDK:

Go:

```shell
go get go.opentelemetry.io/otel \
       go.opentelemetry.io/otel/sdk \
       go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```

Python:

```shell
pip install opentelemetry-api \
            opentelemetry-sdk \
            opentelemetry-exporter-otlp
```

Step 2 — Initialize the tracer at startup:

Go:

```go
import (
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

exporter, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("otel-collector.platform.svc:4317"),
    otlptracegrpc.WithInsecure(),
)
if err != nil {
    log.Fatalf("failed to create OTLP exporter: %v", err)
}
provider := trace.NewTracerProvider(trace.WithBatcher(exporter))
otel.SetTracerProvider(provider)

tracer := otel.Tracer("your-service")
```

Python:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector.platform.svc:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("your-service")
```

Step 3 — Propagate trace context in outgoing requests:

Go:

```go
import "go.opentelemetry.io/otel/propagation"

// inject W3C trace context headers into the outgoing HTTP request;
// HeaderCarrier adapts http.Header to the propagation carrier interface
propagation.TraceContext{}.Inject(ctx, propagation.HeaderCarrier(req.Header))
resp, err := http.DefaultClient.Do(req)
```

Python:

```python
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds traceparent / tracestate headers
response = httpx.get("http://auth.security.svc/verify", headers=headers)
```
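
The header that gets injected follows the W3C Trace Context format: four dash-separated fields (version, 32-hex-character trace ID, 16-hex-character parent span ID, flags). A stdlib sketch of pulling it apart, which can be handy when debugging propagation between services:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": flags == "01",  # flag bit 0 = sampled
    }

fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(fields["trace_id"])  # → 4bf92f3577b34da6a3ce929d0e0e4736
```

The trace ID in this header is the same value you should emit as trace_id in your logs.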

Finding traces#

Open Jaeger, select your service name from the dropdown, and search by time range or trace ID. Trace IDs are included in log entries as trace_id when you configure log–trace correlation.

Enabling trace–log correlation#

Include the current trace ID in your log entries so logs and traces link up in Kibana and Jaeger.

Go:

```go
import "go.opentelemetry.io/otel/trace"

spanCtx := trace.SpanFromContext(ctx).SpanContext()
logger.Info("processing payment",
    zap.String("trace_id", spanCtx.TraceID().String()),
    zap.String("span_id", spanCtx.SpanID().String()),
)
```

Python:

```python
from opentelemetry import trace

current_span = trace.get_current_span()
trace_id = format(current_span.get_span_context().trace_id, "032x")

logger.info("processing payment", extra={"trace_id": trace_id})
```

Alerting#

Alerts are defined in Prometheus alerting rules and routed to on-call via PagerDuty.

The platform provides a set of default alert rules for all services based on the following thresholds:

| Alert | Condition | Severity |
| --- | --- | --- |
| High error rate | error_rate > 1% for 5 min | warning |
| Very high error rate | error_rate > 5% for 2 min | critical |
| High latency | p99 > 2s for 5 min | warning |
| Pod restarts | restart_count > 3 in 15 min | warning |
| Pod not ready | No ready pods for 2 min | critical |
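
Under the hood these are standard Prometheus alerting rules. For orientation, a rule like "High error rate" looks roughly like this in the upstream Prometheus rule format (the metric name, label values, and group name are illustrative, and the platform's own schema may differ):

```yaml
groups:
  - name: your-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="your-service", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="your-service"}[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 1% for your-service"
```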

To customize alert thresholds for your service, add an alerts.yaml file to your deploy/ directory. Ask in #platform-observability for the schema.