Observability#

The platform provides centralized logging, metrics collection, and distributed tracing. All three signals are configured per service via the service catalog YAML; logging and tracing are opt-in, while metrics are required for critical and standard tier services.


Logging#

Logs from all pods are collected by a Fluentd DaemonSet and forwarded to Elasticsearch. You can search and visualize them in Kibana.

Format requirements#

Logs must be written to stdout in JSON format. The log aggregator uses the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| time | ISO 8601 string | Yes | Timestamp |
| level | string | Yes | One of debug, info, warn, error |
| service | string | Yes | Must match the service name in the catalog |
| message | string | Yes | Human-readable message |
| trace_id | string | No | Trace ID for correlation with tracing |
| request_id | string | No | Incoming request ID |

Non-JSON output is still collected but will not be indexed correctly, making it harder to search.
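
A conforming log line looks like this (the field values are illustrative):

```json
{"time": "2024-05-14T09:31:22Z", "level": "info", "service": "payments", "message": "payment processed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_id": "req-8f2a"}
```

Extra fields beyond the required set are allowed and will be indexed alongside the standard ones.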

Examples#

Go (zap):

```go
import "go.uber.org/zap"

logger, _ := zap.NewProduction()
defer logger.Sync() // flush buffered log entries on shutdown
logger.Info("payment processed",
    zap.String("service", "payments"),
    zap.String("order_id", orderID),
    zap.Duration("duration", elapsed),
)
```

Python (stdlib logging):

```python
import json
import logging

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),  # ISO 8601
            "level": record.levelname.lower(),
            "service": "your-service",
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
```

Node.js (pino):

```javascript
const pino = require("pino");
const logger = pino({ base: { service: "your-service" } });
logger.info({ order_id: orderId }, "payment processed");
```

Log levels#

Use log levels consistently:

| Level | When to use |
| --- | --- |
| debug | Detailed diagnostic information. Disabled in production by default. |
| info | Normal operation events: request received, job completed. |
| warn | Unexpected but recoverable situations: retry triggered, deprecated API used. |
| error | Operation failed. Always include the error message and enough context to investigate. |

Do not log sensitive data (passwords, tokens, card numbers, PII). The log aggregator has no automated redaction.
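
Because nothing is redacted downstream, scrub sensitive fields in your own code before they reach stdout. A minimal sketch of client-side redaction (the key names in SENSITIVE_KEYS are illustrative, not a platform-defined list):

```python
import json

# Illustrative deny-list; extend it for your service's own fields.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

def redact(payload: dict) -> dict:
    """Return a copy of payload with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }

record = {"user": "alice", "password": "hunter2", "message": "login ok"}
print(json.dumps(redact(record)))
```

Apply this in a log formatter or filter so it runs on every entry, not just the ones you remember to scrub.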


Metrics#

The platform uses Prometheus for metrics collection and Grafana for dashboards. Metrics are required for critical and standard tier services.

Exposing metrics#

Expose a /metrics endpoint in Prometheus exposition format, then enable scraping in your values.yaml:

```yaml
deployment:
  metrics:
    enabled: true   # the chart adds Prometheus annotations to the Service
    port: 8080
    path: /metrics  # default, can be omitted
```

The chart adds the necessary prometheus.io/* annotations to your Service resource automatically — you don’t need to manage them in manifests.
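
In practice you would use an official Prometheus client library (prometheus_client for Python, client_golang for Go) rather than formatting output by hand, but as a sketch of what the /metrics endpoint must return, here is the text exposition format for a counter, rendered with only the standard library (the metric name and labels are illustrative):

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, e.g. ({"code": "200"}, 42).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}" if labels else f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    [({"code": "200"}, 1027), ({"code": "500"}, 3)],
)
print(body)
```

Each sample is one line of `name{labels} value`; the `# HELP` and `# TYPE` comment lines are part of the format, not optional decoration.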

Grafana dashboards#

When you register your service in the catalog with observability.dashboard set, that URL appears on your service’s catalog page. The platform team can help you create a starter dashboard — ask in #platform-observability.

A standard starter dashboard includes:

  • Request rate (RPS)

  • Error rate (4xx, 5xx)

  • Latency percentiles (p50, p95, p99)

  • CPU and memory usage

  • Pod restarts
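
If you build the dashboard yourself, the first three panels map to PromQL queries along these lines, assuming your service exports the common `http_requests_total` and `http_request_duration_seconds_bucket` metrics (metric and label names here are assumptions; substitute whatever your service actually exposes):

```promql
# Request rate (RPS) over the last 5 minutes
sum(rate(http_requests_total{service="your-service"}[5m]))

# Error rate: share of 5xx responses
sum(rate(http_requests_total{service="your-service", code=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="your-service"}[5m]))

# p99 latency from a histogram
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="your-service"}[5m])))
```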


Tracing#

Distributed tracing helps you understand the flow of a request across multiple services. The platform uses Jaeger as the tracing backend and OpenTelemetry as the instrumentation standard.

Tracing is optional but strongly recommended for services that make downstream HTTP or gRPC calls.

Setup#

Step 1 — Install the SDK:

Go:

```shell
go get go.opentelemetry.io/otel \
       go.opentelemetry.io/otel/sdk \
       go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```

Python:

```shell
pip install opentelemetry-api \
            opentelemetry-sdk \
            opentelemetry-exporter-otlp
```

Step 2 — Initialize the tracer at startup:

Go:

```go
import (
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

exporter, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("otel-collector.platform.svc:4317"),
    otlptracegrpc.WithInsecure(),
)
if err != nil {
    log.Fatalf("failed to create OTLP exporter: %v", err)
}
provider := trace.NewTracerProvider(trace.WithBatcher(exporter))
otel.SetTracerProvider(provider)

tracer := otel.Tracer("your-service")
```

Python:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector.platform.svc:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("your-service")
```

Step 3 — Propagate trace context in outgoing requests:

Go:

```go
import "go.opentelemetry.io/otel/propagation"

// inject W3C trace context headers into the outgoing HTTP request;
// HeaderCarrier adapts http.Header to the propagation carrier interface
propagation.TraceContext{}.Inject(ctx, propagation.HeaderCarrier(req.Header))
resp, err := http.DefaultClient.Do(req)
```

Python:

```python
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds traceparent / tracestate headers
response = httpx.get("http://auth.security.svc/verify", headers=headers)
```
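
The header that gets injected follows the W3C Trace Context format: four dash-separated fields (version, 32-hex-character trace ID, 16-hex-character parent span ID, flags). A stdlib sketch of pulling it apart, which can be handy when debugging propagation between services:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": flags == "01",  # flag bit 0 = sampled
    }

fields = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(fields["trace_id"])  # → 4bf92f3577b34da6a3ce929d0e0e4736
```

The trace ID in this header is the same value you should emit as trace_id in your logs.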

Finding traces#

Open Jaeger, select your service name from the dropdown, and search by time range or trace ID. Trace IDs are included in log entries as trace_id when you configure log–trace correlation.

Enabling trace–log correlation#

Include the current trace ID in your log entries so logs and traces link up in Kibana and Jaeger.

Go:

```go
import "go.opentelemetry.io/otel/trace"

spanCtx := trace.SpanFromContext(ctx).SpanContext()
logger.Info("processing payment",
    zap.String("trace_id", spanCtx.TraceID().String()),
    zap.String("span_id", spanCtx.SpanID().String()),
)
```

Python:

```python
from opentelemetry import trace

current_span = trace.get_current_span()
trace_id = format(current_span.get_span_context().trace_id, "032x")

logger.info("processing payment", extra={"trace_id": trace_id})
```

Alerting#

Alerts are defined in Prometheus alerting rules and routed to on-call via PagerDuty.

The platform provides a set of default alert rules for all services based on the following thresholds:

| Alert | Condition | Severity |
| --- | --- | --- |
| High error rate | error_rate > 1% for 5 min | warning |
| Very high error rate | error_rate > 5% for 2 min | critical |
| High latency | p99 > 2s for 5 min | warning |
| Pod restarts | restart_count > 3 in 15 min | warning |
| Pod not ready | No ready pods for 2 min | critical |
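
Under the hood these are standard Prometheus alerting rules. For orientation, a rule like "High error rate" looks roughly like this in the upstream Prometheus rule format (the metric name, label values, and group name are illustrative, and the platform's own schema may differ):

```yaml
groups:
  - name: your-service-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="your-service", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="your-service"}[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate above 1% for your-service"
```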

To customize alert thresholds for your service, add an alerts.yaml file to your deploy/ directory. Ask in #platform-observability for the schema.