# Observability

The platform provides centralized logging, metrics collection, and distributed tracing. Each signal is configured per service via the service catalog YAML; some signals are required depending on service tier (see below).

---

## Logging

Logs from all pods are collected by a Fluentd DaemonSet and forwarded to Elasticsearch. You can search and visualize them in [Kibana](https://kibana.mycorp.internal).

### Format requirements

Logs **must** be written to `stdout` in JSON format. The log aggregator uses the following fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `time` | ISO 8601 string | Yes | Timestamp |
| `level` | string | Yes | `debug`, `info`, `warn`, `error` |
| `service` | string | Yes | Must match the service `name` in catalog |
| `message` | string | Yes | Human-readable message |
| `trace_id` | string | No | Trace ID for correlation with tracing |
| `request_id` | string | No | Incoming request ID |

Non-JSON output is still collected but will not be indexed correctly, making it harder to search.
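A well-formed log line matching this schema looks like the following (all values illustrative):

```json
{"time": "2024-05-01T12:34:56Z", "level": "info", "service": "payments", "message": "payment processed", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_id": "req-8f1c2a"}
```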

### Examples

::::{tab-set}
:::{tab-item} Go
```go
import "go.uber.org/zap"
import "go.uber.org/zap/zapcore"

// zap.NewProduction writes ts/msg keys to stderr by default; adjust the
// config so JSON goes to stdout with the required time/level/message fields.
cfg := zap.NewProductionConfig()
cfg.OutputPaths = []string{"stdout"}
cfg.EncoderConfig.TimeKey = "time"
cfg.EncoderConfig.MessageKey = "message"
cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
logger, _ := cfg.Build()
logger.Info("payment processed",
    zap.String("service", "payments"),
    zap.String("order_id", orderID),
    zap.Duration("duration", elapsed),
)
```
:::
:::{tab-item} Python
```python
import json
import logging
import sys

class JSONFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            # Pass an ISO 8601 datefmt; the formatTime default is not ISO 8601.
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname.lower(),
            "service": "your-service",
            "message": record.getMessage(),
        })

# StreamHandler defaults to stderr; the platform requires stdout.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logging.getLogger().addHandler(handler)
```
:::
:::{tab-item} Node.js
```javascript
const pino = require("pino");
const logger = pino({
  base: { service: "your-service" },
  timestamp: pino.stdTimeFunctions.isoTime,             // ISO 8601 time, not epoch ms
  formatters: { level: (label) => ({ level: label }) }, // string level name, not numeric
});
logger.info({ order_id: orderId }, "payment processed");
```
:::
::::

### Log levels

Use log levels consistently:

| Level | When to use |
|-------|-------------|
| `debug` | Detailed diagnostic information. Disabled in production by default. |
| `info` | Normal operation events: request received, job completed. |
| `warn` | Unexpected but recoverable situations: retry triggered, deprecated API used. |
| `error` | Operation failed. Always include the error message and enough context to investigate. |

Do not log sensitive data (passwords, tokens, card numbers, PII). The log aggregator has no automated redaction.
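Because nothing is redacted downstream, scrub sensitive values before they reach the logger. A minimal sketch (the helper and key names are hypothetical examples, not a platform API):

```python
# Hypothetical scrubber: mask values of known-sensitive keys before logging.
SENSITIVE_KEYS = {"password", "token", "card_number", "ssn"}

def scrub(fields: dict) -> dict:
    """Return a copy of fields with sensitive values masked."""
    return {k: "[REDACTED]" if k in SENSITIVE_KEYS else v for k, v in fields.items()}

print(scrub({"user": "alice", "password": "hunter2"}))
# → {'user': 'alice', 'password': '[REDACTED]'}
```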

---

## Metrics

The platform uses [Prometheus](https://prometheus.mycorp.internal) for metrics collection and [Grafana](https://grafana.mycorp.internal) for dashboards. Metrics are required for `critical` and `standard` tier services.

### Exposing metrics

Expose a `/metrics` endpoint in [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/), then enable scraping in your `values.yaml`:

```yaml
deployment:
  metrics:
    enabled: true   # the chart adds Prometheus annotations to the Service
    port: 8080
    path: /metrics  # default, can be omitted
```

The chart adds the necessary `prometheus.io/*` annotations to your `Service` resource automatically — you don't need to manage them in manifests.
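For reference, the rendered `Service` typically ends up with annotations along these lines (exact keys and values depend on the chart version):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```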

### Recommended metrics

In addition to the default Go/Python runtime metrics, instrument these signals for every service:

```
http_requests_total{method, path, status}   — request counter
http_request_duration_seconds{method, path} — latency histogram
db_query_duration_seconds{query}            — database query latency
errors_total{type}                          — error counter by type
```

::::{tab-set}
:::{tab-item} Go
```go
var requestDuration = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency",
        Buckets: prometheus.DefBuckets,
    },
    []string{"method", "path"},
)

func init() {
    prometheus.MustRegister(requestDuration)
}
```
:::
:::{tab-item} Python
```python
from prometheus_client import Histogram, start_http_server

request_duration = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    ["method", "path"],
)

start_http_server(8080)  # exposes /metrics on port 8080
```
:::
::::

### Grafana dashboards

When you register your service in the catalog with `observability.dashboard` set, that URL appears on your service's catalog page. The platform team can help you create a starter dashboard — ask in [#platform-observability](https://slack.mycorp.internal/platform-observability).

A standard starter dashboard includes:

- Request rate (RPS)
- Error rate (4xx, 5xx)
- Latency percentiles (p50, p95, p99)
- CPU and memory usage
- Pod restarts
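If you build the dashboard yourself, the panels above map to PromQL queries roughly like the following (assuming the recommended metric names; the `job` label may differ in your scrape config):

```promql
# Request rate (RPS)
sum(rate(http_requests_total{job="your-service"}[5m]))

# Error rate (share of 5xx responses)
sum(rate(http_requests_total{job="your-service", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{job="your-service"}[5m]))

# p95 latency
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="your-service"}[5m])))
```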

---

(tracing)=
## Tracing

Distributed tracing helps you understand the flow of a request across multiple services. The platform uses [Jaeger](https://jaeger.mycorp.internal) as the tracing backend and [OpenTelemetry](https://opentelemetry.io) as the instrumentation standard.

Tracing is optional but strongly recommended for services that make downstream HTTP or gRPC calls.

### Setup

**Step 1 — Install the SDK:**

::::{tab-set}
:::{tab-item} Go
```bash
go get go.opentelemetry.io/otel \
       go.opentelemetry.io/otel/sdk \
       go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
```
:::
:::{tab-item} Python
```bash
pip install opentelemetry-api \
            opentelemetry-sdk \
            opentelemetry-exporter-otlp
```
:::
::::

**Step 2 — Initialize the tracer at startup:**

::::{tab-set}
:::{tab-item} Go
```go
import (
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
)

exporter, _ := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("otel-collector.platform.svc:4317"),
    otlptracegrpc.WithInsecure(),
)
// Set service.name so spans are grouped under your service in Jaeger.
res, _ := resource.Merge(resource.Default(), resource.NewWithAttributes(
    "", attribute.String("service.name", "your-service"),
))
provider := trace.NewTracerProvider(
    trace.WithBatcher(exporter),
    trace.WithResource(res),
)
otel.SetTracerProvider(provider)

tracer := otel.Tracer("your-service")
```
:::
:::{tab-item} Python
```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Set service.name so spans are grouped under your service in Jaeger.
provider = TracerProvider(resource=Resource.create({"service.name": "your-service"}))
exporter = OTLPSpanExporter(endpoint="http://otel-collector.platform.svc:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("your-service")
```
:::
::::

**Step 3 — Propagate trace context in outgoing requests:**

::::{tab-set}
:::{tab-item} Go
```go
import "go.opentelemetry.io/otel/propagation"

// Inject the active trace context into the outgoing request's headers.
// http.Header requires HeaderCarrier; MapCarrier is for map[string]string.
propagation.TraceContext{}.Inject(ctx, propagation.HeaderCarrier(req.Header))
resp, err := http.DefaultClient.Do(req)
```
:::
:::{tab-item} Python
```python
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds traceparent / tracestate headers
response = httpx.get("http://auth.security.svc/verify", headers=headers)
```
:::
::::
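Both snippets emit the W3C `traceparent` header, which carries the trace context across service boundaries:

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ version
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ trace-id (16 bytes, hex)
                                                 ^^^^^^^^^^^^^^^^ parent span-id (8 bytes, hex)
                                                                  ^^ flags (01 = sampled)
```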

### Finding traces

Open [Jaeger](https://jaeger.mycorp.internal), select your service name from the dropdown, and search by time range or trace ID. Trace IDs are included in log entries as `trace_id` when you configure log–trace correlation.
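Once correlation is configured, you can pull up all logs for a given trace with a Kibana query such as (field names per the logging schema above; trace ID illustrative):

```
service : "payments" and trace_id : "4bf92f3577b34da6a3ce929d0e0e4736"
```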

### Enabling trace–log correlation

Include the current trace ID in your log entries so logs and traces link up in Kibana and Jaeger.

::::{tab-set}
:::{tab-item} Go
```go
import "go.opentelemetry.io/otel/trace"

spanCtx := trace.SpanFromContext(ctx).SpanContext()
logger.Info("processing payment",
    zap.String("trace_id", spanCtx.TraceID().String()),
    zap.String("span_id", spanCtx.SpanID().String()),
)
```
:::
:::{tab-item} Python
```python
from opentelemetry import trace

current_span = trace.get_current_span()
trace_id = format(current_span.get_span_context().trace_id, "032x")

# Note: your log formatter must emit record.trace_id for this to reach the JSON output.
logger.info("processing payment", extra={"trace_id": trace_id})
```
:::
::::

---

## Alerting

Alerts are defined in Prometheus alerting rules and routed to on-call via [PagerDuty](https://pagerduty.mycorp.internal).

The platform provides a set of default alert rules for all services based on the following thresholds:

| Alert | Condition | Severity |
|-------|-----------|----------|
| High error rate | `error_rate > 1%` for 5 min | warning |
| Very high error rate | `error_rate > 5%` for 2 min | critical |
| High latency | `p99 > 2s` for 5 min | warning |
| Pod restarts | `restart_count > 3` in 15 min | warning |
| Pod not ready | No ready pods for 2 min | critical |
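
For illustration, the first default expressed as a plain Prometheus alerting rule might look like this (the actual rule names and expressions live in the platform's rule files):

```yaml
groups:
  - name: your-service-defaults
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="your-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="your-service"}[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
```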

To customize alert thresholds for your service, add an `alerts.yaml` file to your `deploy/` directory. Ask in [#platform-observability](https://slack.mycorp.internal/platform-observability) for the schema.
