# Kubernetes

The platform provides a library of Helm charts for common workload types. Instead of writing Kubernetes manifests from scratch, teams reference a chart from the central registry and pass a `values.yaml` that overrides the defaults.

Charts are maintained by the platform team in a dedicated repository and published to the internal OCI registry at `registry.mycorp.internal/charts`. Teams do not copy or modify the charts — they only declare a dependency and provide values.

---

## How it works

```{mermaid}
flowchart LR
    A[Your values.yaml] -->|helm upgrade| B[Platform chart library]
    B --> C{Chart type}
    C -->|Deployment| D[Deployment + Service + HPA]
    C -->|Job| E[Job + ConfigMap]
    C -->|CronJob| F[CronJob + ConfigMap]
    C -->|StatefulSet| G[StatefulSet + Service + PVC]
    D & E & F & G -->|apply| H[Kubernetes cluster]
```

Reference the chart as a dependency in your service's `Chart.yaml`:

```yaml
# deploy/helm/Chart.yaml
apiVersion: v2
name: your-service
version: 1.0.0
dependencies:
  - name: deployment        # or: job | cronjob | statefulset
    version: ">=1.0.0"
    repository: oci://registry.mycorp.internal/charts
```

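Then override only what you need in your `values.yaml`; the chart supplies defaults for everything else. Because the platform chart is pulled in as a dependency, its values are nested under a top-level key that matches the dependency name (`deployment`, `job`, `cronjob`, or `statefulset`).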

---

## Chart types

::::{tab-set}

:::{tab-item} Deployment

For long-running services that handle HTTP traffic.

**Minimal values.yaml:**

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest

  port: 8080

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
```

**Full reference:**

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest            # placeholder: pin a version (e.g. 1.4.2); IfNotPresent does not re-pull a cached tag
    pullPolicy: IfNotPresent

  replicas: 2          # ignored if autoscaling.enabled = true
  port: 8080

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi

  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 15
    readiness:
      path: /health/ready
      initialDelaySeconds: 5

  env:
    LOG_LEVEL: info
    AUTH_SERVICE_URL: http://auth.security.svc.cluster.local

  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL, API_KEY]

  metrics:
    enabled: true        # adds prometheus.io annotations to Service
    path: /metrics

  ingress:
    enabled: false
    host: your-service.mycorp.internal
```
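
As a rough illustration of what the `autoscaling` values above translate to, the chart presumably renders a HorizontalPodAutoscaler along these lines. The object name and the exact template are assumptions; only the numbers come from the values shown above.

```yaml
# Sketch only: the platform chart's real template may differ.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: your-service            # assumed to match the release name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-service
  minReplicas: 2                # autoscaling.minReplicas
  maxReplicas: 10               # autoscaling.maxReplicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # autoscaling.targetCPUUtilizationPercentage
```

CPU utilization is measured against `resources.requests.cpu`, so with a 100m request the 70% target corresponds to roughly 70m of average usage per pod.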

:::

:::{tab-item} Job

For one-time tasks: database migrations, data imports, cleanup runs.

```yaml
job:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest

  command: ["python", "manage.py", "migrate"]

  resources:
    requests:
      cpu: 200m
      memory: 256Mi

  restartPolicy: OnFailure
  backoffLimit: 3

  # Pass arbitrary config as a mounted ConfigMap
  config:
    DB_HOST: postgres.fintech.svc.cluster.local
    DB_NAME: payments

  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL]
```

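A Job's pod template is immutable, so upgrading an existing release with a different image tag may be rejected unless the chart recreates the Job; in that case, run `helm uninstall migrate -n yourteam` first. To run a Job manually: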

```bash
helm dependency update deploy/helm
helm upgrade --install migrate ./deploy/helm \
  -n yourteam \
  -f deploy/helm/values.yaml \
  --set job.image.tag=1.4.2
```

:::

:::{tab-item} CronJob

For scheduled tasks: report generation, cache warming, health probes.

```yaml
cronjob:
  image:
    repository: registry.mycorp.internal/reporter
    tag: latest

  schedule: "0 3 * * *"    # daily at 03:00 UTC

  command: ["python", "generate_report.py"]

  resources:
    requests:
      cpu: 100m
      memory: 128Mi

  concurrencyPolicy: Forbid     # prevent overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1

  envFromSecret:
    - secretName: reporter-secrets
      keys: [S3_BUCKET, AWS_ROLE_ARN]
```

:::

:::{tab-item} StatefulSet

For stateful workloads: workers that require stable network identities or persistent storage.

```yaml
statefulset:
  image:
    repository: registry.mycorp.internal/worker
    tag: latest

  replicas: 3
  port: 9000

  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

  persistence:
    enabled: true
    storageClass: standard-ssd
    size: 10Gi
    mountPath: /data

  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 30
    readiness:
      path: /health/ready
      initialDelaySeconds: 10
```
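
To make the `persistence` block more concrete, the fragment below sketches the volume claim template a StatefulSet chart could plausibly generate from it. The claim name `data` and the `ReadWriteOnce` access mode are assumptions, not values confirmed by the chart.

```yaml
# Sketch only: fragment of a StatefulSet spec, not a complete manifest.
volumeClaimTemplates:
  - metadata:
      name: data                      # hypothetical claim name
    spec:
      accessModes: ["ReadWriteOnce"]  # assumed; one node mounts each volume
      storageClassName: standard-ssd  # persistence.storageClass
      resources:
        requests:
          storage: 10Gi               # persistence.size
```

Each replica gets its own claim from this template, so with `replicas: 3` the example above would request three 10Gi volumes.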

:::

::::

---

## Namespaces

Each team gets a dedicated namespace. Services in different namespaces communicate through the [api-gateway](catalog/generated/api-gateway.md) or via explicit NetworkPolicy rules.

| Team | Namespace |
|------|-----------|
| team-platform | `platform` |
| team-security | `security` |
| team-fintech | `fintech` |
| team-communications | `communications` |
| team-account | `account` |
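
For traffic that bypasses the api-gateway, an explicit rule might look like the sketch below. It assumes a hypothetical `auth` workload in the `security` namespace, labeled `app: auth` and listening on port 8080, and allows callers from the `fintech` namespace. The policy name, pod label, and port are illustrative assumptions; the `kubernetes.io/metadata.name` label is set by Kubernetes on every namespace.

```yaml
# Sketch only: names, labels, and port are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-fintech-to-auth        # hypothetical policy name
  namespace: security                # namespace of the service being called
spec:
  podSelector:
    matchLabels:
      app: auth                      # assumed pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: fintech
      ports:
        - protocol: TCP
          port: 8080                 # assumed container port
```

The policy lives in the namespace of the service that receives the traffic, so the receiving team would typically own it.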

---

## Resource guidelines

Set `resources.requests` and `resources.limits` on every workload. The platform enforces resource quotas per namespace.

| Tier | CPU request | CPU limit | Memory request | Memory limit |
|------|-------------|-----------|----------------|--------------|
| `critical` | 250m | 1000m | 256Mi | 1Gi |
| `standard` | 100m | 500m | 128Mi | 256Mi |
| `internal` | 50m | 200m | 64Mi | 128Mi |

Adjust based on profiling. If a pod is `OOMKilled`, increase `limits.memory`. If throttled (visible in the Grafana CPU panel), increase `limits.cpu`.
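
As an example, a service in the `critical` tier would start from the table's values as follows. This is a minimal sketch that assumes the `deployment` chart; the other chart types show the same `resources` key in their examples.

```yaml
deployment:
  resources:
    requests:
      cpu: 250m        # critical tier CPU request
      memory: 256Mi    # critical tier memory request
    limits:
      cpu: 1000m       # raise if the pod is CPU-throttled
      memory: 1Gi      # raise if the pod is OOMKilled
```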

---

(health-checks)=
## Health checks

The `deployment` and `statefulset` charts require your service to expose two HTTP endpoints. The chart defaults match the paths in the table below; override them under `healthChecks` if your service uses different paths.

| Endpoint | Purpose | Default path |
|----------|---------|--------------|
| Liveness | Is the process alive? Kubernetes restarts the container on failure. | `/health/live` |
| Readiness | Is the service ready for traffic? Kubernetes removes the pod from Service endpoints on failure. | `/health/ready` |

**Implementation rule:** the liveness endpoint must be fast and stateless (no DB calls). The readiness endpoint should check critical dependencies.
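
For orientation, the `healthChecks` values from the Deployment reference would map onto container probes roughly as sketched here. The assumption is that the chart uses HTTP probes against the configured `port`; the real template may set additional fields such as timeouts or thresholds.

```yaml
# Sketch only: fragment of a container spec, not a complete manifest.
livenessProbe:
  httpGet:
    path: /health/live       # healthChecks.liveness.path
    port: 8080               # assumed to follow the chart's `port` value
  initialDelaySeconds: 15    # healthChecks.liveness.initialDelaySeconds
readinessProbe:
  httpGet:
    path: /health/ready      # healthChecks.readiness.path
    port: 8080
  initialDelaySeconds: 5     # healthChecks.readiness.initialDelaySeconds
```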

---

## Secrets

Use Vault-injected secrets. Do not put secret values in `values.yaml`, whether as plain values or under the chart's `env` map.

Reference the secret by name in your values and list the keys to expose:

```yaml
deployment:                # `job` and `cronjob` accept the same key
  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL, API_KEY]
```

The platform creates the Kubernetes Secret from Vault automatically during deployment. Store secrets at `secret/your-service/{key}` in Vault.
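
One plausible way for a chart to expose those keys to the container is through `secretKeyRef` entries, sketched below. Whether the platform chart renders exactly this form is an assumption; the secret name and keys are taken from the values example above.

```yaml
# Sketch only: fragment of a container spec, not a complete manifest.
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: your-service-secrets   # envFromSecret[].secretName
        key: DATABASE_URL
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: your-service-secrets
        key: API_KEY
```

Under this form, the secret values never appear in the rendered manifest; only the reference to the Kubernetes Secret does.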

---

## Troubleshooting

**Pod stuck in `Pending`:**
```bash
kubectl describe pod -n yourteam -l app=your-service
```
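Common causes: no node has enough free capacity for the pod's requests, or a PVC cannot bind (for example, because its storage class does not exist). If the namespace quota is exceeded, the pod is usually not created at all; look for `FailedCreate` events on the owning ReplicaSet, StatefulSet, or Job instead.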

**Pod in `CrashLoopBackOff`:**
```bash
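# --tail is set explicitly because kubectl shows only 10 lines per pod when a selector is used
kubectl logs -n yourteam -l app=your-service --previous --tail=100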
```
Common causes: missing environment variable, failed DB connection on startup.

**`ImagePullBackOff`:**
```bash
crane ls registry.mycorp.internal/your-service
```
Verify the tag exists in the registry and the namespace has pull credentials.
