# Kubernetes
The platform provides a library of Helm charts for common workload types. Instead of writing Kubernetes manifests from scratch, teams reference a chart from the central registry and pass a values.yaml that overrides the defaults.
Charts are maintained by the platform team in a dedicated repository and published to the internal OCI registry at registry.mycorp.internal/charts. Teams do not copy or modify the charts — they only declare a dependency and provide values.
## How it works
```mermaid
flowchart LR
    A[Your values.yaml] -->|helm upgrade| B[Platform chart library]
    B --> C{Chart type}
    C -->|Deployment| D[Deployment + Service + HPA]
    C -->|Job| E[Job + ConfigMap]
    C -->|CronJob| F[CronJob + ConfigMap]
    C -->|StatefulSet| G[StatefulSet + Service + PVC]
    D & E & F & G -->|apply| H[Kubernetes cluster]
```
Reference the chart as a dependency in your service’s `Chart.yaml`:

```yaml
# deploy/helm/Chart.yaml
apiVersion: v2
name: your-service
version: 1.0.0
dependencies:
  - name: deployment  # or: job | cronjob | statefulset
    version: ">=1.0.0"
    repository: oci://registry.mycorp.internal/charts
```
Then, in your `values.yaml`, override only what you need. Sensible defaults are provided for everything else.
## Chart types
### Deployment

For long-running services that handle HTTP traffic.
Minimal `values.yaml`:

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest
  port: 8080
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
```
Full reference:

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest
    pullPolicy: IfNotPresent
  replicas: 2  # ignored if autoscaling.enabled = true
  port: 8080
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 15
    readiness:
      path: /health/ready
      initialDelaySeconds: 5
  env:
    LOG_LEVEL: info
    AUTH_SERVICE_URL: http://auth.security.svc.cluster.local
  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL, API_KEY]
  metrics:
    enabled: true  # adds prometheus.io annotations to the Service
    path: /metrics
  ingress:
    enabled: false
    host: your-service.mycorp.internal
```
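Because only overridden keys need to appear, a real `values.yaml` is usually far shorter than the full reference. For instance, exposing an otherwise-default service through the ingress could be as small as the sketch below (the image tag is a placeholder):

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: 2.3.1        # placeholder: pin a concrete release rather than "latest"
  port: 8080
  ingress:
    enabled: true     # everything else keeps the chart defaults
    host: your-service.mycorp.internal
```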
### Job

For one-time tasks: database migrations, data imports, cleanup runs.
```yaml
job:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest
  command: ["python", "manage.py", "migrate"]
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
  restartPolicy: OnFailure
  backoffLimit: 3
  # Pass arbitrary config as a mounted ConfigMap
  config:
    DB_HOST: postgres.fintech.svc.cluster.local
    DB_NAME: payments
  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL]
```
Run a Job manually:

```shell
helm dependency update deploy/helm
helm upgrade --install migrate ./deploy/helm \
  -n yourteam \
  -f deploy/helm/values.yaml \
  --set job.image.tag=1.4.2
```
### CronJob

For scheduled tasks: report generation, cache warming, health probes.
```yaml
cronjob:
  image:
    repository: registry.mycorp.internal/reporter
    tag: latest
  schedule: "0 3 * * *"  # daily at 03:00 UTC
  command: ["python", "generate_report.py"]
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
  concurrencyPolicy: Forbid  # prevent overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  envFromSecret:
    - secretName: reporter-secrets
      keys: [S3_BUCKET, AWS_ROLE_ARN]
```
### StatefulSet

For stateful workloads: workers that require stable network identities or persistent storage.
```yaml
statefulset:
  image:
    repository: registry.mycorp.internal/worker
    tag: latest
  replicas: 3
  port: 9000
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  persistence:
    enabled: true
    storageClass: standard-ssd
    size: 10Gi
    mountPath: /data
  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 30
    readiness:
      path: /health/ready
      initialDelaySeconds: 10
```
## Namespaces
Each team gets a dedicated namespace. Services in different namespaces communicate through the api-gateway or via explicit NetworkPolicy rules.
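As an illustration of the second option, a NetworkPolicy that admits traffic from one specific peer namespace could look like the sketch below. The resource name, namespaces, labels, and port here are placeholders, not platform conventions; check with the platform team for the labels your cluster actually uses.

```yaml
# Illustrative sketch only: all names and labels are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-peer-namespace   # hypothetical name
  namespace: yourteam               # the namespace receiving the traffic
spec:
  podSelector:
    matchLabels:
      app: your-service             # only open up this service's pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: peer-team  # assumed caller namespace
      ports:
        - protocol: TCP
          port: 8080
```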
| Team | Namespace |
|---|---|
| team-platform | |
| team-security | |
| team-fintech | |
| team-communications | |
| team-account | |
## Resource guidelines
Set `resources.requests` and `resources.limits` on every workload. The platform enforces resource quotas per namespace.
| Tier | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|
| | 250m | 1000m | 256Mi | 1Gi |
| | 100m | 500m | 128Mi | 256Mi |
| | 50m | 200m | 64Mi | 128Mi |
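For context, the per-namespace quota behind these guidelines is a standard Kubernetes ResourceQuota object. The sketch below uses made-up numbers, not your team's actual allocation:

```yaml
# Illustrative sketch: name and figures are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota        # hypothetical name
  namespace: yourteam
spec:
  hard:
    requests.cpu: "4"     # sum of CPU requests across all pods in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```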
Adjust based on profiling. If a pod is OOMKilled, increase `limits.memory`. If it is CPU-throttled (visible in the Grafana CPU panel), increase `limits.cpu`.
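For example, raising the memory ceiling after an OOMKill is a small change in your values (numbers illustrative):

```yaml
deployment:
  resources:
    limits:
      memory: 512Mi   # illustrative: raised from 256Mi after observing OOMKilled restarts
```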
## Health checks
The Deployment and StatefulSet charts require two endpoints. Defaults match the values below — override if your service uses different paths.
| Endpoint | Purpose | Default path |
|---|---|---|
| Liveness | Is the process alive? Kubernetes restarts the pod on failure. | `/health/live` |
| Readiness | Is the service ready for traffic? Kubernetes stops routing on failure. | `/health/ready` |
Implementation rule: the liveness endpoint must be fast and stateless (no DB calls). The readiness endpoint should check critical dependencies.
## Secrets
Use Vault-injected secrets. Do not put secrets in `values.yaml` or hardcode them as environment variables in the chart.
Reference secrets in your values:

```yaml
envFromSecret:
  - secretName: your-service-secrets
    keys: [DATABASE_URL, API_KEY]
```
The platform creates the Kubernetes Secret from Vault automatically during deployment. Store secrets at `secret/your-service/{key}` in Vault.
## Troubleshooting
Pod stuck in `Pending`:

```shell
kubectl describe pod -n yourteam -l app=your-service
```

Common cause: namespace quota exceeded or missing PVC storage class.

Pod in `CrashLoopBackOff`:

```shell
kubectl logs -n yourteam -l app=your-service --previous
```

Common causes: missing environment variable, failed DB connection on startup.

`ImagePullBackOff`:

```shell
crane ls registry.mycorp.internal/your-service
```

Verify the tag exists in the registry and that the namespace has pull credentials.