Kubernetes#

The platform provides a library of Helm charts for common workload types. Instead of writing Kubernetes manifests from scratch, teams reference a chart from the central registry and pass a values.yaml that overrides the defaults.

Charts are maintained by the platform team in a dedicated repository and published to the internal OCI registry at registry.mycorp.internal/charts. Teams do not copy or modify the charts — they only declare a dependency and provide values.


How it works#

```mermaid
flowchart LR
    A[Your values.yaml] -->|helm upgrade| B[Platform chart library]
    B --> C{Chart type}
    C -->|Deployment| D[Deployment + Service + HPA]
    C -->|Job| E[Job + ConfigMap]
    C -->|CronJob| F[CronJob + ConfigMap]
    C -->|StatefulSet| G[StatefulSet + Service + PVC]
    D & E & F & G -->|apply| H[Kubernetes cluster]
```

Reference the chart as a dependency in your service’s Chart.yaml:

```yaml
# deploy/helm/Chart.yaml
apiVersion: v2
name: your-service
version: 1.0.0
dependencies:
  - name: deployment        # or: job | cronjob | statefulset
    version: ">=1.0.0"
    repository: oci://registry.mycorp.internal/charts
```

Then in your values.yaml, override only what you need. Sensible defaults are provided for everything else.
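To see exactly what your overrides render to before deploying, you can preview the manifests locally (a sketch; the release name `your-service` and the `deploy/helm` path are illustrative):

```shell
# Pull the platform chart dependency declared in Chart.yaml
helm dependency update deploy/helm

# Render the manifests locally without applying anything to the cluster
helm template your-service deploy/helm -f deploy/helm/values.yaml
```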


Chart types#

Deployment#

For long-running services that handle HTTP traffic.

Minimal values.yaml:

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest

  port: 8080

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
```

Full reference:

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest
    pullPolicy: IfNotPresent

  replicas: 2          # ignored if autoscaling.enabled = true
  port: 8080

  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi

  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 15
    readiness:
      path: /health/ready
      initialDelaySeconds: 5

  env:
    LOG_LEVEL: info
    AUTH_SERVICE_URL: http://auth.security.svc.cluster.local

  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL, API_KEY]

  metrics:
    enabled: true        # adds prometheus.io annotations to Service
    path: /metrics

  ingress:
    enabled: false
    host: your-service.mycorp.internal
```

Job#

For one-time tasks: database migrations, data imports, cleanup runs.

```yaml
job:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest

  command: ["python", "manage.py", "migrate"]

  resources:
    requests:
      cpu: 200m
      memory: 256Mi

  restartPolicy: OnFailure
  backoffLimit: 3

  # Pass arbitrary config as a mounted ConfigMap
  config:
    DB_HOST: postgres.fintech.svc.cluster.local
    DB_NAME: payments

  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL]
```

Run a Job manually:

```shell
helm dependency update deploy/helm
helm upgrade --install migrate ./deploy/helm \
  -n yourteam \
  -f deploy/helm/values.yaml \
  --set job.image.tag=1.4.2
```

CronJob#

For scheduled tasks: report generation, cache warming, health probes.

```yaml
cronjob:
  image:
    repository: registry.mycorp.internal/reporter
    tag: latest

  schedule: "0 3 * * *"    # daily at 03:00 UTC

  command: ["python", "generate_report.py"]

  resources:
    requests:
      cpu: 100m
      memory: 128Mi

  concurrencyPolicy: Forbid     # prevent overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1

  envFromSecret:
    - secretName: reporter-secrets
      keys: [S3_BUCKET, AWS_ROLE_ARN]
```

StatefulSet#

For stateful workloads: workers that require stable network identities or persistent storage.

```yaml
statefulset:
  image:
    repository: registry.mycorp.internal/worker
    tag: latest

  replicas: 3
  port: 9000

  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi

  persistence:
    enabled: true
    storageClass: standard-ssd
    size: 10Gi
    mountPath: /data

  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 30
    readiness:
      path: /health/ready
      initialDelaySeconds: 10
```

Namespaces#

Each team gets a dedicated namespace. Services in different namespaces communicate through the api-gateway or via explicit NetworkPolicy rules.

| Team | Namespace |
| --- | --- |
| team-platform | platform |
| team-security | security |
| team-fintech | fintech |
| team-communications | communications |
| team-account | account |
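Cross-namespace calls that bypass the api-gateway need an explicit NetworkPolicy in the receiving namespace. A minimal sketch, assuming a service in fintech wants to accept traffic from the platform namespace on port 8080 (names, labels, and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-platform
  namespace: fintech
spec:
  # Select the pods this policy protects
  podSelector:
    matchLabels:
      app: your-service
  policyTypes: [Ingress]
  ingress:
    - from:
        # Admit traffic from any pod in the platform namespace
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: platform
      ports:
        - protocol: TCP
          port: 8080
```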


Resource guidelines#

Set resources.requests and resources.limits on every workload. The platform enforces resource quotas per namespace.

| Tier | CPU request | CPU limit | Memory request | Memory limit |
| --- | --- | --- | --- | --- |
| critical | 250m | 1000m | 256Mi | 1Gi |
| standard | 100m | 500m | 128Mi | 256Mi |
| internal | 50m | 200m | 64Mi | 128Mi |

Adjust based on profiling. If a pod is OOMKilled, increase limits.memory. If it is CPU-throttled (visible in the Grafana CPU panel), increase limits.cpu.
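The per-namespace quota that these requests count against is a standard Kubernetes ResourceQuota. A sketch of the shape involved (the numbers here are illustrative, not the actual platform values):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: yourteam
spec:
  hard:
    # Sum of requests/limits across all pods in the namespace
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

A pod whose requests would push the namespace over any of these sums stays in Pending until capacity is freed.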


Health checks#

The Deployment and StatefulSet charts require two endpoints. Defaults match the values below — override if your service uses different paths.

| Endpoint | Purpose | Default path |
| --- | --- | --- |
| Liveness | Is the process alive? Kubernetes restarts the pod on failure. | /health/live |
| Readiness | Is the service ready for traffic? Kubernetes stops routing on failure. | /health/ready |

Implementation rule: the liveness endpoint must be fast and stateless (no DB calls). The readiness endpoint should check critical dependencies.
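A minimal sketch of the two endpoints using only the Python standard library. The dependency checks are placeholders; wire in real probes (DB ping, cache ping) for readiness, and keep liveness free of any external calls:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, List


def liveness_status() -> int:
    """Fast and stateless: answering at all proves the process is alive."""
    return 200


def readiness_status(checks: List[Callable[[], bool]]) -> int:
    """Ready only if every critical dependency check passes."""
    return 200 if all(check() for check in checks) else 503


class HealthHandler(BaseHTTPRequestHandler):
    # Replace with real dependency probes, e.g. [db_ping, cache_ping]
    readiness_checks: List[Callable[[], bool]] = []

    def do_GET(self):
        if self.path == "/health/live":
            status = liveness_status()
        elif self.path == "/health/ready":
            status = readiness_status(self.readiness_checks)
        else:
            status = 404
        self.send_response(status)
        self.end_headers()


# To serve on the chart's default port:
# HTTPServer(("", 8080), HealthHandler).serve_forever()
```

Returning 503 from readiness takes the pod out of the Service's endpoints without restarting it, which is exactly what you want during a transient dependency outage.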


Secrets#

Use Vault-injected secrets. Do not put secrets in values.yaml or environment variables in the chart.

Reference secrets in your values:

```yaml
envFromSecret:
  - secretName: your-service-secrets
    keys: [DATABASE_URL, API_KEY]
```

The platform creates the Kubernetes Secret from Vault automatically during deployment. Store secrets at secret/your-service/{key} in Vault.
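With the Vault CLI, writing the secrets might look like the following sketch (values are placeholders; the mount name and whether each key gets its own path or lives as a field under one path depend on how the platform's Vault is laid out):

```shell
# One path per key, following the secret/your-service/{key} convention
vault kv put secret/your-service/DATABASE_URL value='postgres://user:pass@host:5432/db'
vault kv put secret/your-service/API_KEY value='changeme'
```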


Troubleshooting#

Pod stuck in Pending:

```shell
kubectl describe pod -n yourteam -l app=your-service
```

Common cause: namespace quota exceeded or missing PVC storage class.

Pod in CrashLoopBackOff:

```shell
kubectl logs -n yourteam -l app=your-service --previous
```

Common causes: missing environment variable, failed DB connection on startup.

ImagePullBackOff:

```shell
crane ls registry.mycorp.internal/your-service
```

Verify the tag exists in the registry and the namespace has pull credentials.