# Kubernetes
The platform provides a library of Helm charts for common workload types. Instead of writing Kubernetes manifests from scratch, teams reference a chart from the central registry and pass a values.yaml that overrides the defaults.
Charts are maintained by the platform team in a dedicated repository and published to the internal OCI registry at registry.mycorp.internal/charts. Teams do not copy or modify the charts — they only declare a dependency and provide values.
## How it works
```mermaid
flowchart LR
    A[Your values.yaml] -->|helm upgrade| B[Platform chart library]
    B --> C{Chart type}
    C -->|Deployment| D[Deployment + Service + HPA]
    C -->|Job| E[Job + ConfigMap]
    C -->|CronJob| F[CronJob + ConfigMap]
    C -->|StatefulSet| G[StatefulSet + Service + PVC]
    D & E & F & G -->|apply| H[Kubernetes cluster]
```
Reference the chart as a dependency in your service’s `Chart.yaml`:

```yaml
# deploy/helm/Chart.yaml
apiVersion: v2
name: your-service
version: 1.0.0
dependencies:
  - name: deployment  # or: job | cronjob | statefulset
    version: ">=1.0.0"
    repository: oci://registry.mycorp.internal/charts
```
Then, in your `values.yaml`, override only what you need. Sensible defaults are provided for everything else.
## Chart types
### Deployment

For long-running services that handle HTTP traffic.
Minimal `values.yaml`:

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest
  port: 8080
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
```
Full reference:

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest
    pullPolicy: IfNotPresent
  replicas: 2  # ignored if autoscaling.enabled = true
  port: 8080
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 256Mi
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 15
    readiness:
      path: /health/ready
      initialDelaySeconds: 5
  env:
    LOG_LEVEL: info
    AUTH_SERVICE_URL: http://auth.security.svc.cluster.local
  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL, API_KEY]
  metrics:
    enabled: true  # adds prometheus.io annotations to the Service
    path: /metrics
  ingress:
    enabled: false
    host: your-service.mycorp.internal
```
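Because only overridden keys need to appear, a real `values.yaml` is usually far shorter than the full reference. For instance, exposing an otherwise-default service through the ingress could be as small as the sketch below (the image tag is a placeholder):

```yaml
deployment:
  image:
    repository: registry.mycorp.internal/your-service
    tag: 2.3.1        # placeholder: pin a concrete release rather than "latest"
  port: 8080
  ingress:
    enabled: true     # everything else keeps the chart defaults
    host: your-service.mycorp.internal
```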
### Job

For one-time tasks: database migrations, data imports, cleanup runs.
```yaml
job:
  image:
    repository: registry.mycorp.internal/your-service
    tag: latest
  command: ["python", "manage.py", "migrate"]
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
  restartPolicy: OnFailure
  backoffLimit: 3
  # Pass arbitrary config as a mounted ConfigMap
  config:
    DB_HOST: postgres.fintech.svc.cluster.local
    DB_NAME: payments
  envFromSecret:
    - secretName: your-service-secrets
      keys: [DATABASE_URL]
```
Run a Job manually:

```shell
helm dependency update deploy/helm
helm upgrade --install migrate ./deploy/helm \
  -n yourteam \
  -f deploy/helm/values.yaml \
  --set job.image.tag=1.4.2
```
### CronJob

For scheduled tasks: report generation, cache warming, health probes.
```yaml
cronjob:
  image:
    repository: registry.mycorp.internal/reporter
    tag: latest
  schedule: "0 3 * * *"  # daily at 03:00 UTC
  command: ["python", "generate_report.py"]
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
  concurrencyPolicy: Forbid  # prevent overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  envFromSecret:
    - secretName: reporter-secrets
      keys: [S3_BUCKET, AWS_ROLE_ARN]
```
### StatefulSet

For stateful workloads: workers that require stable network identities or persistent storage.
```yaml
statefulset:
  image:
    repository: registry.mycorp.internal/worker
    tag: latest
  replicas: 3
  port: 9000
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  persistence:
    enabled: true
    storageClass: standard-ssd
    size: 10Gi
    mountPath: /data
  healthChecks:
    liveness:
      path: /health/live
      initialDelaySeconds: 30
    readiness:
      path: /health/ready
      initialDelaySeconds: 10
```
## Namespaces
Each team gets a dedicated namespace. Services in different namespaces communicate through the api-gateway or via explicit NetworkPolicy rules.
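As an illustration of the second option, a NetworkPolicy that admits traffic from one specific peer namespace could look like the sketch below. The resource name, namespaces, labels, and port here are placeholders, not platform conventions; check with the platform team for the labels your cluster actually uses.

```yaml
# Illustrative sketch only: all names and labels are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-peer-namespace   # hypothetical name
  namespace: yourteam               # the namespace receiving the traffic
spec:
  podSelector:
    matchLabels:
      app: your-service             # only open up this service's pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: peer-team  # assumed caller namespace
      ports:
        - protocol: TCP
          port: 8080
```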
| Team | Namespace |
|---|---|
| team-platform | |
| team-security | |
| team-fintech | |
| team-communications | |
| team-account | |
## Resource guidelines
Set `resources.requests` and `resources.limits` on every workload. The platform enforces resource quotas per namespace.
| Tier | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|
| | 250m | 1000m | 256Mi | 1Gi |
| | 100m | 500m | 128Mi | 256Mi |
| | 50m | 200m | 64Mi | 128Mi |
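For context, the per-namespace quota behind these guidelines is a standard Kubernetes ResourceQuota object. The sketch below uses made-up numbers, not your team's actual allocation:

```yaml
# Illustrative sketch: name and figures are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota        # hypothetical name
  namespace: yourteam
spec:
  hard:
    requests.cpu: "4"     # sum of CPU requests across all pods in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```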
Adjust based on profiling. If a pod is OOMKilled, increase `limits.memory`. If it is CPU-throttled (visible in the Grafana CPU panel), increase `limits.cpu`.
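For example, raising the memory ceiling after an OOMKill is a small change in your values (numbers illustrative):

```yaml
deployment:
  resources:
    limits:
      memory: 512Mi   # illustrative: raised from 256Mi after observing OOMKilled restarts
```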
## Health checks
The Deployment and StatefulSet charts require two endpoints. Defaults match the values below — override if your service uses different paths.
| Endpoint | Purpose | Default path |
|---|---|---|
| Liveness | Is the process alive? Kubernetes restarts the pod on failure. | `/health/live` |
| Readiness | Is the service ready for traffic? Kubernetes stops routing on failure. | `/health/ready` |
Implementation rule: the liveness endpoint must be fast and stateless (no DB calls). The readiness endpoint should check critical dependencies.
## Secrets
Use Vault-injected secrets. Do not put secrets in `values.yaml` or hardcode them as environment variables in the chart.
Reference secrets in your values:

```yaml
envFromSecret:
  - secretName: your-service-secrets
    keys: [DATABASE_URL, API_KEY]
```
The platform creates the Kubernetes Secret from Vault automatically during deployment. Store secrets at `secret/your-service/{key}` in Vault.
## Troubleshooting
Pod stuck in `Pending`:

```shell
kubectl describe pod -n yourteam -l app=your-service
```

Common cause: namespace quota exceeded or missing PVC storage class.

Pod in `CrashLoopBackOff`:

```shell
kubectl logs -n yourteam -l app=your-service --previous
```

Common causes: missing environment variable, failed DB connection on startup.

`ImagePullBackOff`:

```shell
crane ls registry.mycorp.internal/your-service
```

Verify the tag exists in the registry and that the namespace has pull credentials.