EKS Health Checks — Liveness, Readiness & Monitoring Setup

Kubernetes has three probe types that control when containers are restarted and when pods receive traffic: liveness, readiness, and startup probes. Configuring them correctly on EKS is the difference between zero-downtime deployments and rolling outages that take down production.

This guide covers probe configuration, ALB Ingress Controller health checks, cluster-level monitoring with CloudWatch Container Insights, and how to expose service health externally through a status page.

The three probe types

Liveness probe

A liveness probe tells Kubernetes whether a container is alive. If it fails, the container is killed and restarted. Use it to detect deadlocks or states the container can't recover from on its own.

livenessProbe:
  httpGet:
    path: /api/health/live
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

The liveness endpoint should be extremely lightweight — just verify the process is alive. Do not check database connectivity in the liveness probe. If the database goes down, you don't want all your pods restarting, which makes the situation worse.

Readiness probe

A readiness probe tells Kubernetes whether a container is ready to accept traffic. If it fails, the pod is removed from service endpoints — traffic is routed elsewhere — but the container is not restarted.

readinessProbe:
  httpGet:
    path: /api/health/ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

The readiness endpoint should check real dependencies: the database, cache, and external APIs that the pod needs to function. A pod that can't reach its database should stop receiving traffic, but it should not restart (the DB issue isn't a pod problem).
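
As a rough sketch of that readiness logic in Python: assume each dependency is wrapped in a zero-argument callable that returns normally when the dependency is reachable and raises otherwise. The names check_readiness, ping_db, and ping_cache are hypothetical stand-ins, not part of any library.

```python
from typing import Callable, Dict, Tuple


def check_readiness(checks: Dict[str, Callable[[], None]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check and return (HTTP status, per-check results).

    A check that raises marks the pod as not ready (503). The process keeps
    running, so nothing here gives the kubelet a reason to restart it.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # report the failure instead of crashing
            results[name] = f"failed: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


# Hypothetical dependency pings; replace with real client calls.
def ping_db() -> None:
    pass  # e.g. run "SELECT 1" with a short timeout


def ping_cache() -> None:
    raise ConnectionError("cache unreachable")
```

Calling check_readiness({"db": ping_db, "cache": ping_cache}) yields a 503 and a body naming the cache as the failing dependency. Each check should finish well within the probe's timeoutSeconds (1 second by default), otherwise the probe fails on the timeout rather than on the result.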

Startup probe

Startup probes buy time for slow-starting containers before liveness checks kick in. Without them, a slow-booting JVM or a service running database migrations on startup can be killed by the liveness probe before it finishes starting.

startupProbe:
  httpGet:
    path: /api/health/live
    port: 3000
  failureThreshold: 30
  periodSeconds: 5

This gives the container up to 150 seconds (30 × 5s) to start before liveness checks begin.
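
The same arithmetic applies to every probe, so it can help to compute the windows side by side. This is a small sketch using the probe values from this guide; probe_window_seconds is a hypothetical helper, and the result is an approximation that ignores timeoutSeconds and scheduling jitter.

```python
def probe_window_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Approximate seconds of consecutive failures a probe tolerates before it acts."""
    return failure_threshold * period_seconds


startup = probe_window_seconds(failure_threshold=30, period_seconds=5)
liveness = probe_window_seconds(failure_threshold=3, period_seconds=10)
readiness = probe_window_seconds(failure_threshold=3, period_seconds=5)
print(startup, liveness, readiness)  # 150 30 15
```

With these settings a container gets about 150 seconds to boot, a hung container is restarted after roughly 30 seconds of failed liveness checks, and an unready pod leaves the service endpoints after roughly 15 seconds.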

Separate health endpoints by concern

Don't use a single /health endpoint for all three probes. Implement separate paths:

/api/health/live: is the process running? Always returns 200 unless the process is truly broken.
/api/health/ready: can this pod serve requests? Checks DB, cache, etc. Returns 503 if not ready.
/api/health: comprehensive check for external monitors and dashboards. Returns DB status, latency, version info.
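
One way to wire up the three paths, sketched with Python's standard library only. The function names, the response bodies, and the placeholder version string are assumptions made for illustration. Whether /api/health should return 503 when a dependency is degraded is a design choice; this sketch assumes it does.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Tuple


def route(path: str, dependencies_ok: Callable[[], bool]) -> Tuple[int, dict]:
    """Map a health path to (HTTP status, JSON body)."""
    if path == "/api/health/live":
        # Liveness: being able to answer at all shows the process is running.
        return 200, {"status": "alive"}
    if path == "/api/health/ready":
        ok = dependencies_ok()
        return (200 if ok else 503), {"status": "ready" if ok else "not ready"}
    if path == "/api/health":
        ok = dependencies_ok()
        body = {"status": "ok" if ok else "degraded", "version": "0.0.0"}  # placeholder version
        return (200 if ok else 503), body
    return 404, {"error": "not found"}


def make_handler(dependencies_ok: Callable[[], bool]):
    """Build a request handler class bound to the given dependency check."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = route(self.path, dependencies_ok)
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(dependencies_ok: Callable[[], bool], port: int = 3000) -> None:
    """Serve the health endpoints until the process is stopped."""
    HTTPServer(("", port), make_handler(dependencies_ok)).serve_forever()
```

Calling serve(lambda: True) starts the endpoints on port 3000, matching the probe examples. Keeping route as a pure function makes the status-code rules easy to unit test without opening a socket.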

ALB Ingress Controller health checks

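If you're using the AWS Load Balancer Controller (formerly ALB Ingress Controller) with an Ingress resource, the ALB performs its own health checks against its registered targets. With target-type: ip, as in the example here, it checks each pod's IP and container port directly; with target-type: instance, it checks the Service's node port. Configure these via annotations: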

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /api/health/ready
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "15"
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: "5"
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
    alb.ingress.kubernetes.io/unhealthy-threshold-count: "3"
    alb.ingress.kubernetes.io/success-codes: "200"

Use the readiness endpoint for ALB health checks — the same check that Kubernetes uses. This ensures the ALB only routes traffic to pods that Kubernetes also considers ready.
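
Sharing a path does not mean the two checks react at the same speed. The following is a minimal sketch of a consistency check, assuming the probe and annotations have already been loaded into plain dictionaries. alignment_issues is a hypothetical helper, and it treats any difference as worth reviewing even though a slower ALB window may be perfectly acceptable.

```python
from typing import Dict, List

PREFIX = "alb.ingress.kubernetes.io/"


def alignment_issues(readiness_probe: Dict, annotations: Dict[str, str]) -> List[str]:
    """Return human-readable mismatches between a readiness probe and ALB annotations."""
    issues = []
    probe_path = readiness_probe["httpGet"]["path"]
    alb_path = annotations.get(PREFIX + "healthcheck-path")
    if alb_path != probe_path:
        issues.append(f"path: probe={probe_path} alb={alb_path}")

    # Seconds of consecutive failures before each system stops sending traffic.
    probe_window = readiness_probe["failureThreshold"] * readiness_probe["periodSeconds"]
    alb_window = (int(annotations[PREFIX + "unhealthy-threshold-count"])
                  * int(annotations[PREFIX + "healthcheck-interval-seconds"]))
    if alb_window != probe_window:
        issues.append(f"failure window: probe={probe_window}s alb={alb_window}s")
    return issues


# Values copied from the readiness probe and Ingress annotations in this guide.
probe = {"httpGet": {"path": "/api/health/ready", "port": 3000},
         "periodSeconds": 5, "failureThreshold": 3}
annotations = {
    PREFIX + "healthcheck-path": "/api/health/ready",
    PREFIX + "healthcheck-interval-seconds": "15",
    PREFIX + "unhealthy-threshold-count": "3",
}
print(alignment_issues(probe, annotations))
# ['failure window: probe=15s alb=45s']
```

With the values shown in this guide the paths match, but the ALB needs about 45 seconds of failures to mark a target unhealthy while Kubernetes needs about 15. Lowering the ALB interval narrows that gap if it matters for your traffic.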

CloudWatch Container Insights

Enable Container Insights to get cluster, node, and pod metrics in CloudWatch automatically:

# Installs the add-on; use update-addon to change the version of an existing installation.
aws eks create-addon \
  --cluster-name prod \
  --addon-name amazon-cloudwatch-observability \
  --addon-version v1.7.0-eksbuild.1

Key metrics to alarm on:

namespace_number_of_running_pods: alert when running pods in a namespace drop below your desired replica count. Indicates crash-looping or a failed rollout.
node_cpu_utilization / node_memory_utilization: node pressure that causes pod evictions and OOMKill events.
pod_cpu_utilization_over_pod_limit: pods running close to their CPU limits, which causes throttling and latency spikes.

# Set --threshold to your desired replica count and replace the placeholder account ID in the SNS ARN.
aws cloudwatch put-metric-alarm \
  --alarm-name "eks-running-pods-below-desired" \
  --namespace "ContainerInsights" \
  --metric-name namespace_number_of_running_pods \
  --dimensions Name=ClusterName,Value=prod Name=Namespace,Value=default \
  --statistic Average \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 2 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts
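
To make the alarm settings concrete, here is a simplified Python model of how such an alarm could evaluate: it fires only when the last three one-minute datapoints are all below the threshold. alarm_fires is a hypothetical helper written for illustration; real CloudWatch evaluation also handles missing data and "M out of N" settings, which this sketch leaves out.

```python
from typing import Sequence


def alarm_fires(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int) -> bool:
    """True when the most recent `evaluation_periods` datapoints are all below threshold."""
    if len(datapoints) < evaluation_periods:
        return False
    recent = datapoints[-evaluation_periods:]
    return all(value < threshold for value in recent)


print(alarm_fires([3, 3, 1, 1, 1], threshold=2, evaluation_periods=3))  # True
print(alarm_fires([3, 1, 1, 2, 1], threshold=2, evaluation_periods=3))  # False
```

Requiring three consecutive breaching periods trades about three minutes of detection delay for fewer false alarms during normal rolling updates.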

Detecting issues before Kubernetes does

Kubernetes health checks are optimised for container management — they're not designed to detect slow degradation, rising error rates, or partial failures that don't trigger a pod restart.

Pair your EKS health checks with:

External uptime monitor: makes real HTTP requests to your ALB or CloudFront endpoint every 60 seconds, independent of cluster state. Catches DNS and network-level issues the cluster can't see.
ALB 5xx alarm: catches error responses that reach users while the pods still pass their probes and look healthy to Kubernetes.
P99 latency alarm: rising tail latency often precedes pod crashes. Alarm on TargetResponseTime p99 > 2s on your ALB target group.
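
For illustration, an external uptime check can be approximated in a few lines of Python. This is a sketch, not how any particular monitoring product works: http_ok and UptimeMonitor are hypothetical names, and the default of three consecutive failures before alerting is an assumption chosen for the example.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP error statuses.
        return False


class UptimeMonitor:
    """Tracks consecutive failed checks and reports when a threshold is reached."""

    def __init__(self, check: Callable[[], bool], failures_to_alert: int = 3):
        self.check = check
        self.failures_to_alert = failures_to_alert
        self.consecutive_failures = 0

    def run_once(self) -> bool:
        """Run one check; return True if the alert threshold has been reached."""
        if self.check():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failures_to_alert
```

A scheduler would build UptimeMonitor(lambda: http_ok("https://example.com/api/health")), where the URL is a placeholder for your public endpoint, and call run_once() every 60 seconds. Running it from outside the cluster and outside the VPC is what lets it see DNS and network failures.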

Publishing EKS health externally

When your EKS cluster has an incident — a deployment goes wrong, a node group runs out of capacity, a pod starts crash-looping — your users are affected immediately. Without a status page, they have no way to know whether to wait or contact support.

Add your public ALB or CloudFront URL to PulseRadar as a monitor. When the external check fails three consecutive times, an incident is automatically opened, your team is alerted, and a public status page shows the impact — giving users a self-serve answer and reducing support ticket volume during outages.

EKS health check checklist

Separate /live, /ready, and /health endpoints.
Liveness probe checking process health only (no external dependencies).
Readiness probe checking real dependencies (DB, cache, required APIs).
Startup probe for any service that takes more than 15s to boot.
ALB Ingress annotations aligned with readiness probe path and thresholds.
Container Insights enabled for cluster-level pod count and resource alarms.
External uptime monitor hitting the public ALB/CloudFront URL.
Public status page for user-facing incident communication.

Kubernetes is self-healing by design — but only for failures it can detect. The combination of well-configured probes, cluster-level monitoring, and external uptime checks means you catch problems at every layer, not just the ones the kubelet can see.