How to Monitor ECS Services — Health Checks, Alerts & Uptime

Running containers on Amazon ECS gives you scalability and operational simplicity — but it also moves health monitoring up the stack. With EC2, you knew the machine was alive if SSH worked. With ECS, you have tasks, services, target groups, and application logic all in play simultaneously. Any layer can fail independently.

This guide covers the four levels of ECS monitoring every production team needs, plus how to surface outages to your users through a public status page.

1. Container-level health checks

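The first line of defence is a container health check. Note that ECS only acts on the healthCheck block in your task definition; a HEALTHCHECK instruction baked into the Dockerfile is honoured by plain Docker (for example in local development) but is not monitored by ECS. ECS uses the task definition check to decide whether a running container is healthy.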

# Dockerfile
HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
  CMD curl -f http://localhost:3000/api/health || exit 1

For ECS, define the same check in the task definition JSON (the command runs inside the container, so the image must include curl):

"healthCheck": {
  "command": ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"],
  "interval": 30,
  "timeout": 5,
  "retries": 3,
  "startPeriod": 15
}

When an essential container fails its health check three consecutive times (the retries value), ECS marks the task as UNHEALTHY, and the service scheduler stops it and starts a replacement. Setting startPeriod gives your app time to boot before failed checks count against it, which is crucial for JVM services or apps that run migrations on startup.

Your health endpoint should verify real dependencies — database connectivity, cache reachability — not just return a static 200. A static response means the process is alive, not that the service works.
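
As a rough illustration of a dependency-aware health endpoint, here is a minimal sketch using only the Python standard library. The probe functions check_database and check_cache are hypothetical placeholders (they are not from any library), and the /api/health path and port 3000 simply mirror the health check command used earlier in this guide.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    """Hypothetical probe: replace with a real query such as SELECT 1."""
    return True


def check_cache() -> bool:
    """Hypothetical probe: replace with a real PING against your cache."""
    return True


PROBES: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "cache": check_cache,
}


def evaluate_health(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every probe; return 200 only if all pass, otherwise 503."""
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises counts as a failed dependency.
            results[name] = False
    healthy = all(results.values())
    status = 200 if healthy else 503
    return status, {"status": "ok" if healthy else "unhealthy", "checks": results}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/api/health":
            self.send_error(404)
            return
        status, body = evaluate_health(PROBES)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 3000) -> None:
    """Start the health server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the endpoint. Because curl -f exits non-zero on a 503, a failed dependency causes the container health check to fail as well. Keep each probe fast and bounded by its own timeout so the endpoint answers within the 5-second health check timeout.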

2. Service-level monitoring with CloudWatch

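ECS publishes CPUUtilization and MemoryUtilization to CloudWatch automatically in the AWS/ECS namespace. RunningTaskCount is only available once Container Insights is enabled on the cluster, and it is published in the ECS/ContainerInsights namespace. The most important metrics are:

RunningTaskCount: how many tasks are actually running, to compare against the desired count.
CPUUtilization / MemoryUtilization: service-level resource pressure that often precedes crashes.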

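Create a CloudWatch alarm that fires when RunningTaskCount drops below your desired count. RunningTaskCount is a Container Insights metric, so the alarm uses the ECS/ContainerInsights namespace, and --threshold should equal your desired count (with the value 1 shown here, the alarm fires only when no tasks are running):

aws cloudwatch put-metric-alarm \
  --alarm-name "ecs-running-task-count-low" \
  --namespace "ECS/ContainerInsights" \
  --metric-name RunningTaskCount \
  --dimensions Name=ClusterName,Value=prod Name=ServiceName,Value=api \
  --statistic Average \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:your-alerts-topic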

Pair this with a MemoryUtilization > 85% alarm to catch containers approaching OOM before they crash.
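
One way to keep the memory alarm consistent across services is to generate the CLI command from a small helper. The sketch below only builds and prints the command; it does not call AWS. The helper name memory_alarm_command, the alarm naming scheme, and the choice of three evaluation periods are assumptions made for illustration, while the flags themselves are the same put-metric-alarm options used above.

```python
import shlex
from typing import List


def memory_alarm_command(cluster: str, service: str, topic_arn: str,
                         threshold_percent: float = 85.0) -> List[str]:
    """Build an `aws cloudwatch put-metric-alarm` argv for high memory use."""
    return [
        "aws", "cloudwatch", "put-metric-alarm",
        "--alarm-name", f"ecs-{service}-memory-high",
        # CPU and memory utilization live in the default AWS/ECS namespace.
        "--namespace", "AWS/ECS",
        "--metric-name", "MemoryUtilization",
        "--dimensions",
        f"Name=ClusterName,Value={cluster}",
        f"Name=ServiceName,Value={service}",
        "--statistic", "Average",
        "--period", "60",
        # Assumption: require three breaching minutes to avoid noisy alerts.
        "--evaluation-periods", "3",
        "--threshold", str(threshold_percent),
        "--comparison-operator", "GreaterThanThreshold",
        "--alarm-actions", topic_arn,
    ]


command = memory_alarm_command(
    cluster="prod",
    service="api",
    topic_arn="arn:aws:sns:us-east-1:123456789012:your-alerts-topic",
)
# Print a copy-pasteable shell command for review before running it.
print(shlex.join(command))
```

The printed command can be reviewed and run by hand, or the list can be passed to subprocess.run once you are happy with the values.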

3. ALB target group health

If your ECS service sits behind an Application Load Balancer, the target group performs its own HTTP health checks independent of the container health check. A task must pass both to receive traffic.

Monitor the UnHealthyHostCount metric on your target group:

In the console, open EC2 → Target Groups and select the target group attached to your ALB.
Set a CloudWatch alarm when UnHealthyHostCount > 0 for 2 consecutive minutes.
Also monitor HTTPCode_Target_5XX_Count for application-level errors.

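A common misconfiguration is setting the health check path to / when your app returns a redirect (302) at root. An ALB target group treats only 200 as healthy by default, so either point the check at an endpoint that returns 200 or widen the success codes (for example 200-299 or 200,302) to match your endpoint.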
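
To see why a redirect at the root path fails the check, it can help to model how a success-code setting is evaluated. The function below is an illustrative sketch, not AWS code: matches_success_codes is a hypothetical helper that accepts the same three forms the target group setting allows (a single code, a comma-separated list, or a range).

```python
def matches_success_codes(status: int, matcher: str = "200") -> bool:
    """Return True if status falls within an ALB-style success-code string.

    Accepts single codes ("200"), comma-separated lists ("200,302"),
    and ranges ("200-299").
    """
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            if int(low) <= status <= int(high):
                return True
        elif part and int(part) == status:
            return True
    return False


# A 302 at "/" fails the default matcher but passes once 302 is allowed.
print(matches_success_codes(302))             # False
print(matches_success_codes(302, "200,302"))  # True
print(matches_success_codes(204, "200-299"))  # True
```

In most cases, pointing the health check at a dedicated endpoint that returns 200 is cleaner than accepting redirects, because a redirect says nothing about whether the destination works.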

4. Synthetic external monitoring

CloudWatch alarms and task health checks tell you about internal state. They don't tell you whether your service is reachable from the internet — which is what your users actually experience.

External synthetic monitoring requests your public URL from outside your AWS account on a fixed schedule, typically every 60 seconds, and alerts you when it becomes unreachable, regardless of what ECS reports internally. This catches:

CloudFront misconfigurations that make your site unreachable even though ECS is healthy.
DNS propagation failures after Route 53 changes.
Security group or WAF rules that block external traffic.
SSL certificate issues (expired or mismatched certificates) that cause the TLS handshake to fail.
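
The failure modes above surface as different errors on the client side, so a synthetic check is more useful when it reports which layer failed. Here is a minimal standard-library sketch; the function names probe and classify_failure and the category labels are assumptions chosen for illustration, and a hosted monitoring service would run the equivalent from several regions.

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify_failure(exc: BaseException) -> str:
    """Map an exception from an HTTP request to a coarse failure category."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"
    # URLError usually wraps the underlying socket or TLS error in .reason.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, BaseException):
        exc = exc.reason
    if isinstance(exc, ssl.SSLError):
        return "tls"
    if isinstance(exc, socket.gaierror):
        return "dns"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "connection"
    return "unknown"


def probe(url: str, timeout: float = 10.0) -> str:
    """Return "ok" for a 2xx response, otherwise a failure category."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if 200 <= response.status < 300:
                return "ok"
            return f"http_{response.status}"
    except Exception as exc:
        return classify_failure(exc)
```

Run probe("https://your-public-url/api/health") on a schedule from a machine outside your VPC; running it from inside the cluster would miss exactly the DNS, WAF, and certificate problems this layer exists to catch.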

5. Communicating outages via a status page

Internal alerts are necessary but not sufficient. When your ECS service is down, your users are already impacted. Without a public status page, they have no way to know whether the issue is on their end or yours — so they open support tickets and lose trust.

A status page solves this by giving users a single URL to check. It should show:

Current status of each service (operational, degraded, down).
Active incidents with a timeline of investigation and resolution updates.
Historical uptime bars so users can see your track record.
An email subscription option for proactive notifications.

PulseRadar handles all of this automatically. Add your ECS service health endpoint as a monitor, and when consecutive checks fail, an incident is opened and your subscribers are notified — without any manual intervention from your team.
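
The "consecutive failed checks open an incident" behaviour can be pictured as a small state machine. The sketch below is a generic illustration and not PulseRadar's actual implementation; the IncidentTracker class, the event names, and the default threshold of three failures are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class IncidentTracker:
    """Open an incident after N consecutive failures; resolve on recovery."""

    failure_threshold: int = 3  # assumed default, tune per monitor
    consecutive_failures: int = 0
    incident_open: bool = False

    def record(self, check_passed: bool) -> str:
        """Record one check result and return the resulting event."""
        if check_passed:
            self.consecutive_failures = 0
            if self.incident_open:
                self.incident_open = False
                return "incident_resolved"
            return "ok"
        self.consecutive_failures += 1
        if not self.incident_open and self.consecutive_failures >= self.failure_threshold:
            self.incident_open = True
            return "incident_opened"
        return "failing"


tracker = IncidentTracker()
results = [False, False, False, False, True, True]
events = [tracker.record(passed) for passed in results]
print(events)
```

Requiring several consecutive failures trades a little detection delay for far fewer false alarms from single dropped requests. Subscriber notifications would hang off the incident_opened and incident_resolved events.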

Summary: ECS monitoring checklist

Task definition healthCheck with a real dependency probe (not a static 200).
CloudWatch alarm on RunningTaskCount dropping below desired count.
CloudWatch alarm on MemoryUtilization > 85%.
ALB target group alarm on UnHealthyHostCount > 0.
External synthetic monitor pinging your public URL every 60 seconds.
Public status page with incident management and subscriber notifications.

Combining these layers means you hear about ECS service degradation before your users do — and when something does go wrong, you have the tooling to communicate it clearly and rebuild trust.