This article explores unexpected EC2 over-scaling when using ECS with step scaling policies based on a custom CloudWatch metric.
## The Problem
When a custom metric triggered task scaling in ECS, the number of EC2 instances scaled up more than expected. For example, one ECS task would require just one EC2 instance, yet two or more were provisioned. Extra instances would eventually be shut down, but only after 40–50 minutes.
## Root Causes and Findings
- The ECS cluster used a GPU-backed EC2 Auto Scaling Group via a capacity provider.
- The step scaling policy worked correctly at the task level but over-provisioned EC2 instances.
- Over-Scaling from Zero: When scaling from zero EC2 instances, ECS often launches two instances by design. This documented behaviour is intended to prevent cold-start delays, and it explained the initial double-scaling when starting from zero instances.
- The primary issue was the `instanceWarmupPeriod` being set to 0 in the ECS capacity provider. This caused ECS to assume EC2 instances were available before they were fully ready, leading to premature scale-outs (see the sketches after this list).
- CloudWatch Metric Issues: Metric values were being skewed by multiple data points falling in a single period, some reporting zero. The metric was using `Stat: Average`, which diluted the actual workload signal.
- The alarm had its comparison operator set to `GreaterThanOrEqualToThreshold`, causing EC2 to scale out even when the task count didn't increase.
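To verify these findings, a quick inspection of the capacity provider and the alarm configuration can be done with boto3. This is a minimal sketch; the capacity provider and alarm names are placeholders, not values from the original setup.

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

# Hypothetical resource names; substitute your own capacity provider and alarm.
CAPACITY_PROVIDER = "gpu-capacity-provider"
ALARM_NAME = "custom-metric-scale-out"

# Read the capacity provider's managed scaling settings, including instanceWarmupPeriod.
cp = ecs.describe_capacity_providers(capacityProviders=[CAPACITY_PROVIDER])
managed_scaling = cp["capacityProviders"][0]["autoScalingGroupProvider"]["managedScaling"]
print("instanceWarmupPeriod:", managed_scaling.get("instanceWarmupPeriod"))

# Read the alarm's statistic and comparison operator.
alarm = cloudwatch.describe_alarms(AlarmNames=[ALARM_NAME])["MetricAlarms"][0]
print("Statistic:", alarm.get("Statistic"))
print("ComparisonOperator:", alarm["ComparisonOperator"])
```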
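The dilution caused by `Stat: Average` can also be made visible by pulling the same metric with several statistics at once. The namespace and metric name below are placeholders for the custom metric in question.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Hypothetical custom metric; namespace and name are placeholders.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="Custom/App",
    MetricName="PendingJobs",
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=60,
    Statistics=["Average", "Sum", "SampleCount"],
)

# When several data points (some of them zero) land in one period,
# Average is pulled down while Sum and SampleCount reveal the real activity.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Sum"], point["SampleCount"])
```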
## Implemented Fixes
- Warmup Period Fix: The capacity provider was updated to use an `instanceWarmupPeriod` of 300 seconds (5 minutes). This change aligned EC2 readiness timing with ECS scaling decisions. After this change, launching a task resulted in only one EC2 instance being provisioned, as expected (except for the documented initial scaling from zero).
- CloudWatch Metric Fix: Switching to `Stat: Sum` or analyzing `SampleCount` helped expose hidden data points. This ensured the scaling alarm responded to real workload demand instead of averaged-down values.
- Alarm Threshold Fix: Changing the comparison operator from `GreaterThanOrEqualToThreshold` to `GreaterThanThreshold` ensured that scale-out events only occurred when the workload truly required them. A sketch applying these changes follows this list.
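As a rough illustration of how these fixes might be applied with boto3, the snippet below updates the capacity provider's warmup period and recreates the alarm with `Sum` and `GreaterThanThreshold`. The resource names, namespace, metric, threshold, and scaling policy ARN are placeholders (assumptions), not values from the original setup.

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

# Hypothetical names and values; adjust to your own resources.
CAPACITY_PROVIDER = "gpu-capacity-provider"
ALARM_NAME = "custom-metric-scale-out"
SCALING_POLICY_ARN = "arn:aws:autoscaling:..."  # ARN of the existing step scaling policy

# Warmup Period Fix: give new EC2 instances 5 minutes before ECS counts them as ready.
ecs.update_capacity_provider(
    name=CAPACITY_PROVIDER,
    autoScalingGroupProvider={
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,
            "instanceWarmupPeriod": 300,
        }
    },
)

# CloudWatch Metric Fix + Alarm Threshold Fix: use Sum instead of Average and a
# strict GreaterThanThreshold comparison, so averaged-down or merely equal
# values no longer trigger a scale-out.
cloudwatch.put_metric_alarm(
    AlarmName=ALARM_NAME,
    Namespace="Custom/App",          # placeholder namespace
    MetricName="PendingJobs",        # placeholder custom metric
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SCALING_POLICY_ARN],
)
```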
## Additional Considerations
- Managed Scaling: AWS recommends using the "managed scaling" feature of ECS capacity providers, which automatically handles the relationship between task and instance scaling. This can prevent many of the issues described in this article.
- Target Tracking Policies: Consider using target tracking scaling policies instead of step scaling for some use cases. Target tracking maintains a specific metric value (like CPU utilization) and can provide smoother scaling behaviour.
- Managed Termination Protection: Check the capacity provider's "managed termination protection" setting, as this can affect instance termination behaviour. When enabled, it prevents the termination of instances that are running tasks (both managed settings are shown in the sketch after this list).
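For completeness, here is a minimal sketch of creating a capacity provider with managed scaling and managed termination protection enabled. The provider name and Auto Scaling group ARN are placeholders (assumptions).

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical names; substitute your own capacity provider and Auto Scaling group.
response = ecs.create_capacity_provider(
    name="gpu-capacity-provider-v2",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:...",  # ARN of the GPU Auto Scaling group
        # Managed scaling lets ECS drive the Auto Scaling group's desired capacity
        # based on task placement, including the instance warmup period.
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,
            "instanceWarmupPeriod": 300,
        },
        # Managed termination protection keeps ECS from terminating instances that
        # are still running tasks. The Auto Scaling group must have scale-in
        # protection enabled for this setting to be accepted.
        "managedTerminationProtection": "ENABLED",
    },
)
print(response["capacityProvider"]["capacityProviderArn"])
```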