Cost Optimisation in ECS: Integrating Spot Instances at Scale

At Deliveroo, we’re always refining how we scale - especially when it comes to managing compute costs in the cloud. After optimising our Amazon ECS workloads with Reserved Instances and Savings Plans, we saw an opportunity to push further using EC2 Spot Instances, which offer savings of up to 90% compared to On-Demand prices. But Spot Instances come with challenges: their availability can fluctuate, and they can be terminated with just a two-minute warning.

To unlock these savings without compromising service stability, we had to engineer a robust solution across infrastructure, workload qualification, and automation.

The Challenge: Balancing Cost and Reliability

Our ECS infrastructure initially relied entirely on On-Demand EC2 instances, provisioned through Auto Scaling Groups (ASGs) connected to ECS Capacity Providers. While reliable, this approach didn’t take advantage of AWS’s surplus compute capacity.
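
For context, a minimal sketch of that baseline wiring is below, using boto3; the capacity provider name, cluster name, and ASG ARN are placeholders for illustration rather than our actual configuration.

```python
# Sketch: attach an On-Demand Auto Scaling Group to an ECS cluster
# through a capacity provider. Names and ARNs are hypothetical.
import boto3

ecs = boto3.client("ecs")

# Register the ASG as a capacity provider with managed scaling, so ECS
# adjusts the ASG's desired capacity to match task demand.
ecs.create_capacity_provider(
    name="on-demand-cp",  # hypothetical capacity provider name
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:eu-west-1:123456789012:"
        "autoScalingGroup:example-id:autoScalingGroupName/ecs-on-demand-asg",
        "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
        "managedTerminationProtection": "ENABLED",
    },
)

# Associate the capacity provider with the cluster and make it the default.
ecs.put_cluster_capacity_providers(
    cluster="my-ecs-cluster",  # hypothetical cluster name
    capacityProviders=["on-demand-cp"],
    defaultCapacityProviderStrategy=[{"capacityProvider": "on-demand-cp", "weight": 1}],
)
```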

We aimed to layer Spot Instances into our clusters, but selectively. Our goal was clear: route only eligible workloads to Spot capacity while ensuring no service degradation during unexpected terminations.

Spot Instances: Power and Pitfalls

Spot Instances provide dramatic cost reductions, but they introduce operational caveats: capacity comes and goes as AWS’s spare supply fluctuates, and instances can be reclaimed with only two minutes’ notice.

To avoid introducing fragility into our stack, we established technical eligibility criteria that workloads must meet before being scheduled on Spot.

Defining Spot Eligibility

We formalised a set of technical criteria to assess whether a workload could safely tolerate Spot interruptions.

These criteria allowed us to systematically identify workloads that can tolerate interruptions without user impact.
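
As a rough illustration of applying such criteria systematically (the article doesn’t prescribe a mechanism, so the tag key and value below are assumptions for illustration only), eligibility could be recorded as a tag on each ECS service and filtered on:

```python
# Hypothetical sketch: identify Spot-eligible ECS services via a service tag.
# The "spot-eligible" tag is an assumed convention, not Deliveroo's mechanism.
import boto3

ecs = boto3.client("ecs")

def spot_eligible_services(cluster: str) -> list[str]:
    """Return ARNs of services in `cluster` tagged as Spot-eligible."""
    eligible = []
    paginator = ecs.get_paginator("list_services")
    for page in paginator.paginate(cluster=cluster):
        for service_arn in page["serviceArns"]:
            tags = ecs.list_tags_for_resource(resourceArn=service_arn)["tags"]
            if any(t["key"] == "spot-eligible" and t["value"] == "true" for t in tags):
                eligible.append(service_arn)
    return eligible
```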

Introducing the Spot Placement Score (SPS) Metric

To make smarter placement decisions, we introduced a Spot Placement Score (SPS) metric. It reflects how likely we are to acquire Spot Instances for a specific configuration of instance types, Availability Zones, and target capacity.

We derive this score using AWS’s Spot Placement Score API, which provides insight into AWS’s internal view of instance availability. We calculate a custom SPS for each cluster configuration and emit it as a metric to track Spot capacity health over time.
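
A minimal sketch of that flow is below, assuming boto3; the instance types, target capacity, metric namespace, and cluster dimension are illustrative placeholders rather than our production configuration.

```python
# Sketch: query the Spot Placement Score API and emit the result as a
# custom CloudWatch metric. Configuration values are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

# Ask AWS how likely a Spot request for this configuration is to succeed.
response = ec2.get_spot_placement_scores(
    InstanceTypes=["m5.xlarge", "m5a.xlarge", "m6i.xlarge"],  # example diversified set
    TargetCapacity=50,                 # example capacity, in instances
    TargetCapacityUnitType="units",
    SingleAvailabilityZone=False,
    RegionNames=["eu-west-1"],
)

# Scores range from 1 (unlikely) to 10 (very likely); keep the best one returned.
best_score = max((s["Score"] for s in response["SpotPlacementScores"]), default=0)

# Emit it as a custom metric so we can alert and automate on Spot capacity health.
cloudwatch.put_metric_data(
    Namespace="ECS/SpotCapacity",  # hypothetical namespace
    MetricData=[{
        "MetricName": "SpotPlacementScore",
        "Dimensions": [{"Name": "Cluster", "Value": "my-ecs-cluster"}],
        "Value": float(best_score),
    }],
)
```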

Automating Spot Adjustments with Fine-Grained Controls

To manage Spot usage dynamically, we built automation around this score that responds to SPS changes in near real time.

This is achieved via a global “Spot percentage” setting applied through our ECS capacity providers, which controls the ratio of tasks placed on Spot versus On-Demand (e.g. 70/30). Adjusting this percentage lets us dial Spot usage up or down with minimal effort.
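
As a rough illustration (the thresholds and percentages here are invented, not the ones we run in production), the mapping from SPS to a Spot percentage can be as simple as a step function that dials Spot down as placement scores degrade:

```python
# Hypothetical sketch: map a Spot Placement Score (1-10) to a target Spot
# percentage. Thresholds and percentages are illustrative assumptions.
def target_spot_percentage(sps: int, max_spot_pct: int = 30) -> int:
    """Return the percentage of tasks to place on Spot for a given SPS."""
    if sps >= 8:          # healthy Spot capacity: use the full allowance
        return max_spot_pct
    if sps >= 5:          # capacity getting tighter: halve Spot usage
        return max_spot_pct // 2
    return 0              # poor placement outlook: fall back to On-Demand only
```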

Scalable Architecture: Modular Automation

To scale this across environments, we built a resilient, self-healing system that continuously evaluates and applies Spot configurations.

We leverage ECS capacity provider strategies to control Spot usage per service. These strategies define how tasks are distributed across capacity providers, such as On-Demand and Spot, by assigning weights: for example, 70% On-Demand and 30% Spot. ECS uses these weights to decide where each task should be placed.
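
A weighted strategy of that shape might be applied per service roughly as in the sketch below (boto3; the cluster, service, and capacity provider names are placeholders). Changing a service’s capacity provider strategy requires forcing a new deployment.

```python
# Sketch: apply a 70/30 On-Demand/Spot split to one service.
# Cluster, service, and capacity provider names are illustrative placeholders.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="my-ecs-cluster",
    service="my-spot-eligible-service",
    capacityProviderStrategy=[
        # Weights control the ratio of tasks per provider; "base" pins a
        # minimum number of tasks to On-Demand so the service never runs
        # entirely on Spot.
        {"capacityProvider": "on-demand-cp", "weight": 70, "base": 1},
        {"capacityProvider": "spot-cp", "weight": 30},
    ],
    forceNewDeployment=True,  # a new deployment is required when the strategy changes
)
```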

Per-Service Granularity

Rather than enforce a blanket rule across all services, we opted for per-service strategy definitions, so Spot adoption can be tuned to each service’s tolerance for interruption.

This granularity gives teams confidence in adoption without compromising uptime.

Real-World Impact

Even with a cautious 5-10% Spot adoption per cluster, we’ve achieved meaningful compute cost reductions without any observable impact on service availability or latency.

As our internal confidence, observability, and tooling mature, we’ll progressively increase this percentage and expand Spot adoption without sacrificing safety.

Key Takeaways

If you’re managing a large ECS footprint, Spot Instances can be a major lever for compute cost optimisation—but only with the right guardrails in place. With objective eligibility criteria, an automation framework, and real-time placement scoring, Spot can become a safe, reliable, and cost-effective part of your ECS strategy.

We’re hiring in Tech. If you are interested in a position at Deliveroo, please visit deliveroo.com/careers.


About Aakash Singhal

Aakash Singhal is a Senior Platform Engineer at Deliveroo with around a decade of experience in cloud infrastructure and platform engineering. He’s part of the Infrastructure team, where he focuses on ECS capacity management, platform reliability, and deployment efficiency. Aakash is passionate about DevOps, automation, simplifying complex systems, and all things Linux.