A lot of legacy platform pain comes from workloads that ended up in the wrong runtime and then got comfortable there.

They start as helpers, crawlers, distributors, report builders, importers, or background jobs bolted onto the main stack because it is easy. Then they grow teeth. They hold memory, run too long, compete with the application, and force the platform to scale around their worst day instead of the application’s normal day.

At that point the question is not “how do we tune this thing?” It is “why is this thing still living here?”

The smell is usually obvious

You can spot the wrong-fit workloads pretty quickly.

They usually have some combination of these traits:

  • they run much longer than the web or API tier next to them
  • they need different CPU or memory sizing
  • they care about retry semantics more than request latency
  • they can tolerate queueing, but not shared-runtime interference
  • they generate noisy failure modes that muddy application observability

If that sounds familiar, you are probably not looking at an application concern anymore. You are looking at an execution-boundary problem.

Shared runtime models get expensive twice

When unrelated workloads share the same runtime model, you pay twice.

First you pay in infrastructure. The service gets sized for the heavy worker instead of the ordinary request path. That pushes everyone toward always-on compute, fatter instances, and compromise sizing that suits neither workload.

Then you pay in operations. Deployments get riskier. Logs get noisier. Scaling signals get less honest. A problem in the worker can look like an application issue, and an application deployment can accidentally disturb the worker.

That is why some modernisation work is less about rewriting code and more about correcting where the code runs.

The target shape is usually simple

For bursty or operationally distinct workloads, I usually want a shape closer to this:

producer / scheduler / API event
        -> SQS queue or EventBridge event
        -> thin launcher (optional)
        -> ECS RunTask on Fargate or EC2 capacity
        -> task-specific logs, metrics, and exit status
        -> retry / DLQ / alerting path

The exact trigger depends on the workload.

  • Scheduled work can start from EventBridge Scheduler.
  • Queued work usually fits SQS with a launcher or consumer pattern.
  • Integration events may start directly from EventBridge.
  • Batch orchestration may need Step Functions if the workflow actually has state.

The point is not to force everything into one AWS product. The point is to stop pretending a long-running side process belongs inside the main service forever.

Event-driven ECS tasks are a good fit when the work is discrete

The model gets cleaner when each unit of work has a natural boundary.

A task receives input, does the work, emits logs and metrics, and exits. That gives you immediate gains:

  • compute exists only when there is work
  • scaling follows demand more honestly
  • CPU and memory sizing can match the worker instead of the web tier
  • failures are isolated to task runs instead of poisoning a permanent service
  • deployments become less entangled

That does not make the workload good by magic. It just gives it an execution model that tells the truth.

The implementation details matter

This pattern gets messy when teams treat task launch as the architecture instead of one part of it.

What actually matters:

1. Idempotency

If an event is retried or delivered more than once, the task must not blindly repeat the work unless repeating it is safe. You need a job key, state check, or dedupe mechanism somewhere.
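One way to sketch the job-key idea in Python. The names here (job_key, InMemoryJobStore, should_run) are hypothetical, and the in-memory set only stands in for a durable store that supports an atomic "create if absent" write; a real deployment would need that durability, which this sketch does not provide.

```python
import hashlib
import json


def job_key(job_type: str, payload: dict) -> str:
    """Derive a stable key: the same logical job always hashes the same."""
    # Sorted keys and fixed separators make the JSON canonical.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{job_type}:{canonical}".encode("utf-8")).hexdigest()


class InMemoryJobStore:
    """Hypothetical stand-in for a durable store with atomic create-if-absent."""

    def __init__(self) -> None:
        self._claimed: set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True only for the first caller to claim this key."""
        if key in self._claimed:
            return False
        self._claimed.add(key)
        return True


def should_run(store: InMemoryJobStore, job_type: str, payload: dict) -> bool:
    """Decide whether this delivery is the first one for the job."""
    return store.claim(job_key(job_type, payload))
```

A redelivered event produces the same key, so the second claim fails and the work is skipped rather than repeated.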

2. Concurrency limits

If the trigger path can flood the queue, you need explicit control over concurrency. Otherwise “event driven” turns into “self-inflicted denial of wallet”.
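A minimal sketch of an explicit cap, assuming the launcher can see how many tasks are queued and how many are already running. The function name and the max_per_tick parameter are hypothetical; how you obtain the counts is up to your platform.

```python
def tasks_to_launch(
    queued: int,
    running: int,
    max_concurrent: int,
    max_per_tick: int = 10,
) -> int:
    """Return how many new tasks to start without exceeding the cap.

    queued:         jobs waiting for a task
    running:        tasks currently in flight
    max_concurrent: hard ceiling on simultaneous tasks
    max_per_tick:   ceiling on launches per scheduling pass (assumed value)
    """
    headroom = max(0, max_concurrent - running)
    return max(0, min(queued, headroom, max_per_tick))
```

With a cap of 20 and 18 tasks running, a backlog of 500 jobs still starts only 2 more. The queue absorbs the burst instead of the bill.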

3. Timeouts and retries

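Task-level retries are not enough on their own. Decide which layer owns each retry: what the queue redelivers, what the launcher re-attempts, and what the worker retries internally. Give each layer a timeout to match, so a hung task fails visibly instead of running indefinitely.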
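Here is a rough sketch of one retry policy living in one place. The exit-code convention (0 for success, 75 for a transient failure, anything else permanent) is an assumption made for illustration, not something ECS defines, and the attempt limit and backoff values are placeholders.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    DONE = "done"
    RETRY = "retry"
    DEAD_LETTER = "dead_letter"


# Assumed convention: the worker exits with 75 when a retry might help.
TRANSIENT_EXIT_CODE = 75


def decide(exit_code: Optional[int], attempt: int, max_attempts: int = 3) -> Outcome:
    """Map a finished task run to the next step for its job."""
    if exit_code == 0:
        return Outcome.DONE
    # No exit code is treated as transient here (for example, the task never started).
    transient = exit_code is None or exit_code == TRANSIENT_EXIT_CODE
    if transient and attempt < max_attempts:
        return Outcome.RETRY
    return Outcome.DEAD_LETTER


def backoff_seconds(attempt: int, base: int = 30, cap: int = 900) -> int:
    """Exponential backoff for the next attempt, capped at `cap` seconds."""
    return min(cap, base * 2 ** (attempt - 1))
```

The worker only has to report honestly through its exit code; the decision about whether to retry stays outside it.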

4. Payload design

Do not jam large blobs into the event body if a stable reference will do. Pass object keys, record IDs, or job descriptors instead.
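As an illustration, a job descriptor can carry references only and refuse to grow past a size budget. The field names and the 8 KiB budget below are hypothetical choices for the sketch.

```python
import json
from dataclasses import asdict, dataclass

# Hypothetical budget; pick one well under your transport's message limit.
MAX_EVENT_BYTES = 8 * 1024


@dataclass(frozen=True)
class JobDescriptor:
    job_id: str
    job_type: str
    input_bucket: str  # where the real data lives
    input_key: str     # reference to the object, not the object itself


def to_event_body(job: JobDescriptor) -> str:
    """Serialise the descriptor, refusing anything over the size budget."""
    body = json.dumps(asdict(job), sort_keys=True)
    if len(body.encode("utf-8")) > MAX_EVENT_BYTES:
        raise ValueError("event body too large; pass a reference instead")
    return body


def from_event_body(body: str) -> JobDescriptor:
    """Rebuild the descriptor inside the worker."""
    return JobDescriptor(**json.loads(body))
```

The worker fetches the actual input using the bucket and key, so the event stays small and replayable.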

5. Observability

A disposable task still needs first-class logs, metrics, correlation IDs, and failure alerts. Otherwise it just disappears in a more modern way.
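A small sketch of what first-class logging could look like inside the worker: one JSON line per event on stdout, always carrying the correlation ID. The function and field names are hypothetical.

```python
import json
import time


def log_event(correlation_id: str, job_id: str, event: str, **fields) -> str:
    """Emit one structured log line to stdout and return it."""
    record = {
        "ts": round(time.time(), 3),
        "correlation_id": correlation_id,  # carried over from the triggering event
        "job_id": job_id,
        "event": event,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


# Example: mark the start and the outcome of a run.
log_event("req-7f3a", "job-1", "task_started")
log_event("req-7f3a", "job-1", "task_finished", exit_status=0, duration_s=41.2)
```

Because every line shares the same correlation ID, a single failed run can be traced from the trigger to the exit status without guessing.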

A practical task definition is boring on purpose

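I want the task definition to be explicit and predictable. For example (the execution role ARN is a placeholder; Fargate needs one to pull from ECR and write to CloudWatch Logs):

{
  "family": "crawler-job",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",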
  "networkMode": "awsvpc",
  "cpu": "1024",
  "memory": "2048",
  "requiresCompatibilities": ["FARGATE"],
  "containerDefinitions": [
    {
      "name": "crawler",
      "image": "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/crawler:2026-05-20",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/crawler-job",
          "awslogs-region": "ap-southeast-2",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "environment": [
        {"name": "JOB_MODE", "value": "crawl"}
      ]
    }
  ]
}

Nothing exotic there. Good. It should be boring.

The value comes from the boundary around it: how it is triggered, how concurrency is controlled, how it reports status, and how it fails.
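To show the trigger side of that boundary, here is a sketch that builds the arguments for an ECS RunTask call against the crawler-job family. The cluster name, subnet and security group IDs, and the JOB_ID and INPUT_KEY variables are assumptions for illustration. The function only builds a dictionary, so it can be inspected without AWS credentials.

```python
def build_run_task_request(
    cluster: str,
    subnets: list,
    security_groups: list,
    job_id: str,
    input_key: str,
) -> dict:
    """Build keyword arguments for an ECS RunTask call for one job."""
    return {
        "cluster": cluster,
        "taskDefinition": "crawler-job",
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                # Assumes private subnets with a route to ECR and CloudWatch Logs.
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": "crawler",  # must match the container name in the task definition
                    "environment": [
                        {"name": "JOB_ID", "value": job_id},
                        {"name": "INPUT_KEY", "value": input_key},
                    ],
                }
            ]
        },
    }
```

With boto3 installed, the result would typically be passed as boto3.client("ecs").run_task(**request). Per-job values travel as overrides, so the task definition itself stays static and boring.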

A thin launcher is often worth it

People sometimes want to wire every event straight into ECS. Sometimes that is fine. Sometimes it is lazy.

A thin launcher function or service can do useful work before RunTask:

  • validate the payload
  • reject duplicate work
  • enrich the task environment
  • choose task size based on job type
  • cap concurrency
  • route failures to a dead-letter path

If you need those behaviours, add the launcher. If you do not, skip it. The point is to be deliberate.
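A deliberately small launcher decision might look like the sketch below. It covers only validation, sizing by job type, and rejecting bad input; the required fields, job types, and size table are hypothetical, and the cpu and memory strings are meant to be passed as task-level overrides on the RunTask call.

```python
# Hypothetical size table: job type -> (cpu units, memory MiB) as strings.
TASK_SIZES = {
    "crawl": ("1024", "2048"),
    "report": ("2048", "4096"),
}

REQUIRED_FIELDS = ("job_id", "job_type", "input_key")


def plan_launch(payload: dict) -> dict:
    """Validate an event payload and choose a task size for it.

    Returns an action of "launch" with size overrides, or "reject" with a
    reason that the caller can attach when routing to a dead-letter path.
    """
    missing = [field for field in REQUIRED_FIELDS if not payload.get(field)]
    if missing:
        return {"action": "reject", "reason": "missing fields: " + ", ".join(missing)}

    size = TASK_SIZES.get(payload["job_type"])
    if size is None:
        return {"action": "reject", "reason": "unknown job_type: " + str(payload["job_type"])}

    cpu, memory = size
    return {"action": "launch", "overrides": {"cpu": cpu, "memory": memory}}
```

Everything here runs before any compute is started, which is the whole argument for the launcher: junk events get rejected for the price of a function call, not a task.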

Cost improvement is real, but it is not automatic

Yes, event-driven tasks can reduce waste. They also make it easier to isolate heavy workloads and stop paying for idle compute.

But this is where people lie to themselves.

If the task is noisy, retried badly, oversized, or triggered by junk events, the cost story gets ugly fast. You still need queue hygiene, sane timeouts, and realistic task sizing. “Serverless-ish” does not mean cheap by default.

If cost is already a structural problem on your platform before you get to execution models, that is a different starting point: cost optimisation as a platform design problem covers the runtime and lifecycle habits behind it.
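The arithmetic behind that warning is simple enough to sketch. The function names are hypothetical and the rates are parameters you supply from your own pricing; nothing below encodes real AWS prices.

```python
def compute_cost(
    vcpu: float,
    memory_gb: float,
    seconds: float,
    vcpu_hour_rate: float,
    gb_hour_rate: float,
) -> float:
    """Cost of running one size of compute for a number of seconds."""
    hours = seconds / 3600
    return (vcpu * vcpu_hour_rate + memory_gb * gb_hour_rate) * hours


def task_seconds(runs: int, avg_seconds: float, avg_attempts: float = 1.0) -> float:
    """Total billed seconds across runs, including retried attempts."""
    return runs * avg_seconds * avg_attempts
```

Compare an always-on service (every second in the month) with task_seconds for your real run count. Then double the task size and triple avg_attempts to see how quickly bad retries and oversizing eat the saving.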

Final point

A lot of modernisation work gets framed as code transformation. Sometimes the bigger win is operational shape.

If a workload is long-running, resource-hungry, and constantly awkward inside the main stack, there is a good chance it should not live there anymore. Moving it into event-driven ECS tasks will not solve every problem, but it often gives you cleaner scaling, cleaner failure boundaries, and far fewer arguments with your own platform.