When cost optimisation is a platform design problem

Most cloud cost problems do not start in finance. They start in design.

By the time someone asks for a cost review, the expensive part usually happened months earlier. A workload was sized for convenience. A non-production environment was left running because nobody wanted to automate sleep and wake. A queue consumer was kept alive 24/7 because that felt simpler than admitting it only had work for short bursts. Commitments were bought before anyone had a clean view of what was steady and what was noise.

None of that looks dramatic in the moment. It just slowly becomes normal.

That is why I do not treat cost optimisation as a side quest. In a mature estate, it is part of platform design.

Bad cost stories usually begin as convenience

The expensive patterns are rarely sophisticated.

A team wants fast delivery, so a heavy environment stays online all day whether anyone is using it or not. A worker that only needs to run on demand gets bundled into a permanent service. A proof of concept gets promoted into production gravity without ever getting an expiry path.

Each decision makes sense in isolation. Taken together, they produce a platform that is easy to waste money on.

The common thread is not bad intent. It is weak lifecycle design.

Always-on compute is usually the first thing I distrust

A lot of cost work gets framed as a purchasing exercise: tune Savings Plans, review Reserved Instances, shuffle storage classes, maybe delete a few snapshots, and call it optimisation.

That work matters, but it is downstream.

If a workload only needs compute when there is work to do, then keeping it alive all day is not a finance problem. It is an architecture problem. If non-production environments drift into permanent existence, that is not a visibility problem. It is a platform control problem.

Before buying anything, I want to know:

what has to run continuously?
what can run on demand?
what should expire automatically?
what is oversized because nobody trusts the boundary?
what is stable enough for commitments and what is not?

If those answers are fuzzy, the cost model is already compromised.

The data loop matters more than the dashboard

Dashboards are useful. They are not enough.

I want a cost data loop that can answer real questions, not just draw a sad graph.

At minimum, that means:

CUR or equivalent billing export landing somewhere queryable
tagging and account metadata that are good enough to group by owner, environment, and service
a way to separate steady-state cost from temporary bursts
regular review of commitment coverage, utilisation, and waste categories
reporting that points to action, not just surprise

A lightweight query in Athena tells you more than most executive dashboards if the underlying metadata is clean:

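-- Monthly unblended cost by account, environment tag, and service.
-- The database, table, and tag column names depend on how the CUR export is set up.
SELECT
  line_item_usage_account_id,
  resource_tags_user_environment AS environment,
  product_product_name,
  SUM(line_item_unblended_cost) AS cost
FROM cur_db.cur_table
WHERE bill_billing_period_start_date = DATE '2026-05-01'
GROUP BY 1, 2, 3
ORDER BY cost DESC;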

Nothing glamorous there. Good. Cost work should not need glamour.
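
The steady-state versus burst split from the list above can be just as plain. Here is a minimal Python sketch, assuming hourly cost totals have already been pulled out of the billing export into a list. The function name `split_baseline` and the 10th-percentile default are illustrative assumptions, not a standard method.

```python
from statistics import quantiles


def split_baseline(hourly_cost, baseline_percentile=10):
    """Split hourly cost into a steady baseline and the burst spend above it.

    Assumption: the baseline is a low percentile of hourly cost, so a few
    quiet hours do not drag it down and bursts do not push it up.
    """
    if len(hourly_cost) < 2:
        raise ValueError("need at least two hourly data points")
    if not 1 <= baseline_percentile <= 99:
        raise ValueError("baseline_percentile must be between 1 and 99")

    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = quantiles(hourly_cost, n=100, method="inclusive")
    baseline = cuts[baseline_percentile - 1]

    # Spend at or below the baseline counts as steady, the rest as burst.
    steady = sum(min(cost, baseline) for cost in hourly_cost)
    burst = sum(max(cost - baseline, 0.0) for cost in hourly_cost)
    return baseline, steady, burst


if __name__ == "__main__":
    # Hypothetical day: 20 quiet hours and 4 burst hours.
    baseline, steady, burst = split_baseline([10.0] * 20 + [50.0] * 4)
    print(f"baseline/hour={baseline} steady={steady} burst={burst}")
```

The percentile is a judgement call. A lower value gives a more conservative baseline, which matters later if the number feeds a commitment decision.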

Cost control needs operating rules

Most teams can spot high spend after it happens. Fewer teams are set up to prevent it cleanly.

That usually requires a few boring but important controls:

account and environment lifecycle rules
expiry or sleep rules for short-lived environments
ownership that is specific enough to act on
guardrails around what can be left running
standard compute shapes instead of endless hand-built snowflakes
automation that makes the cheap path easier than the lazy path

Without that, cost review becomes a monthly ritual where everyone agrees the numbers are bad and then carries on exactly as before.
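
A guardrail of this kind does not need to be clever. The sketch below, in Python, checks one resource's tags against a small rule set. The tag keys follow the metadata used elsewhere in this post; `REQUIRED_TAGS`, `SHORT_LIVED`, and `tag_violations` are hypothetical names, and the rules themselves are assumptions to adapt rather than a recommended policy.

```python
# Hypothetical rule set: every resource needs an owner and an environment,
# and short-lived environments must also declare when they expire.
REQUIRED_TAGS = ("owner", "environment")
SHORT_LIVED = {"sandbox", "preview", "dev"}


def tag_violations(tags):
    """Return a list of human-readable problems for one resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]

    environment = tags.get("environment")
    if environment in SHORT_LIVED and not tags.get("expires_after"):
        problems.append("short-lived environment without expires_after")
    return problems


if __name__ == "__main__":
    print(tag_violations({"owner": "team-search", "environment": "sandbox"}))
```

Run at provisioning time, a check like this blocks the problem. Run on a schedule, it at least produces a list with names attached.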

Non-production is where a lot of money goes to die

This is the bit people keep pretending is minor.

Non-production waste is often structural. Environments exist for too long. Data stores are sized for an imaginary peak. Batch jobs keep permanent workers alive. Nobody wants to be the person who turns something off and gets blamed later, so the safest political move is to leave everything on.

That is why I like explicit control patterns:

environment: sandbox
owner: team-search
expires_after: 72h
sleep_schedule: "20:00-07:00 local"
auto_delete: false

That metadata is enough to drive automation. Without metadata, every cost conversation becomes manual archaeology.
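
To show how little is needed, here is a minimal Python sketch that turns the metadata above into a decision. It assumes `expires_after` is always a whole number of hours such as `72h`, that `sleep_schedule` is a `HH:MM-HH:MM` window in the environment's local time, and that `auto_delete: false` means an expired environment is flagged rather than removed. The function names and the returned action strings are hypothetical.

```python
from datetime import datetime, time, timedelta


def parse_hours(value):
    """Parse a duration such as '72h' into a timedelta (hours only)."""
    if not value.endswith("h"):
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(hours=int(value[:-1]))


def in_sleep_window(schedule, now):
    """True if local time `now` falls inside a window like '20:00-07:00 local'."""
    start_text, end_text = schedule.split()[0].split("-")
    start, end = time.fromisoformat(start_text), time.fromisoformat(end_text)
    current = now.time()
    if start <= end:
        return start <= current < end
    # The window wraps past midnight, e.g. 20:00 to 07:00.
    return current >= start or current < end


def desired_action(meta, created_at, now):
    """Return 'delete', 'flag_expired', 'sleep', or 'run' for one environment.

    `created_at` and `now` are naive datetimes in the environment's local time.
    """
    if now - created_at >= parse_hours(meta["expires_after"]):
        return "delete" if meta.get("auto_delete") else "flag_expired"
    if in_sleep_window(meta["sleep_schedule"], now):
        return "sleep"
    return "run"


if __name__ == "__main__":
    meta = {
        "environment": "sandbox",
        "owner": "team-search",
        "expires_after": "72h",
        "sleep_schedule": "20:00-07:00 local",
        "auto_delete": False,
    }
    created = datetime(2026, 5, 1, 9, 0)
    print(desired_action(meta, created, datetime(2026, 5, 1, 23, 30)))
```

Expiry is checked before the sleep window on purpose: an environment past its lifetime should surface as expired, not quietly go back to sleep every night.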

Commitment optimisation still matters, but only after the basics

I am not dismissing Savings Plans or Reserved Instances. Managed properly, they can make a real difference.

The catch is simple: commitments work best when the platform is already legible.

If you do not know which workloads are stable, which ones are temporary, and which ones are likely to change shape, commitment purchasing becomes guesswork with a confident spreadsheet wrapped around it. Sometimes that still saves money. Sometimes it just locks in bad assumptions.

My default bias is simple:

clean up obvious waste first
understand the steady-state baseline
commit only against the part of demand you trust
review coverage and utilisation regularly instead of treating the purchase as a victory lap
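
The trade-off behind "commit only against the part of demand you trust" can be sketched in a few lines of Python. This is a deliberately simplified model: it assumes a flat hourly commitment expressed in the same unit as hourly usage and ignores discount rates, so the numbers show the shape of the trade-off, not what a real bill would say. `commitment_metrics` is a hypothetical helper.

```python
def commitment_metrics(hourly_usage, commitment):
    """Return (coverage, utilisation) for a flat hourly commitment.

    coverage    = share of total usage that the commitment absorbs
    utilisation = share of the commitment that is actually used
    """
    if commitment <= 0:
        raise ValueError("commitment must be positive")
    if not hourly_usage:
        raise ValueError("need at least one hour of usage")

    # In each hour, the commitment covers usage up to the committed amount.
    covered = sum(min(usage, commitment) for usage in hourly_usage)
    total = sum(hourly_usage)

    coverage = covered / total if total else 0.0
    utilisation = covered / (commitment * len(hourly_usage))
    return coverage, utilisation


if __name__ == "__main__":
    # Hypothetical day: 20 steady hours and 4 burst hours.
    usage = [10.0] * 20 + [50.0] * 4
    for level in (10.0, 50.0):
        coverage, utilisation = commitment_metrics(usage, level)
        print(f"commit={level}: coverage={coverage:.2f} utilisation={utilisation:.2f}")
```

Committing at the steady level keeps utilisation at 100% while covering only part of the spend. Committing at the peak covers everything and leaves most of the commitment idle. The trusted baseline is where the first number stays high without the second collapsing.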

The best savings are usually operational

The strongest savings I have seen usually came from operational changes, not cosmetic cleanup.

Move a heavy worker into on-demand tasks (the execution model changes that make this practical are covered in modernising long-running workloads with event-driven ECS tasks). Put expiry rules around short-lived environments. Fix ownership. Standardise image lifecycles. Reduce idle EC2 where burst execution is enough. Stop treating non-production as free space where discipline does not matter.

Those changes do more than lower spend. They make the platform easier to run.

That is the part people miss. Good cost optimisation is rarely about being cheap. It is about building a platform where waste is not the default setting.

Final point

If a cloud estate is expensive, the useful question is not “what can we turn off this month?” It is “what have we designed badly enough that waste became normal?”

That question usually leads to better boundaries, better lifecycle rules, and lower spend at the same time.
