Selected work · case studies · public proof

Public-facing proof for the kind of AWS platform problems I solve.

This is the sanitized version of the work: specific enough to show judgment, stripped back enough to stay public-safe, and written for buyers who care more about outcomes than buzzwords.

Scope: Governance, cost discipline, workload modernisation, reliability, and platform operating model cleanup.
Pattern: The underlying problem is usually not lack of tooling. It is blurry ownership, weak runtime defaults, or controls in the wrong layer.

How to read these

These are not polished marketing fairy tales.

They are short narrative slices: the environment shape, the real failure mode, what changed, and why the result mattered. The names are softened where they need to be, but the engineering pattern is real.

Flagship case study · governance operating model

Governance and compliance execution in regulated AWS estates

The visible issue was compliance drift. The underlying failure was that controls, ownership, and reporting had decoupled from each other, so the estate looked governed on paper while behaving ambiguously in practice.

Environment shape: Multi-account AWS estate in a regulated setting with platform, security, and delivery teams all touching the control surface.
Failure pattern: Preventive controls were blunt, detective controls were noisy, and exceptions lived in tickets, memory, and side conversations.
Actual intervention: Reworked the operating model around guardrails, reporting, escalation paths, and ownership boundaries so findings could land somewhere actionable.

This kind of estate usually does not fail because nobody cares about governance. It fails because the control system has become detached from runtime reality. Findings exist, dashboards exist, owners exist in theory, but the path from “something drifted” to “someone corrects it” is full of ambiguity.

Situation

The estate had the classic shape: multiple accounts, regulated requirements, a growing stack of guardrails, detective findings, and reporting outputs. On the surface that looks mature. Under load it often behaves like three overlapping systems that never fully agree with each other.

Problem beneath the problem

The main issue was not a missing policy or a missing dashboard. It was operating-boundary confusion. Platform owned some controls, security interpreted others, delivery teams triggered the underlying findings, and exceptions passed between all three without a clean accountability path.

Why the old model stalled

Preventive controls were often too broad to be useful, detective controls generated noise without enough routing context, and evidence collection made audit conversations possible without making remediation reliable. The system was legible to governance people but not operable for delivery teams.

What changed

The work moved the control surface closer to an actual operating model: tighten where preventive controls belong, route detective findings to named owners, make exceptions explicit instead of tribal, and align reporting with accountable action rather than with abstract status visibility.

Before · paper-compliant, operationally blurry
Control fires

A preventive or detective control trips, but context is thin and the system does not distinguish well between breach, exception, and noise.

Finding lands in reporting

Dashboards and reports make drift visible, but they do not make the next owner or next action obvious.

Ownership diffuses

Platform, security, and delivery each see a partial problem, so findings bounce or sit while exception logic lives in tickets and memory.

After · control behaviour tied to owners
Control intent clarified

Preventive controls are used where they genuinely protect boundaries; detective controls are tuned around actionable routing rather than raw volume.

Findings map to accountable owners

Reporting carries enough context that the issue lands with the team responsible for the boundary, workload, or exception path.

Evidence supports action

The estate becomes easier to run because compliance evidence, escalation, and remediation now reinforce each other instead of competing.
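
The routing idea in the "after" flow can be sketched in a few lines. This is a minimal illustration, not the implementation used in the engagement: the `Finding` and `ExceptionRecord` shapes, the account-to-owner map, the control name, and the `platform-triage` fallback are all hypothetical. The point it shows is that a finding resolves to one of three explicit states (excepted, breach, unowned), each with a named owner, and that exceptions live in a register with an expiry date rather than in tickets and memory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Finding:
    control_id: str
    account_id: str
    resource: str


@dataclass(frozen=True)
class ExceptionRecord:
    control_id: str
    account_id: str
    approved_by: str
    expires: date


# Hypothetical ownership map: AWS account ID -> accountable team.
ACCOUNT_OWNERS = {
    "111111111111": "platform",
    "222222222222": "payments-delivery",
}

# Hypothetical exception register: explicit, attributable, time-boxed.
EXCEPTIONS = [
    ExceptionRecord("s3-public-read", "222222222222", "security", date(2030, 1, 1)),
]


def route(finding, today, owners=ACCOUNT_OWNERS, exceptions=EXCEPTIONS):
    """Return (status, owner) so every finding lands with a named team."""
    for exc in exceptions:
        same_scope = (exc.control_id, exc.account_id) == (
            finding.control_id,
            finding.account_id,
        )
        # An expired exception no longer applies, so the finding is a breach.
        if same_scope and exc.expires >= today:
            return ("excepted", exc.approved_by)
    owner = owners.get(finding.account_id)
    if owner is None:
        # No owner on record: surface the gap instead of dropping the finding.
        return ("unowned", "platform-triage")
    return ("breach", owner)
```

The design choice worth noting is the `unowned` state: a finding with no mapped owner is itself treated as something to route, which is what stops findings from bouncing between teams.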

Narrative spine 01: Strip out the fake maturity signal.

Having reports is not the same as having a governable platform. The first move was separating visibility from actionability.

Narrative spine 02: Put control decisions in the right layer.

Some problems belonged in guardrails, some in reporting, and some in operating boundaries between teams. Treating them as one category was part of the failure.

Narrative spine 03: Leave behind something teams can actually run.

The result needed to survive after the workshop or audit cycle ended. That meant explicit routing, explicit ownership, and less dependence on memory.

What improved: Stronger compliance execution, clearer escalation, and fewer dead-end findings.
Why buyers care: This is the difference between a platform that merely reports risk and one that can reliably absorb governance requirements without theatre.
What it says about my work: I am usually most useful where the technical control surface and the operating model have drifted apart.

Outcome: stronger compliance execution and clearer control behaviour without adding more governance theatre. The real win was not another layer of reporting. It was reducing the gap between architecture intent, control evidence, and the people who actually had to act.

Case study 02

Cloud cost control with engineering teeth

The spend issue looked financial from a distance. Up close, it was a platform design and runtime behaviour problem.

Situation

Cost pressure was visible, but the estate had already normalised a set of habits that made waste feel structural: always-on runtime assumptions, weak environment lifecycle discipline, and workloads placed or shaped in ways that quietly accumulated expense.

Problem beneath the problem

Dashboards existed. Visibility was not the limiting factor. The real issue was that cost ownership had drifted away from engineering decisions, so expensive behaviour kept being designed back into the platform.

What changed

Focused on the runtime and lifecycle habits behind the spend: commitment optimisation, utilisation discipline, environment scheduling, and workload reshaping where the execution model itself was wasteful.
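
Environment scheduling is the easiest of these levers to put rough numbers on. The sketch below assumes a hypothetical non-production schedule of 07:00 to 19:00 on weekdays; the case study does not state the actual schedule or savings figures, so treat the hours and names here as illustrative only. It shows the scheduling decision itself and the share of always-on hours that such a schedule would avoid.

```python
from datetime import datetime

# Hypothetical schedule: non-production runs 07:00-19:00, Monday to Friday.
START_HOUR = 7
STOP_HOUR = 19
WORKDAYS = range(0, 5)  # datetime.weekday(): Monday=0 ... Friday=4


def should_run(now: datetime) -> bool:
    """True if a scheduled environment should be up at this moment."""
    return now.weekday() in WORKDAYS and START_HOUR <= now.hour < STOP_HOUR


def runtime_hours_avoided() -> float:
    """Fraction of always-on weekly hours the schedule switches off."""
    always_on = 24 * 7
    scheduled = (STOP_HOUR - START_HOUR) * len(WORKDAYS)
    return 1 - scheduled / always_on
```

With these assumed hours the environment runs 60 of 168 hours a week, so roughly 64% of running hours are avoided. That only applies to charges billed by running time; storage and similar standing costs continue regardless.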

Why it mattered

This was useful because the result survived after the reporting conversation ended. The saving came from engineering change, not a better meeting about cost.

Outcome: material annual savings driven by engineering change, not just reporting.

Case study 03

Modernising long-running workloads into cleaner ECS task patterns

The workload technically ran. It just ran in a shape that was expensive, awkward to scale, and harder to reason about than it needed to be.

Situation

Some workloads stay alive long after their original execution model stopped making sense. They become always-on by habit, consume more resources than needed, and create operational complexity because their trigger, worker, and failure paths were never cleanly separated.

Problem beneath the problem

The issue was not just efficiency. The workload shape itself was wrong for the behaviour being asked of it. That meant isolation, retries, scaling behaviour, and failure handling were all carrying unnecessary ambiguity.

What changed

Moved the execution pattern toward event-driven ECS tasks with clearer separation between trigger, worker, retry, and dead-letter paths. Reduced always-on assumptions and made the workload easier to scale and reason about under uneven demand.
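
The separation between worker, retry, and dead-letter paths can be shown without any AWS dependency. This is an in-memory sketch under stated assumptions: the `Message` type, the `drain` function, and the limit of three attempts are hypothetical, and in a real estate the queue and dead-letter behaviour would more likely come from a managed queue feeding short-lived ECS tasks than from application code. What it illustrates is that each path has exactly one job, so a poison message ends up somewhere visible instead of blocking or silently vanishing.

```python
from collections import deque
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # hypothetical retry budget per message


@dataclass
class Message:
    body: str
    attempts: int = 0


def drain(queue, dead_letters, handler, max_attempts=MAX_ATTEMPTS):
    """Process messages until the queue is empty.

    Success path: the handler result is collected.
    Retry path: a failed message is requeued until its budget is spent.
    Dead-letter path: an exhausted message is parked for inspection.
    """
    processed = []
    while queue:
        msg = queue.popleft()
        msg.attempts += 1
        try:
            processed.append(handler(msg.body))
        except Exception:
            if msg.attempts >= max_attempts:
                dead_letters.append(msg)
            else:
                queue.append(msg)
    return processed


def example_handler(body: str) -> str:
    """Stand-in for the real unit of work; fails on a known bad payload."""
    if body == "bad":
        raise ValueError("cannot process payload")
    return body.upper()
```

Calling `drain(deque([Message("ok"), Message("bad")]), dlq, example_handler)` with an empty list `dlq` processes the first message and parks the second in `dlq` after three attempts, which is the behaviour an always-on worker with ad hoc error handling tends to blur.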

Why it mattered

Better architecture here meant lower operational drag, clearer failure handling, and a platform shape that was more aligned with the actual runtime behaviour of the system.

Outcome: better scaling, cleaner failure handling, and less operational drag.

Case study 04

Reliability remediation for a public-facing digital platform

The visible symptom was repeated dropouts. The real work was tracing the failure pattern through architecture and performance behaviour until the outage loop stopped.

Situation

When a public-facing platform drops out repeatedly, everyone feels the pain but the remediation often gets fragmented across layers. One team sees hosting, another sees application behaviour, another sees traffic, and the actual outage pattern survives all of them.

Problem beneath the problem

The main risk in this kind of work is chasing symptoms. If the investigation stays too shallow, teams end up stabilising around the outage instead of removing it.

What changed

Worked through the real failure path rather than surface explanations, addressing the architecture and performance behaviour driving the instability. The job was not to make the graphs prettier. It was to stop the user-visible failure from recurring.

Why it mattered

Reliability work is only meaningful if the outage pattern actually disappears. In this case it did.

Outcome: critical website dropouts reduced from roughly 20 per day to zero for the Australian Museum.

How I usually approach this work

The recurring pattern is boring in a good way.

Find the boundary problem. Find the runtime habit. Find the ownership blur. Then decide whether the fix belongs in controls, platform defaults, workload shape, or team operating model.

Step 1 — Strip out noise

Ignore theatre, dashboards, and language that only describes symptoms. Identify the real technical or ownership constraint.

Step 2 — Change the system, not just the explanation

Pick the smallest architectural, operational, or control-layer move that materially improves safety, clarity, runtime behaviour, or cost.

Step 3 — Leave behind something operable

Make the result visible enough that teams can keep running it without heroics, memory, or permanent consultant dependence.