Selected Work · Case Studies · Public Proof

Evidence, not branding copy.

Public-facing proof for the kinds of AWS platform problems I solve. Sanitised for confidentiality: names and specifics are generalised, but the engineering patterns are real.

Scope

Governance, cost discipline, workload modernisation, reliability, and platform operating model cleanup.

Pattern

The underlying problem is usually not lack of tooling — it's blurry ownership, weak runtime defaults, or controls in the wrong layer.

These are short narrative slices: environment shape, real failure mode, what changed, and why it mattered.

Governance and compliance execution in regulated AWS estates

Flagship Case Study · Governance Operating Model

The visible issue was compliance drift. The underlying failure was that controls, ownership, and reporting had decoupled from each other — so the estate looked governed on paper while behaving ambiguously in practice.

Environment Shape

Multi-account AWS estate in a regulated setting with platform, security, and delivery teams all touching the control surface.

Failure Pattern

Preventive controls were blunt, detective controls were noisy, and exceptions lived in tickets, memory, and side conversations.

Actual Intervention

Reworked the operating model around guardrails, reporting, escalation paths, and ownership boundaries so findings could land somewhere actionable.
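To make "findings land somewhere actionable" concrete, here is a minimal Python sketch of the routing idea. It is illustrative only, not the implementation used in the engagement. The account IDs, team names, control ID format, and the `ControlException` record are all hypothetical. The point it shows is that exceptions become explicit records with an expiry date, and every finding resolves to a named owner or an escalation path.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical ownership map: account ID -> owning team.
OWNERS = {
    "111111111111": "platform-team",
    "222222222222": "delivery-team-a",
}

# Hypothetical fallback owner for findings in accounts nobody has claimed.
ESCALATION_OWNER = "cloud-governance"


@dataclass(frozen=True)
class ControlException:
    """An approved, time-boxed exception, recorded instead of living in a ticket."""
    account_id: str
    control_id: str
    approved_by: str
    expires: date


def route_finding(account_id, control_id, today, exceptions):
    """Return (action, party) for a finding: suppress, assign, or escalate."""
    for exc in exceptions:
        matches = exc.account_id == account_id and exc.control_id == control_id
        # An exception only suppresses a finding while it is still in date.
        if matches and exc.expires >= today:
            return ("suppress", exc.approved_by)
    owner = OWNERS.get(account_id)
    if owner is None:
        # No recorded owner: send to the escalation path rather than dropping it.
        return ("escalate", ESCALATION_OWNER)
    return ("assign", owner)
```

In this sketch an expired exception stops suppressing automatically, so the finding returns to the owning team without anyone having to remember it.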

Outcome

Stronger compliance execution and clearer control behaviour without adding more governance theatre. The real win was reducing the gap between architecture intent, control evidence, and the people who actually had to act.

Cloud cost control with engineering teeth

Cost & Operating Discipline

The spend issue looked financial from a distance. Up close, it was a platform design and runtime behaviour problem.

Situation

Cost pressure was visible, but the estate had already normalised a set of habits that made waste feel structural: always-on runtime assumptions, weak environment lifecycle discipline, and workloads placed or shaped in ways that quietly accumulated expense.

Problem Beneath the Problem

Dashboards existed. Visibility was not the limiting factor. The real issue was that cost ownership had drifted away from engineering decisions, so expensive behaviour kept being designed back into the platform.

What Changed

Focused on the runtime and lifecycle habits behind the spend: commitment optimisation, utilisation discipline, environment scheduling, and workload reshaping where the execution model itself was wasteful.
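Environment scheduling is the easiest of these habits to illustrate. The sketch below is a hedged example, not the actual scheduler: the environment names and the 07:00 to 19:00 weekday window are assumptions chosen for illustration. It shows the decision a scheduler makes before starting or stopping a non-production environment.

```python
from datetime import datetime

# Hypothetical working window for non-production environments (local time).
WORK_START_HOUR = 7
WORK_END_HOUR = 19


def should_be_running(environment: str, now: datetime) -> bool:
    """Decide whether an environment should be up at the given time."""
    if environment == "prod":
        # Production is never scheduled off in this sketch.
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = WORK_START_HOUR <= now.hour < WORK_END_HOUR
    return is_weekday and in_hours


def weekly_running_hours() -> int:
    """Hours per week a non-production environment runs under this window."""
    return 5 * (WORK_END_HOUR - WORK_START_HOUR)
```

Under these assumed hours a non-production environment runs 60 of the 168 hours in a week. What that means for the bill depends on the pricing model and on what can actually be stopped.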

Outcome

Material annual savings driven by engineering change, not just reporting. The savings held after the reporting conversation ended.

Modernising long-running workloads into cleaner ECS task patterns

Workload Modernisation

The workload technically ran. It just ran in a shape that was expensive, awkward to scale, and harder to reason about than it needed to be.

Situation

Some workloads stay alive long after their original execution model stopped making sense. They become always-on by habit, consume more resources than needed, and create operational complexity because their trigger, worker, and failure paths were never cleanly separated.

Problem Beneath the Problem

The issue was not just efficiency. The workload shape itself was wrong for the behaviour being asked of it. That meant isolation, retries, scaling behaviour, and failure handling were all carrying unnecessary ambiguity.

What Changed

Moved the execution pattern toward event-driven ECS tasks with clearer separation between trigger, worker, retry, and dead-letter paths. Reduced always-on assumptions and made the workload easier to scale and reason about under uneven demand.
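The separation between trigger, worker, retry, and dead-letter paths can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the production design: `Dispatcher`, `Job`, and `max_attempts` are hypothetical names, and the list-based queue stands in for whatever managed queue sits between the trigger and the ECS task.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Job:
    job_id: str
    payload: dict
    attempts: int = 0


@dataclass
class Dispatcher:
    """Hypothetical in-memory stand-in for the queue between trigger and worker."""
    worker: Callable[[dict], None]
    max_attempts: int = 3
    pending: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)
    completed: list = field(default_factory=list)

    def trigger(self, job_id: str, payload: dict) -> None:
        # Trigger path: only enqueues work, never runs it.
        self.pending.append(Job(job_id, payload))

    def run_once(self) -> None:
        # Worker path: process the current batch once.
        batch, self.pending = self.pending, []
        for job in batch:
            job.attempts += 1
            try:
                self.worker(job.payload)
            except Exception:
                if job.attempts >= self.max_attempts:
                    # Dead-letter path: park the job for inspection.
                    self.dead_letter.append(job)
                else:
                    # Retry path: requeue for a later pass.
                    self.pending.append(job)
            else:
                self.completed.append(job.job_id)

    def drain(self) -> None:
        # Keep processing until nothing is left to run or retry.
        while self.pending:
            self.run_once()
```

In a real deployment the retry and dead-letter paths would typically be handled by the queueing service rather than application code, for example an SQS dead-letter queue. The sketch only shows that each path has one explicit home.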

Outcome

Better scaling, cleaner failure handling, and less operational drag. The platform shape was more aligned with the actual runtime behaviour of the system.

Reliability remediation for a public-facing digital platform

Reliability Under Pressure

The visible symptom was repeated dropouts. The real work was tracing the failure pattern through architecture and performance behaviour until the outage loop stopped.

Situation

When a public-facing platform drops out repeatedly, everyone feels the pain but the remediation often gets fragmented across layers. One team sees hosting, another sees application behaviour, another sees traffic, and the actual outage pattern survives all of them.

Problem Beneath the Problem

The main risk in this kind of work is chasing symptoms. If the investigation stays too shallow, teams end up stabilising around the outage instead of removing it.

What Changed

Worked through the real failure path rather than surface explanations, addressing the architecture and performance behaviour driving the instability. The job was not to make the graphs prettier — it was to stop the user-visible failure from recurring.

Outcome

Critical website dropouts for the Australian Museum fell from roughly 20 per day to zero. Reliability work is only meaningful if the outage pattern actually disappears, and in this case it did.

How I usually approach this work

The recurring pattern is boring in a good way:

Strip out noise — ignore theatre, dashboards, and language that only describes symptoms. Identify the real technical or ownership constraint.
Change the system, not just the explanation — pick the smallest architectural, operational, or control-layer move that materially improves safety, clarity, runtime behaviour, or cost.
Leave behind something operable — make the result visible enough that teams can keep running it without heroics, memory, or permanent consultant dependence.

Next steps

Work with me → View resume → Technical writing →