Selected work · case studies · public proof
Public-facing proof for the kind of AWS platform problems I solve.
This is the sanitised version of the work: specific enough to show judgment, stripped back enough to stay public-safe, and written for buyers who care more about outcomes than buzzwords.
How to read these
These are not polished marketing fairy tales.
They are short narrative slices: the environment shape, the real failure mode, what changed, and why the result mattered. The names are softened where they need to be, but the engineering pattern is real.
Flagship case study · governance operating model
Governance and compliance execution in regulated AWS estates
The visible issue was compliance drift. The underlying failure was that controls, ownership, and reporting had decoupled from each other, so the estate looked governed on paper while behaving ambiguously in practice.
This kind of estate usually does not fail because nobody cares about governance. It fails because the control system has become detached from runtime reality. Findings exist, dashboards exist, owners exist in theory, but the path from “something drifted” to “someone corrects it” is full of ambiguity.
The estate had the classic shape: multiple accounts, regulated requirements, a growing stack of guardrails, detective findings, and reporting outputs. On the surface that looks mature. Under load it often behaves like three overlapping systems that never fully agree with each other.
The main issue was not a missing policy or a missing dashboard. It was operating-boundary confusion. Platform owned some controls, security interpreted others, delivery teams triggered the underlying findings, and exceptions passed between all three without a clean accountability path.
Preventive controls were often too broad to be useful, detective controls generated noise without enough routing context, and evidence collection made audit conversations possible without making remediation reliable. The system was legible to governance people but not operable for delivery teams.
The work moved the control surface closer to an actual operating model: tighten where preventive controls belong, route detective findings to named owners, make exceptions explicit instead of tribal, and align reporting with accountable action rather than with abstract status visibility.
A preventive or detective control trips, but context is thin and the system does not distinguish well between breach, exception, and noise.
Dashboards and reports make drift visible, but they do not make the next owner or next action obvious.
Platform, security, and delivery each see a partial problem, so findings bounce or sit while exception logic lives in tickets and memory.
Preventive controls are used where they genuinely protect boundaries; detective controls are tuned around actionable routing rather than raw volume.
Reporting carries enough context that the issue lands with the team responsible for the boundary, workload, or exception path.
The estate becomes easier to run because compliance evidence, escalation, and remediation now reinforce each other instead of competing.
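As an illustration of the routing idea described above, here is a minimal Python sketch that classifies a finding as breach, exception, or noise and attaches a named owner. It is not the tooling used in the engagement. Every name in it (the control identifiers, account numbers, team names, and the `route` function itself) is a hypothetical placeholder, and a real estate would load ownership and exception data from its own registry rather than from constants.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Finding:
    control_id: str   # hypothetical control identifier
    account_id: str
    resource: str


@dataclass(frozen=True)
class ApprovedException:
    control_id: str
    account_id: str
    owner: str        # team accountable for the exception
    expires: date     # exceptions are explicit and time-boxed


# Hypothetical ownership data; assumed for illustration only.
ACCOUNT_OWNERS = {"111111111111": "payments-delivery"}
PLATFORM_CONTROLS = {"org-region-lock"}   # boundaries owned by platform
SANDBOX_ACCOUNTS = {"333333333333"}       # findings here are treated as noise
DEFAULT_OWNER = "security-triage"         # nothing is left without an owner


def route(
    finding: Finding,
    exceptions: Sequence[ApprovedException],
    today: date,
) -> Tuple[str, Optional[str]]:
    """Return (status, owner) where status is noise, exception, or breach."""
    if finding.account_id in SANDBOX_ACCOUNTS:
        return ("noise", None)

    for exc in exceptions:
        same_scope = (
            exc.control_id == finding.control_id
            and exc.account_id == finding.account_id
        )
        # An expired exception no longer shields the finding.
        if same_scope and exc.expires >= today:
            return ("exception", exc.owner)

    if finding.control_id in PLATFORM_CONTROLS:
        return ("breach", "platform")
    return ("breach", ACCOUNT_OWNERS.get(finding.account_id, DEFAULT_OWNER))
```

The design choice worth noting is the default owner: a finding that matches no ownership rule still lands with a named team instead of sitting in a shared queue.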
Having reports is not the same as having a governable platform. The first move was separating visibility from actionability.
Some problems belonged in guardrails, some in reporting, and some in operating boundaries between teams. Treating them as one category was part of the failure.
The result needed to survive after the workshop or audit cycle ended. That meant explicit routing, explicit ownership, and less dependence on memory.
Outcome: stronger compliance execution and clearer control behaviour without adding more governance theatre. The real win was not another layer of reporting. It was reducing the gap between architecture intent, control evidence, and the people who actually had to act.
Case study 02
Cloud cost control with engineering teeth
The spend issue looked financial from a distance. Up close, it was a platform design and runtime behaviour problem.
Cost pressure was visible, but the estate had already normalised a set of habits that made waste feel structural: always-on runtime assumptions, weak environment lifecycle discipline, and workloads placed or shaped in ways that quietly accumulated expense.
Dashboards existed. Visibility was not the limiting factor. The real issue was that cost ownership had drifted away from engineering decisions, so expensive behaviour kept being designed back into the platform.
The work focused on the runtime and lifecycle habits behind the spend: commitment optimisation, utilisation discipline, environment scheduling, and workload reshaping where the execution model itself was wasteful.
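To make environment scheduling concrete, the following Python sketch shows the kind of decision a scheduler might make for a non-production environment. It is an assumption-laden illustration, not the mechanism used in this case: the `lifecycle` tag, the `Schedule` type, and the office-hours window are all hypothetical, and the function only decides whether an environment should be up, leaving the actual start and stop calls to whatever automation the estate already has.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Mapping


@dataclass(frozen=True)
class Schedule:
    start: time                 # local time the environment should come up
    stop: time                  # local time it should be shut down
    weekdays_only: bool = True


def should_be_running(
    tags: Mapping[str, str],
    now: datetime,
    schedule: Schedule,
) -> bool:
    """Decide whether an environment should be up at the given moment."""
    # Hypothetical opt-out tag for environments that must stay on.
    if tags.get("lifecycle") == "always-on":
        return True
    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    if schedule.weekdays_only and now.weekday() >= 5:
        return False
    return schedule.start <= now.time() < schedule.stop


office_hours = Schedule(start=time(8, 0), stop=time(19, 0))
```

Making always-on an explicit opt-out rather than the default is the point: staying up around the clock becomes a decision someone has to own.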
This was useful because the result survived after the reporting conversation ended. The saving came from engineering change, not a better meeting about cost.
Case study 03
Modernising long-running workloads into cleaner ECS task patterns
The workload technically ran. It just ran in a shape that was expensive, awkward to scale, and harder to reason about than it needed to be.
Some workloads stay alive long after their original execution model stopped making sense. They become always-on by habit, consume more resources than needed, and create operational complexity because their trigger, worker, and failure paths were never cleanly separated.
The issue was not just efficiency. The workload shape itself was wrong for the behaviour being asked of it. That meant isolation, retries, scaling behaviour, and failure handling were all carrying unnecessary ambiguity.
The work moved the execution pattern toward event-driven ECS tasks with clearer separation between trigger, worker, retry, and dead-letter paths. It also reduced always-on assumptions and made the workload easier to scale and reason about under uneven demand.
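The separation between trigger, worker, retry, and dead-letter paths can be sketched in a few lines of Python. This is an in-memory model of the pattern only, under stated assumptions: it does not call ECS or any queue service, the `Pipeline` class and the retry budget of three attempts are hypothetical, and in a real deployment each path would map onto managed infrastructure rather than Python containers.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


@dataclass
class Message:
    body: str
    attempts: int = 0


@dataclass
class Pipeline:
    worker: Callable[[str], None]  # raises an exception on failure
    queue: Deque[Message] = field(default_factory=deque)
    completed: List[str] = field(default_factory=list)
    dead_letters: List[Message] = field(default_factory=list)

    def trigger(self, body: str) -> None:
        """Trigger path: enqueue work without running it."""
        self.queue.append(Message(body))

    def drain(self) -> None:
        """Worker path: process until the queue is empty."""
        while self.queue:
            msg = self.queue.popleft()
            msg.attempts += 1
            try:
                self.worker(msg.body)
            except Exception:
                if msg.attempts >= MAX_ATTEMPTS:
                    # Dead-letter path: park the message for inspection.
                    self.dead_letters.append(msg)
                else:
                    # Retry path: requeue behind other pending work.
                    self.queue.append(msg)
            else:
                self.completed.append(msg.body)
```

Because the worker only ever sees one message body, it can run as a short-lived task that starts on demand and exits, while retry and dead-letter behaviour stay visible outside the worker code.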
Better architecture here meant lower operational drag, clearer failure handling, and a platform shape that was more aligned with the actual runtime behaviour of the system.
Case study 04
Reliability remediation for a public-facing digital platform
The visible symptom was repeated dropouts. The real work was tracing the failure pattern through architecture and performance behaviour until the outage loop stopped.
When a public-facing platform drops out repeatedly, everyone feels the pain but the remediation often gets fragmented across layers. One team sees hosting, another sees application behaviour, another sees traffic, and the actual outage pattern survives all of them.
The main risk in this kind of work is chasing symptoms. If the investigation stays too shallow, teams end up stabilising around the outage instead of removing it.
The work traced the real failure path rather than settling for surface explanations, and addressed the architecture and performance behaviour driving the instability. The job was not to make the graphs prettier. It was to stop the user-visible failure from recurring.
Reliability work is only meaningful if the outage pattern actually disappears. In this case it did.