99.9% Uptime
Reaching 99.9% Uptime With a Multi-AZ EKS Rebuild
Three outages in six months became zero in twelve, after a multi-AZ EKS rebuild and an observability overhaul.
The Challenge
A SaaS company running on a single-region EKS cluster had experienced three significant outages in six months, each lasting 45 to 90 minutes. The cluster had no pod disruption budgets, no horizontal pod autoscaling, and a custom alerting setup that generated so many false positives that the team had started ignoring pages. A single AZ failure would bring down the entire application.
The Approach
We designed a multi-AZ EKS architecture with node groups distributed across three availability zones and pod anti-affinity rules that spread critical services across AZs. We added pod disruption budgets for all stateful services. We rebuilt the observability stack on Prometheus and Grafana with properly tuned alerting thresholds, which cut alert volume by 85% while improving signal quality. Horizontal pod autoscaling was configured for all variable-load services, driven by custom business metrics.
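To make the building blocks above concrete, here is a minimal sketch, not the client's actual configuration. It uses Python to generate three Kubernetes manifests as JSON, which kubectl accepts: a Deployment with zone-level pod anti-affinity, a pod disruption budget, and an autoscaler driven by a custom metric. The service name `checkout`, the image, the metric `orders_in_flight`, and every replica count and threshold are hypothetical. The autoscaler also assumes a custom metrics adapter is installed in the cluster to expose that metric. For brevity all three objects target one service; in the engagement described, disruption budgets covered stateful services and autoscaling covered variable-load ones.

```python
import json

# Well-known node label that carries the availability zone of each node.
ZONE_KEY = "topology.kubernetes.io/zone"


def deployment(name: str, image: str, replicas: int = 3) -> dict:
    """Deployment whose pods prefer to land in different AZs."""
    labels = {"app": name}
    # "preferred" (soft) anti-affinity is an assumption made here: a hard rule
    # keyed on the zone would cap the service at one pod per AZ and block
    # the autoscaler from scaling past three replicas.
    anti_affinity = {
        "preferredDuringSchedulingIgnoredDuringExecution": [
            {
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": {"matchLabels": labels},
                    "topologyKey": ZONE_KEY,
                },
            }
        ]
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {"podAntiAffinity": anti_affinity},
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


def disruption_budget(name: str, min_available: int = 2) -> dict:
    """PDB that keeps at least `min_available` pods up during voluntary
    disruptions such as node drains."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{name}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": name}},
        },
    }


def autoscaler(name: str, metric: str, target_per_pod: str = "50") -> dict:
    """HPA that scales on a per-pod custom metric (hypothetical name and target)."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{name}-hpa"},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": 3,
            "maxReplicas": 12,
            "metrics": [
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": metric},
                        "target": {"type": "AverageValue", "averageValue": target_per_pod},
                    },
                }
            ],
        },
    }


def build_manifests(name: str, image: str, metric: str) -> list:
    """Return the three objects in the order they would be applied."""
    return [deployment(name, image), disruption_budget(name), autoscaler(name, metric)]


if __name__ == "__main__":
    items = build_manifests("checkout", "example.com/checkout:1.0", "orders_in_flight")
    # A v1 List wrapper lets kubectl apply all objects from one JSON document.
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Saved under a hypothetical name such as `manifests.py`, the output could be piped into `kubectl apply -f -`. Setting `minAvailable` to 2 alongside a minimum of three replicas lets a node drain evict one pod at a time without taking the service below two running copies.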
The Result
The company recorded zero downtime incidents in the 12 months following the migration. Alert fatigue was eliminated: the team went from ignoring pages to treating every alert as a real signal. The Grafana dashboards give the team real-time visibility into application and infrastructure health, and the on-call runbooks mean any engineer can handle common incidents without escalation. The company also passed a security audit that required evidence of uptime against its SLA.
Client identities are kept confidential by agreement. Metrics are verified and unexaggerated.
The Service Behind This
How we'd approach yours
Kubernetes Setup & Migration
Production-grade EKS, set up right the first time.
Have a similar challenge?
Book a free 30-minute infrastructure assessment and we'll show you where the same gains are hiding in your setup.
Free · No commitment · Reply within 12-24 hours