SaaSSeries B

99.9% Uptime

Reaching 99.9% Uptime With a Multi-AZ EKS Rebuild

Three outages in six months became zero in twelve, after a multi-AZ EKS rebuild and an observability overhaul.

EKSKubernetesPrometheusGrafana

The Challenge

A SaaS company running on a single-region EKS cluster had experienced three significant outages in six months, each lasting 45 to 90 minutes. The cluster had no pod disruption budgets, no horizontal pod autoscaling, and a custom alerting setup that generated so many false positives the team had started ignoring pages. One AZ failure would bring down the entire application.

The Approach

We designed a multi-AZ EKS architecture with node groups distributed across three availability zones and pod anti-affinity rules ensuring critical services spread across AZs. Pod disruption budgets were implemented for all stateful services. We rebuilt the observability stack using Prometheus and Grafana with properly tuned alerting thresholds, reducing alert volume by 85% while improving signal quality. HPA was configured for all variable-load services based on custom business metrics.

The Result

Zero downtime incidents in 12 months following the migration. Alert fatigue was eliminated. The team went from ignoring pages to treating every alert as a real signal. The Grafana dashboards give the team real-time visibility into application and infrastructure health, and the on-call runbooks mean any engineer can handle common incidents without escalation. The company successfully passed a security audit that required demonstrated uptime SLA evidence.

Client identities are kept confidential by agreement. Metrics are verified and unexaggerated.

Have a similar challenge?

Book a free 30-minute infrastructure assessment and we'll show you where the same gains are hiding in your setup.

Book Free Infrastructure Assessment

Free · No commitment · Reply within 12-24 hours