On October 14, Cartesia’s API experienced a major outage between 1:38 PM and 4:16 PM PT, resulting in 10–15 minutes of complete downtime, 45 minutes of high latency, and 1 hour 45 minutes of degraded performance with response times elevated by 200–400 ms. The root cause was that the Kubernetes control, network, and compute planes shared nodes, allowing unbounded workloads from a new deployment to overload control plane nodes and trigger cascading failures. The issue was resolved after isolating and remediating the affected workloads. To prevent recurrence, we are enforcing pod resource limits, isolating the control, ingress, and compute planes, and expanding regional surge capacity to improve recovery times in future incidents.
The core contributor to this issue was that, in our primary inference cluster, the Kubernetes control plane, network plane, and compute plane did not have node-level isolation. As a result, the K8s API Server, Ingress controllers, and custom workloads could be colocated on the same node.
At 1:38 PM, we deployed workloads with a high replica count. These workloads were scheduled onto the control plane nodes and overloaded them. While the control plane nodes were recovering, the workloads were rescheduled onto the same nodes, causing a thundering herd of crash loops.
The root cause was identified after about 20 minutes, but full restoration was delayed because the underlying nodes required manual intervention from our cloud service provider (CSP). By 4:12 PM, the cluster admins were able to cordon the control plane nodes, bring them back up, and delete the bad workloads, restoring service on the remaining cluster.
Timeline (all times PT):

1:38 PM - Got paged for the primary API cluster being inaccessible.
1:40 PM - Pinged our primary CSP that the cluster was inaccessible via SSH.
1:50 PM - Sharded traffic globally across different AZs.
2:20 PM - Added sufficient surge capacity across different AZs to serve traffic stably.
3:49 PM - CSP reported the cluster was back up; however, the Ingress was still unusable due to the head nodes being cordoned.
4:16 PM - Ingress and control plane were back up; traffic was moved back to the US and was stable.
Prevention: Running a workload with unbounded resource limits should not be permitted. The workload we ran was safe for a typical compute node, but on the much smaller control plane nodes, the lack of limits OOMed the entire node. We are now enforcing that every workload has the requests and limits fields configured on its Pods; a sketch of what that looks like appears below.
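As a rough illustration (not our actual manifests), the following sketch uses the official Kubernetes Python client to build a Deployment whose container declares explicit requests and limits. The image name, replica count, and resource values are hypothetical.

```python
# Minimal sketch: a Deployment whose container declares explicit resource
# requests and limits. All names and values here are illustrative, not ours.
from kubernetes import client

container = client.V1Container(
    name="inference-worker",
    image="registry.example.com/inference-worker:latest",
    resources=client.V1ResourceRequirements(
        requests={"cpu": "2", "memory": "4Gi"},  # what the scheduler reserves
        limits={"cpu": "4", "memory": "8Gi"},    # ceiling the kubelet enforces
    ),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="inference-worker"),
    spec=client.V1DeploymentSpec(
        replicas=50,  # high replica counts are safe only with bounded resources
        selector=client.V1LabelSelector(match_labels={"app": "inference-worker"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "inference-worker"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

# client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```

With requests set, the scheduler only places a Pod on a node that actually has the headroom; with limits set, a runaway container is throttled (CPU) or OOM-killed (memory) on its own rather than taking the whole node down with it.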
Prevention: A shared K8s control, ingress, and compute plane is not common practice. We are working with our cloud service provider to remediate this and move to more standard practices around complete separation of concerns: an unschedulable or cordoned set of control plane nodes (in the case of CSPs like AWS, these are usually not even exposed to customers), and a separate set of nodes for managing Ingress controllers. A sketch of the scheduling constraints that enforce this separation appears below.
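As a sketch of the kind of scheduling constraints involved, again using the Kubernetes Python client and with hypothetical node labels, taint keys, and images: dedicated ingress nodes can be tainted so that only Pods which explicitly tolerate the taint land there, while compute Pods are pinned to their own pool.

```python
# Minimal sketch of plane separation via taints, tolerations, and node
# selectors. Labels, taint keys, and images are illustrative assumptions.
from kubernetes import client

# Taint applied once by an admin to each dedicated ingress node (e.g. with
# `kubectl taint nodes <node> dedicated=ingress:NoSchedule`); it repels any
# Pod that does not explicitly tolerate it.
ingress_taint = client.V1Taint(key="dedicated", value="ingress", effect="NoSchedule")

# Ingress controller Pods tolerate the taint and select the ingress pool.
ingress_pod_spec = client.V1PodSpec(
    containers=[client.V1Container(
        name="ingress-controller",
        image="registry.example.com/ingress-controller:latest",
    )],
    node_selector={"node-pool": "ingress"},
    tolerations=[client.V1Toleration(
        key="dedicated", operator="Equal", value="ingress", effect="NoSchedule",
    )],
)

# Compute Pods select the compute pool and carry no toleration, so the
# scheduler cannot colocate them with control plane or ingress nodes.
compute_pod_spec = client.V1PodSpec(
    containers=[client.V1Container(
        name="inference-worker",
        image="registry.example.com/inference-worker:latest",
    )],
    node_selector={"node-pool": "compute"},
)
```

Managed control planes take this a step further: the API server runs on nodes the customer never sees, which is the direction of the remediation with our CSP.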
Mitigation: We are working on increasing our provisioned capacity on backup clusters on a different CSP in the same locale. This will increase robustness to failures in a single CSP and ensure fast failover recovery.
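Purely as an illustration of the failover behavior this extra capacity enables (the endpoints and mechanism below are assumptions, not our actual setup), the decision logic amounts to preferring the primary cluster while its health endpoint responds and shifting traffic to the warm backup when it does not.

```python
# Illustrative failover decision: prefer the primary cluster, fall back to the
# warm backup on a different CSP in the same locale. Endpoints are made up.
import urllib.request

PRIMARY_HEALTH = "https://primary.example.com/healthz"
BACKUP_HEALTH = "https://backup.example.com/healthz"

def healthy(url: str, timeout: float = 2.0) -> bool:
    """True if the cluster's health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def target_cluster() -> str:
    """Route to the primary when healthy, otherwise to the backup."""
    if healthy(PRIMARY_HEALTH):
        return "primary"
    if healthy(BACKUP_HEALTH):
        return "backup"
    return "none"  # both down: page a human
```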