Elevated errors on TTS & STT API

Incident Report for Cartesia AI

Postmortem

Incident summary

On October 14, Cartesia’s API experienced a major outage between 1:38 PM and 4:16 PM PT, resulting in 10–15 minutes of complete downtime, 45 minutes of high latency, and 1 hour 45 minutes of degraded performance with response times elevated by 200–400 ms. The root cause was that the Kubernetes control, network, and compute planes shared nodes, allowing unbounded workloads from a new deployment to overload control plane nodes and trigger cascading failures. The issue was resolved after isolating and remediating the affected workloads. To prevent recurrence, we are enforcing pod resource limits, isolating the control, ingress, and compute planes, and expanding regional surge capacity to improve recovery times in future incidents.

Impact

  • 10-15 minutes of complete outage
  • 45 minutes of high latencies while bringing up regional clusters to serviceable capacity
  • 1 hour 45 minutes of overall elevated latencies serving from EU/APAC

Root Cause Analysis

The core contributor to this issue was that, in our primary inference cluster, the Kubernetes control plane, network plane, and compute plane did not have node-level isolation. As a result, the K8s API server, ingress controllers, and custom workloads could be colocated on the same nodes.
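
For illustration, a minimal sketch of how this kind of colocation can be observed, assuming the official `kubernetes` Python client; the node names here are hypothetical:

```python
# List non-system pods scheduled onto control plane nodes.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

control_plane_nodes = ["cp-node-1", "cp-node-2"]  # hypothetical names

for node in control_plane_nodes:
    pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
    for pod in pods.items:
        # Anything outside kube-system is a workload sharing the node with
        # the API server and ingress controllers.
        if pod.metadata.namespace != "kube-system":
            print(f"{node}: {pod.metadata.namespace}/{pod.metadata.name}")
```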

At 1:38 PM, we deployed workloads with a high replica count. These workloads were scheduled onto the control plane nodes, overloading them. As the control plane nodes attempted to recover, the workloads were rescheduled onto the same nodes, causing a thundering herd of crash loops.

The root cause was identified after about 20 minutes, but full restoration was delayed because the underlying nodes required manual intervention from our cloud service provider (CSP). By 4:12 PM, the cluster admins were able to cordon the control plane nodes, bring them back up, and delete the offending workloads, restoring service on the remaining cluster.
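
For reference, a minimal sketch of the kind of remediation described above (cordon the affected node, then remove the offending workload so it cannot be rescheduled), assuming the official `kubernetes` Python client; the node and deployment names are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# Cordon: mark the node unschedulable so recovering workloads can't land on it.
core.patch_node("cp-node-1", {"spec": {"unschedulable": True}})

# Delete the high-replica deployment that was crash-looping on the control plane.
apps.delete_namespaced_deployment(name="bad-workload", namespace="default")
```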

Detection and Timeline

  • 1:38 PM - Received pages that the primary API cluster was inaccessible.
  • 1:40 PM - Notified our primary CSP that the cluster was inaccessible via SSH.
  • 1:50 PM - Sharded traffic globally across different AZs.
  • 2:20 PM - Added sufficient surge capacity across different AZs to serve traffic stably.
  • 3:49 PM - CSP reported the cluster was back up; however, ingress was still unusable because the head nodes were cordoned.
  • 4:16 PM - Ingress and control plane were back up; traffic was moved back to the US and stable.

Lessons Learned and Next Steps

  • Prevention: Running a workload with unbounded resource limits should not be permitted. The workload was sized safely for a typical compute node, but on a much smaller control plane node, the lack of limits OOMed the entire node.

    • Moving forward, we’re adding validations to ensure that every workload has resource requests and limits configured on its Pods (a sketch of this check follows this list).
  • Prevention: Sharing nodes between the K8s control, ingress, and compute planes is not common practice. We are working with our cloud service provider to remediate this and move to the standard practice of completely separating these concerns.

    • The intended solution is an unschedulable or cordoned set of control plane nodes (with CSPs like AWS, these are usually not even exposed) and a separate set of nodes for running ingress controllers (a second sketch follows this list).
  • Mitigation: We are increasing our provisioned capacity on backup clusters with a different CSP in the same locale. This will increase robustness to failures in a single CSP and ensure fast failover recovery.
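
A minimal sketch of the requests/limits check referenced above, assuming the official `kubernetes` Python client; in practice this will run as an admission-time validation rather than a one-off audit script:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Flag any running container that doesn't declare both requests and limits.
for pod in v1.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        res = container.resources
        if res is None or not res.requests or not res.limits:
            print(f"unbounded: {pod.metadata.namespace}/{pod.metadata.name} "
                  f"({container.name})")
```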

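For the plane-separation work, a sketch of the target state using the standard control plane NoSchedule taint, again assuming the `kubernetes` Python client; the node names are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# The standard control plane taint: pods without a matching toleration
# will never be scheduled onto these nodes.
taint = {"key": "node-role.kubernetes.io/control-plane", "effect": "NoSchedule"}

for node in ["cp-node-1", "cp-node-2"]:  # hypothetical names
    # Note: a strategic merge patch on spec.taints replaces the existing list.
    v1.patch_node(node, {"spec": {"taints": [taint]}})
```
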
Posted Oct 15, 2025 - 23:47 UTC

Resolved

This incident has now been resolved; we'll follow up with an RCA shortly!
Posted Oct 14, 2025 - 23:17 UTC

Update

We've rolled out a fix and US traffic is being routed to our US clusters. Latencies should be within normal ranges now.
Posted Oct 14, 2025 - 23:16 UTC

Update

We are still working with the primary provider to bring the cluster back up; in the meantime, traffic is being served from our global clusters at roughly `100-200ms` higher latency for US traffic.
Posted Oct 14, 2025 - 22:32 UTC

Update

We've re-routed traffic to our other clusters; the API is still degraded due to elevated latencies on TTS & STT services.
Posted Oct 14, 2025 - 21:34 UTC

Identified

We're facing an outage with an infrastructure provider; we're actively working on a fallback.
Posted Oct 14, 2025 - 20:52 UTC

Update

We are continuing to investigate this issue.
Posted Oct 14, 2025 - 20:47 UTC

Investigating

We are currently investigating this issue.
Posted Oct 14, 2025 - 20:47 UTC
This incident affected: Text to Speech (TTS) (Text to Speech (US)) and Speech to Text (STT) (Speech to Text (US)).