A postmortem has been posted explaining why ChatGPT went down.
What is a postmortem in tech?
A technical postmortem is a retrospective analysis of the events that led to a technical failure. Its purpose is to find out what went wrong and why, identify trouble areas, and determine what can be done to prevent future failures.
Now that we know what a technical postmortem is, let’s dive into what happened with ChatGPT.
The root cause was the same as in a recent DALL·E Web Interface incident.
The hosts serving DALL·E’s Web Experience and the text-curie-001 API went offline because they did not properly rejoin the Kubernetes cluster. The nodes failed to re-join because a particular GPU diagnostics command run during boot exceeded its timeout.
The article went on to say “We do not have control over this timeout or boot script since this is managed by our service provider. This was not anticipated since this behavior is unique to a particular node type in a particular region. The nodes were being cycled as part of a planned Kubernetes version upgrade.”
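To make the failure mode concrete, here is a minimal sketch of how a boot-time diagnostics check that overruns its timeout can keep a node from ever reaching the cluster-join step. The command, timeout value, and function names are assumptions for illustration; the postmortem does not disclose the actual boot script.

```python
import subprocess

# Assumed timeout value; per the postmortem, the real timeout and boot script
# are managed by the service provider.
DIAGNOSTICS_TIMEOUT_S = 60


def gpu_diagnostics_ok() -> bool:
    """Run a GPU health check and treat exceeding the timeout as a failure.

    nvidia-smi stands in here; the postmortem does not name the actual command.
    """
    try:
        subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv"],
            check=True,
            capture_output=True,
            timeout=DIAGNOSTICS_TIMEOUT_S,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, FileNotFoundError):
        return False


def boot_node() -> None:
    """Only register the node with the cluster if diagnostics pass in time."""
    if gpu_diagnostics_ok():
        print("diagnostics passed: joining Kubernetes cluster")  # e.g. kubeadm join
    else:
        # This is the path the affected node type hit: the command ran past the
        # timeout, so the boot sequence never reached the cluster-join step.
        print("diagnostics exceeded timeout: node will not join the cluster")


if __name__ == "__main__":
    boot_node()
```

Because the boot script and its timeout were provider-managed, there was no equivalent of simply raising DIAGNOSTICS_TIMEOUT_S here, which is what makes the later plan to take control of boot-up scripts relevant.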
text-curie-001 was quickly moved to an unaffected node and service was restored.
This raises the question: why wasn’t DALL·E also moved to healthy nodes?
Because of the size of DALL·E’s infrastructure and the limited available capacity, moving it to healthy nodes was not an option. The resulting decrease in capacity degraded the DALL·E service: the request queue grew long enough that most requests timed out before image generations could be served.
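A rough back-of-envelope model shows why a shrunken fleet turns into timeouts rather than just slower responses. Every number below (service time, traffic rate, client timeout, worker counts) is made up for illustration; none come from the postmortem.

```python
# Illustrative, assumed values only.
GENERATION_TIME_S = 8       # time to serve one image generation
CLIENT_TIMEOUT_S = 60       # how long a client waits before giving up
REQUESTS_PER_SECOND = 50    # steady inbound traffic


def expected_wait(queue_length: int, healthy_workers: int) -> float:
    """Approximate time a new request waits before a worker picks it up."""
    return queue_length * GENERATION_TIME_S / healthy_workers


def simulate(healthy_workers: int, seconds: int = 120) -> None:
    """Grow the backlog whenever arrivals outpace serving capacity."""
    capacity_per_second = healthy_workers / GENERATION_TIME_S
    backlog = 0.0
    for _ in range(seconds):
        backlog = max(0.0, backlog + REQUESTS_PER_SECOND - capacity_per_second)
    wait = expected_wait(int(backlog), healthy_workers)
    verdict = "most requests time out" if wait > CLIENT_TIMEOUT_S else "requests are served in time"
    print(f"{healthy_workers} workers: backlog ~{int(backlog)} requests, wait ~{wait:.0f}s -> {verdict}")


simulate(healthy_workers=500)  # full fleet: capacity exceeds traffic, no backlog
simulate(healthy_workers=250)  # degraded fleet: backlog grows, waits blow past the timeout
```

Once arrivals outpace serving capacity, the backlog grows without bound, so waiting it out only makes the queue worse; that is the dynamic the load-shedding levers described next are meant to break.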
Why was the issue not resolved more quickly?
The statement goes on to say “During this incident, we introduced several levers for graceful load shedding in events where DALL·E receives more requests than it can support. To implement one of these levers, we ran a database migration. This migration stalled, had to be rolled back, and then retried due to unexpected row locks. During this time we were unable to serve DALL·E and this issue exacerbated our recovery time.”
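The row-lock stall is easier to picture with a concrete, hypothetical guard around the migration. The sketch below assumes a PostgreSQL-style database accessed through psycopg2; neither the database, the table, nor the migration itself is named in the postmortem. It caps how long the migration may wait on locks, rolls back on failure, and retries, the same roll-back-and-retry shape described above, but with a bounded stall.

```python
import time

import psycopg2
from psycopg2 import errors

# Hypothetical migration, not OpenAI's actual one. The table and column names
# below are invented for illustration.
MIGRATION_SQL = "ALTER TABLE generations ADD COLUMN shed_reason TEXT"


def run_migration(dsn: str, attempts: int = 5) -> None:
    """Run the migration, failing fast and retrying instead of stalling on row locks."""
    conn = psycopg2.connect(dsn)
    try:
        for attempt in range(1, attempts + 1):
            try:
                with conn.cursor() as cur:
                    # Cap how long this transaction may wait for any lock.
                    cur.execute("SET LOCAL lock_timeout = '5s'")
                    cur.execute(MIGRATION_SQL)
                conn.commit()
                return
            except errors.LockNotAvailable:
                # Roll back and back off, rather than holding up the service
                # while the migration sits behind unexpected row locks.
                conn.rollback()
                time.sleep(2 ** attempt)
        raise RuntimeError("migration never acquired its locks; giving up")
    finally:
        conn.close()
```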
What is OpenAI’s plan moving forward?
The postmortem concludes: “Moving forward, we are implementing additional levers for load shedding and investigating alternative means of serving greater numbers of requests, given capacity constraints. One such lever is rejecting all inbound requests when the request queue grows beyond a certain length if the request would certainly time out before returning anyway. Additionally, we are reconfiguring our nodes to give us full control over boot-up scripts and adding new procedures to check for unexpected inconsistencies before full node cycles.”
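As a rough sketch of what that first lever could look like: cap the queue at the longest length from which a new request could still finish inside the client’s timeout, and reject anything beyond it immediately. The worker count, timings, and names below are assumptions for illustration, not OpenAI’s implementation.

```python
from collections import deque

# Illustrative, assumed values only.
GENERATION_TIME_S = 8     # time to serve one image generation
CLIENT_TIMEOUT_S = 60     # how long a client waits before giving up
HEALTHY_WORKERS = 250     # workers currently able to serve requests

# Longest queue a new request can join and still finish within the timeout.
MAX_QUEUE_LENGTH = int(
    (CLIENT_TIMEOUT_S - GENERATION_TIME_S) * HEALTHY_WORKERS / GENERATION_TIME_S
)

queue: deque = deque()


def submit(request_id: str) -> bool:
    """Accept the request, or shed it up front if it is certain to time out."""
    if len(queue) >= MAX_QUEUE_LENGTH:
        # Fast rejection: the client hears back immediately instead of waiting
        # out the full timeout, and the queue can no longer grow without bound.
        return False
    queue.append(request_id)
    return True


def drain_one() -> None:
    """A worker pulls the next request off the queue and generates the image."""
    if queue:
        request_id = queue.popleft()
        print(f"generating image for {request_id}")
```

With a cap like this in place, excess traffic is turned away immediately rather than sitting in a queue it can never clear, which keeps the remaining capacity serving requests that can still succeed.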