> What is surprising is that a classic Time-of-check-time-of-use (TOCTOU) bug was latent in their system until Monday. Call me naive, but I thought AWS, an organization running a service that averages 100 million RPS, would have flushed out TOCTOU bugs from their critical services.
Yeah, right. I'm surprised that anyone involved with software engineering can be surprised by this. I would argue that there are many, if not infinitely many, similar bugs out there. It's just that the right conditions for them to show up haven't been met yet.
I had a similar thought. TOCTOU bugs could be anywhere; they only take a few lines of code to create the conditions for them, and they give no immediate warning that they exist unless you're looking for them.
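To make the "few lines of code" point concrete, here is a minimal, hypothetical sketch (the record store, generation counters, and function names are invented for illustration, not anyone's real code): the racy version checks a generation number and then writes, leaving a window in which an older, slower writer can clobber a newer plan; the safe version makes the check and the write atomic.

```python
import threading

# Hypothetical illustration of a check-then-act race, not real AWS code.
records = {"service.example": ["10.0.0.1"]}
latest_generation = {"service.example": 1}
lock = threading.Lock()

def apply_plan_racy(name, generation, addresses):
    # TOCTOU window: between this check and the write below, another
    # writer with a newer plan may already have updated the record.
    if generation >= latest_generation[name]:
        records[name] = addresses            # may clobber the newer plan
        latest_generation[name] = generation

def apply_plan_safe(name, generation, addresses):
    # Check and write under one lock, so stale plans are rejected atomically.
    with lock:
        if generation >= latest_generation[name]:
            records[name] = addresses
            latest_generation[name] = generation
```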
Aside from not dogfooding, what would have reduced the impact? Because "don't have a bug" is .. well it's the difference between desire and reality.
Not dogfooding is the software and hardware equivalent of the electricity network "black start" - you never want to be there, but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets on a bigger generator, which you spin up to start the turbine spinning until the steam takes up load and the real generator is able to get volts onto the wire.
Pumped Hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity, there's less to go wrong, but if we are in 'line up the holes in the cheese grater' space, you can always have 'want of a nail' issues with any mechanism. The honda generator can have a hole in the petrol tank, the turbine at the pumped hydro can be out for maintenance.
We can draw inspiration from older DNS infrastructure like the root servers. They use a list of names rather than a single name. Imagine if the root (".") were a single nameserver distributed with anycast: a single misconfiguration would bring down the whole internet. Instead we have a list of name servers, operated by different entities, and the only thing that should happen if one goes down is that the next one gets used after a timeout.
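A minimal sketch of that fallback behaviour, assuming the dnspython package (the nameserver IPs in the usage comment are placeholders):

```python
import dns.exception
import dns.resolver  # assumes the dnspython package

def resolve_with_fallback(qname, servers, timeout=2.0):
    """Try each nameserver in turn; only fail when all of them have failed."""
    for server in servers:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server]
        try:
            answer = resolver.resolve(qname, "A", lifetime=timeout)
            return [rr.address for rr in answer]
        except dns.exception.DNSException:
            continue  # this server timed out or is broken; move on to the next one
    raise RuntimeError(f"all nameservers failed for {qname}")

# e.g. resolve_with_fallback("example.com", ["192.0.2.1", "192.0.2.2"])
```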
The article brings up a fairly important point about reducing the impact of bugs. Critical systems need sanity checks for states and values that should never occur during normal operation, with some corresponding action in case they happen. Endpoints could have had sanity checks for invalid DNS, such as zero IP addresses or broken responses, and either reverted to the last valid state or fallen back to a predefined emergency system. Either would have reduced the impact.
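For instance, an endpoint-side check along these lines (a sketch using only the standard library; the cache of last-good addresses is the "revert to a valid state" part):

```python
import socket

_last_good = {}  # name -> last list of addresses that passed the sanity check

def lookup_with_sanity_check(name, port=443):
    """Resolve a name, but never accept an 'impossible' answer like zero addresses."""
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []
    if addresses:                      # looks sane: remember it and use it
        _last_good[name] = addresses
        return addresses
    if name in _last_good:             # empty or broken answer: fall back
        return _last_good[name]
    raise RuntimeError(f"no valid addresses for {name} and no known-good fallback")
```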
> but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets on a bigger generator ...
Tangent, but players of Satisfactory might recognize this condition. If your main power infrastructure somehow goes down, you might not have enough power to start the pumps/downstream machines to power up your main power generators. Thus it's common to have some Tier 0 generators stashed away somewhere to kick start the rest of the system (at least before giant building-sized batteries were introduced a few updates ago).
I think a lot of problems across different systems share this shape. You have a system that needs some autonomy (like an aeroplane in flight). It has sources of authority (say a sensor, or ATC) that are sometimes unavailable, delayed, or giving wrong data. When that happens we are unwilling to fall back on more autonomy and automation. But there is limited scope for human intervention, due to the scale of the problem or just technical difficulty. We reach an inflection point where the only direction left is to give up some element of human control: accept that systems will sometimes receive bad data and need some autonomy to ignore it when it is contraindicated, and that higher-level control is just another source of possibly false data.
A cap on region size could have helped. Region isolation didn't fail here, so splitting us-east-1 into 3, 4, or 5 smaller regions would have meant a smaller impact.
Having such a gobstoppingly massive singular region seems to be working against AWS
DynamoDB is working on going cellular which should help. Some parts are already cellular, and others like DNS are in progress. https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
us-east-2 already exists and wasn’t impacted. And the choice of where to deploy is yours!
there's already some virtualization going on. (I heard that what people see as us-east-1a might be us-east-1c for others, to spread the load, though obviously it's still too big.)
They used to do that (so that everyone picking 1a wouldn’t actually route traffic to the same az), but at some point they made it so all new accounts had the same AZ labels to stop confusing people.
The services that fail on AWS’ side are usually multi zone failures anyway, so maybe it wouldn’t help.
I’ve often wondered why they don’t make a secret region just for themselves to isolate the control plane but maybe they think dogfooding us-east-1 is important (it probably wouldn’t help much anyway, a meteor could just as easily strike us-secret-1).
This is not exactly true. The AZ names are indeed randomized per account, and this is the identifier that you see everywhere in the APIs. The difference now is that they also expose a mapping from AZ name (randomized) to AZ ID (not randomized), so that you can know that AZ A in one account is actually in the same datacenter as AZ B in a different account. This becomes quite relevant when you have systems spread across accounts but want the communication to stay zonal.
You're both partially right. Some regions have random mapping for AZs; all regions since 2012 have static AZ mapping. https://docs.aws.amazon.com/global-infrastructure/latest/reg...
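You can see the mapping for your own account via EC2's DescribeAvailabilityZones, e.g. with boto3 (the region name below is just an example; credentials are assumed to be configured):

```python
import boto3  # assumes AWS credentials are configured for the account

def az_name_to_id(region="eu-central-1"):
    """Map this account's (possibly randomized) AZ names onto the stable AZ IDs."""
    ec2 = boto3.client("ec2", region_name=region)
    zones = ec2.describe_availability_zones()["AvailabilityZones"]
    return {z["ZoneName"]: z["ZoneId"] for z in zones}

# Two accounts can disagree on which ZoneName is which, but a given ZoneId
# (e.g. something like "euc1-az2") refers to the same physical AZ in both.
print(az_name_to_id())
```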
Oh wow. Thanks for telling me this. I didn't know that this was different for different regions. I just checked some of my accounts, and indeed the mapping is stable between accounts for Frankfurt, for example, but not for Sydney.
The post argues for "control theory" and slowing down changes. (Which... sure, maybe, but it will slow down convergence, or complicate things if some classes of actions end up faster than others.)
But what would make sense is for upstream services to return their load (queue depths, or things like PSI from newer kernels) to downstream ones as part of the API responses, so if shit's going on, the downstream ones become more patient and slow down. (And if things are getting cleaned up, the downstream services can speed back up.)
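Something like the following sketch, where the load signal and the X-Server-Load header are hypothetical (there is no standard header for this), and the downstream client simply widens its polling interval when the upstream reports pressure:

```python
import time
import requests  # the "X-Server-Load" header below is hypothetical, not a standard

def poll_with_load_feedback(url, handle, base_delay=0.05, max_delay=5.0):
    """Poll an upstream endpoint, pacing ourselves by the load it reports back."""
    delay = base_delay
    while True:
        resp = requests.get(url, timeout=10)
        load = float(resp.headers.get("X-Server-Load", "0"))  # e.g. 0.0 .. 1.0
        if load > 0.8:
            delay = min(delay * 2, max_delay)   # upstream is struggling: back off
        else:
            delay = max(delay / 2, base_delay)  # upstream is healthy: speed back up
        handle(resp)                            # caller-supplied work on the response
        time.sleep(delay)
```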
I’ve been thinking about this problem for decades. Load feedback is a wonderful idea, but extremely difficult to put into practice. Every service has a different and unique architecture; and even within a single service, different requests can consume significantly different resources. This makes it difficult, if not impossible, to provide a single quantitative number in response to the question “what request rate can I send you?” It also requires tight coupling between the load balancer and the backends, which has problems of its own.
I haven’t seen anyone really solve this problem at scale with maybe the exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and it isn’t universally applicable to arbitrary workloads.
thanks for the YT paper!
my point is there's no need to try (and fail) to define some universal backpressure semantics between coupling points; after all, this can be done locally, and even after the fact (every time there's an outage, or better yet every time there's a "near miss") the signal to listen to will show up.
and if not, then not, which means (as you said) that the link likely doesn't have this kind of simple semantics, maybe because the nature of the integration is not request-response, or is not otherwise structured to provide this apparent legibility, even if it's causally important for downstream.
simply thinking about this during post-mortems, having the metrics available (which is a given in these complex high-availability systems anyway), and having the option in the SDK seems like the way forward
(yes, I know this is basically the circuit breaker and other Netflix-evangelized ideas with extra steps :))
The simplest and most effective strategy we know of today for automatic recovery, one that gives the impacted service a chance to avoid entering a metastable state, is for clients to implement retries with exponential backoff. No circuit breaker-type functionality is required. Unfortunately, it requires that clients be well behaved.
Also, circuit breakers have issues of their own:
“Even with a single layer of retries, traffic still significantly increases when errors start. Circuit breakers, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a token bucket. This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted.”
Consider a situation in which all the clients have circuit breakers. All of them enter the open state once the trigger condition is met, which drops request load on the service to zero. Your autoscaler reduces capacity to the minimum level in response. Then, all the circuit breakers reset to the closed state. Your service then experiences a sudden rush of normal- or above-normal traffic, causing it to immediately exhaust available capacity.
https://aws.amazon.com/builders-library/timeouts-retries-and...
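In that spirit, a minimal sketch of capped exponential backoff with full jitter plus a local retry token bucket (the class, names, and numbers are mine, not taken from the linked article; this simplified version fails fast when the budget is exhausted rather than retrying at a fixed rate):

```python
import random
import time

class RetryTokenBucket:
    """Local cap on retries: spend a token per retry, earn a fraction back per success."""
    def __init__(self, capacity=10, refill_per_success=0.1):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_success = refill_per_success

    def on_success(self):
        self.tokens = min(self.capacity, self.tokens + self.refill_per_success)

    def try_acquire(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def call_with_backoff(fn, bucket, max_attempts=5, base=0.1, cap=5.0):
    for attempt in range(max_attempts):
        try:
            result = fn()
            bucket.on_success()
            return result
        except Exception:
            if attempt == max_attempts - 1 or not bucket.try_acquire():
                raise  # out of attempts or out of retry budget: fail fast
            # full jitter: sleep a random amount up to the capped exponential delay
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```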
(Almost) irrelevant question. You wrote "...a bigger generator, which you spin up to start the turbine spinning until the steam takes up load..." I once served on a steam powered US Navy guided missile destroyer. In addition to the main engines, we had four steam turbine electrical generators. There was no need--indeed, no mechanism--for spinning any of these turbines electrically, they all started up simply by sending them steam. (To be sure, you'd better ensure that the lube oil was flowing.)
Are you saying it's different on land-based steam power plants? Why?
Most (maybe all?) large grid-scale generators use electromagnets to create the magnetic field they need to generate electricity. These magnets require electricity, so you need a small generator to kickstart your big generator's magnets in order to start producing power. There are other concerns, too; depending on the nature of the plant, there may be other machinery that requires electricity before the plant can operate. It doesn't take much startup energy to open the gate on a hydroelectric dam, but I don't think anyone is shoveling enough coal to cold-start a coal plant without a conveyor belt.
If I had to guess, I'd assume that the generators on your destroyer were gas turbine generators that use some kind of motor as part of their startup process to get the turbine spinning. It's entirely possible that there was an electric motor in there to facilitate the generator's "black-start," but it may have been powered by a battery rather than a smaller generator.
Good to see an analysis emphasizing the metastable failure mode in EC2, rather than getting bogged down by the DNS/Dynamo issue. The Dynamo issue, from their timeline, looks like it got fixed relatively quickly, unlike EC2, which needed a fairly elaborate SCRAM and recovery process that took many hours to execute.
A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.
I was motivated by your back-and-forth in the original AWS summary to go and write this post :)
It's good, and I love that you brought the Google SRE stuff into it.
As someone building a SaaS product launching soon, outages like this are incredibly instructive—though also terrifying.
The "cold start" analogy resonates. As an indie builder, I'm constantly making decisions that optimize for "ship fast" vs "what happens when this breaks." The reality is: you can't architect for every failure mode when you're a team of 1-2 people.
But here's what's fascinating about this analysis: the recommendation for control theory and circuit breakers isn't just for AWS-scale systems. Even small products benefit from graceful degradation patterns. The difference is—at AWS scale, a metastable failure can cascade for 14 hours. At indie scale, you just... lose customers and never know why.
The talent exodus point is also concerning. If even AWS—with their resources and institutional knowledge—can struggle when senior folks leave, what chance do startups have? This is why I'm increasingly convinced that boring, well-documented tech stacks matter more than cutting-edge ones. When you're solo, you need to be able to debug at 2am without digging through undocumented microservice chains.
Question: For those running prod systems, how do you balance "dogfooding" (running on your own infra) vs "use boring, reliable stuff" (like AWS for your control plane)? Is there a middle ground?
For a startup, boring tech stacks are absolutely the correct choice. Using an opinionated framework like Django can be a very good idea for the same reason - it provides a structure which you just have to follow without too many decisions to make, and any new hires with experience in the framework can hit the ground running to a larger extent.
The only exception is if you are doing something where your tech stack is integral to your product (e.g. you need crazily high performance or scale or something from the get go).
Split out the front ends into separate services but leave the back end as a monolith, just try not to get logically separate parts too entangled so you have a decent chance of separating them later if and when "later" arrives
> If even AWS—with their resources and institutional knowledge—can struggle when senior folks leave, what chance do startups have?
My personal crusade is trying to convince anyone who will listen to me that the value of a senior employee in any role isn't in their years of experience. It's in their years of experience with your system. More companies should do whatever it takes to retain employees long term. No matter how good your documentation procedures are, no matter how thorough your knowledge base, no matter how many knowledge transfer sessions you have, your long-tenured employees know more than they will ever be able to document, and much of it is knowledge they've forgotten they even have.
I have never seen a team lose a (productive) long-standing member and not still be suffering from that loss years later. We like to pretend that software, and especially reusable libraries, components, and frameworks, make developers interchangeable widgets: substitutable cogs that can simply be swapped out given enough documentation. But writing software is converting a mental state into a physical one. And reading code is trying to read someone else's mind (incidentally, this is part of why two excellent developers can find themselves unable to work together: they just don't think enough alike). When you lose those senior people, you lose the person who was around for that one outage 8 years ago, and who, if they were around for the current outage, would have a memory float up from the forgotten depths that causes them to check on the thing no one (yet) suspects.
This isn't to say you shouldn't document things, and that you shouldn't work to spread knowledge around. The "bus factor" is real. People will retire. People will leave for reasons you have no control over. So document everything you can, build good knowledge transfer systems. But always remember that in the end, they will only cover a portion of the knowledge your employees have. And the longer your employees are around, the more knowledge they will have. And unless you are willing to pay your people to document everything they know full time (and I assure you, neither you nor your customers are willing to do that), they will always know more than is documented.
Nor despite my analogies does this apply only to software developers. One team I worked on lost a long term "technical writer" (that wasn't their full time work but it was a role they filled for that team). This person was the best person I've ever known at documenting things. And still the team reeled at losing them. Even with all the documentation they left behind, knowing what was documented, where and how recently and all sorts of metadata that wasn't captured in the documentation itself went with them when they left. Years later their documentation was still useful, but no one was anywhere near their level of encyclopedic knowledge of it all.
To use an analogy, the New York Yankees perform the best when they focus on building up their roster from their farm system. Long tenured players mentor the new ones, and a cohesive team forms that operates as more than the sum of its parts. They perform the worst when success and money take control and they try to hire "all-stars" away from other teams. The intangibles just aren't there, or aren't as strong. A good company is the same. A long term team built from the ground up is worth far more than a similarly "experienced" team made up of outside hires.
Talent exodus is concerning, so don't give your people a reason to leave. You can't stop it all, but you can do a lot to avoid self inflicted wounds.
I get to live this routinely in consulting projects, where management thinks teams work like football, swapping players mid-game and carrying on; naturally it isn't without all sorts of hiccups, as you well point out.
I think a lot of developers may not appreciate the fleshy human parts of organisations, given most of our efforts to date as an industry are typically to reduce headcount, but staff really are key to any successful organisation.
I had never heard of a control plane before this AWS outage. It sounds instructive without needing an explanation, but for a 1-2 person team, what does a control plane provide?
“Control plane” and “data plane” are very common terms. It’s just a way to think about responsibilities within a system, not necessarily its physical architecture.
I guess I’m asking about the functionality in particular for small teams.
Anyone using AWS isn’t concerned with the physical :)
> Most obviously, RCA has an infinite regress problem
Root cause analysis is just like any other tool. Failure to precisely define the nature of the problem is what usually turns RCA into a wild goose chase. Consider the following problem statements:
"The system feels yucky to use. I don't like it >:("
"POST /reports?option=A is slow around 3pm on Tuesdays"
One of these is more likely to provide a useful RCA that proceeds and halts in a reasonable way.
"AWS went down"
Is not a good starting point for a useful RCA session. "AWS" and "down" being the most volatile terms in this context. What parts of AWS? To what extent were they down? Is the outage intrinsic to each service or due to external factors like DNS?
"EC2 instances in region X became inaccessible to public internet users between Y & Z"
This is the grain I would build my PPTX around if I were working at AWS. You can determine that there was a common thread after the fact. Put it in your conclusion / next-steps slide. Starting hyper-specific means that you are less likely to get distracted and can arrive at a good answer much faster. Aggregating the conclusions of many reports, you could then prioritize the strategy for preventing this in the future.
> In normal operation of EC2, the DWFM maintains a large number (~10^6) of active leases against physical servers and a very small number (~10^2) of broken leases, the latter of which the manager is actively attempting to reestablish.
It sounds like fixing broken leases takes a lot more work than renewing functional leases? Funnily enough, there is already an AWS blog post about specifically preventing these types of issues: https://aws.amazon.com/builders-library/reliability-and-cons...
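In the spirit of that article, a sketch of draining the broken-lease backlog at a bounded, constant rate, so the recovery work looks the same whether the backlog holds ~10^2 entries (normal) or ~10^6 (after an outage); the names are invented:

```python
import time
from collections import deque

def reestablish_leases(broken, reestablish_one, per_second=50):
    """Work through broken leases at a fixed pace, independent of backlog size."""
    queue = deque(broken)
    interval = 1.0 / per_second
    while queue:
        lease = queue.popleft()
        if not reestablish_one(lease):
            queue.append(lease)   # failed: park it at the back and try again later
        time.sleep(interval)      # constant work rate, no thundering herd on recovery
```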
Legitimate question on whether the talent exodus from AWS is starting to take its toll. I’m talking about all the senior long-tenured folks jumping ship for greener pastures, not the layoffs this week, which mostly didn’t touch AWS (folks are saying that will happen in future rounds).
The fact that there was an outage is not unexpected… it happens… but all the stumbling and length to get things under control was concerning.
If you average it out over the last decade do we really have more outages now than before? Any complex system with lots of moving parts is bound to fail every so often.
It's the length of the outage that's striking. AWS us-east-1 has had a few serious outages in the last ~decade, but IIRC none took anywhere near 14 hours to resolve.
The horrible us-east-1 S3 outage of 2017[1] was around 5 hours.
1. https://aws.amazon.com/message/41926/
Couldn’t this be explained by natural growth of the amount of cloud resources/data under management?
The more you have, the faster the backlog grows in case of an outage, so you need longer to process it all once the system comes back online.
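Back-of-envelope with invented numbers: the backlog scales with the size of the fleet, so unless the drain headroom scales with it too, recovery time grows.

```python
# Invented numbers for illustration: work arrives at `arrival` items/s during
# and after the outage, and the recovered system drains it at `drain` items/s.
def recovery_time_hours(outage_hours, arrival, drain):
    backlog = outage_hours * 3600 * arrival
    assert drain > arrival, "no headroom over the arrival rate means never catching up"
    return backlog / (drain - arrival) / 3600

print(recovery_time_hours(3, arrival=1_000, drain=1_500))    # ~6 hours to catch up
print(recovery_time_hours(3, arrival=10_000, drain=10_500))  # same headroom, ~60 hours
```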
Not really. The issue was the time it took to correctly diagnose the problem, and then the cascading failures that resulted, which triggered more lengthy troubleshooting. Rightly or wrongly, it plays into the “the folks who knew best how all this works have left the building” vibes. Folks inside AWS say that’s not entirely inaccurate.
Corey Quinn wrote an interesting article addressing that question: https://www.theregister.com/2025/10/20/aws_outage_amazon_bra...
Some good information in the comments as well.
Hope not... Smooth tech that just runs is like the Maytag man.
A tech department running around with its hair on fire, always looking busy, isn't one that builds trust.
I can't imagine a more uncomfortable place to try and troubleshoot all this than in a hotel lobby surrounded by a dozen coworkers.
Easy: alone, struggling to contact coworkers (while mostly trying to diagnose the problem). I've done both (the alone state didn't last for hours because we did have emergency communication channels, and the hotel was a ski lodge in my case). Being surrounded by coworkers is much better.
That's assuming these are actual coworkers, not say a dozen escalating levels of micromanagers, which I agree would be hell. The where isn't really that important in my experience, as long as you've got reliable Internet access.
It wasn't too bad! The annoying bit was that the offsite schedule was delayed for hours for the other ~40 people not working on the issue.
It looks from the public writeup that the thing programming the DNS servers didn't acquire a lease on the server to prevent concurrent access to the same record set. I'd love to see the internal details on that COE.
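Not AWS's actual mechanism, but a generic sketch of the guard being described: a writer may only replace a record set if the version it read is still current, so an older, slower plan can't clobber a newer one.

```python
import threading

class RecordStore:
    """Toy record store with optimistic concurrency (illustrative only)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # name -> (version, addresses)

    def read(self, name):
        with self._lock:
            return self._records.get(name, (0, []))

    def conditional_write(self, name, expected_version, addresses):
        """Apply the new record set only if nobody has written since we read."""
        with self._lock:
            current_version, _ = self._records.get(name, (0, []))
            if current_version != expected_version:
                return False  # lost the race: the caller must re-read and re-plan
            self._records[name] = (current_version + 1, addresses)
            return True
```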
I think an extended outage exposes the shortcuts. If you have 100 systems, and one or two can't start fast from zero, and they're required to get back to running smoothly, then you're going to have a longer outage. How would you deal with that? You'd uniformly subject your teams to start-from-zero testing. I suspect, though, that many teams are staring down a scaling bottleneck, or at least were for much of Amazon's life, and so scaling issues (how do we handle 10x usage growth in the next year and a half, which are the soft spots that will break) trump cold-start testing. Then you get a cold-start event, the last one having been 5 years ago, and 1 or 2 of your 100 teams fall over, and it takes multiple hours, all hands on deck, to get them started.
> I'd love to see the internal details on that COE.
You'd be unpleasantly surprised, on that point the COE points to the public write-up for details.
Man, AWS is at it again. That big outage on Oct 20 was rough, and now (Oct 29) it looks like things are shaky again. SMH.
us-east-1 feels like a single point of failure for half the internet. People really need to look into multi-region, maybe even use AI like Zast.ai for intelligent failover so we're not all dead in the water when N. Virginia sneezes.
Tbh, for most companies/orgs the cost/complexity of multi region just isn't worth it.
The cost of a work days worth of downtime is rarely enough to justify the expense of trying to deploy across multiple regions or clouds.
Especially if you are public-facing and not internal. You just go 'well, everyone else was down too because of AWS' and your customers just go 'ah okay, fair enough'.
That sounds like engineering work and expense without a dollar sign attached to it, so maybe it’ll happen after all the product work (i.e. never.)