It Wasn’t My Fault!


Well, the global AWS outage happened just four days after I sent a newsletter about COEs and how “nobody gets blamed.”

Great timing, right?

I wish I could’ve been in the weekly global ops meeting to see the temperature in the room. That’s the one where teams present their recent issues and learnings. I can only imagine how lively that one must’ve been.

Turns out the culprit was a DNS resolution failure for the Amazon DynamoDB endpoint in the us-east-1 region.

And while that sounds region-specific, it actually affected a bunch of global services - like IAM - because they depend on control-plane endpoints in that region.
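If you're curious what a failure like that looks like from the client side, here's a minimal sketch (not anything AWS actually runs). It just asks whether a hostname resolves at all. The hostname follows the public regional endpoint naming for DynamoDB; the function name `can_resolve` and the injectable `resolver` parameter are my own assumptions for illustration.

```python
import socket

# Public regional endpoint hostname for DynamoDB in us-east-1.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address.

    `resolver` is a hypothetical hook (not part of any AWS SDK) so the
    check can be exercised without touching the network.
    """
    try:
        # getaddrinfo raises socket.gaierror when DNS resolution fails.
        return len(resolver(hostname, 443)) > 0
    except socket.gaierror:
        return False
```

Calling `can_resolve(ENDPOINT)` on a normal day should come back True. When resolution fails, every client that depends on that name fails with it, no matter how healthy the service behind it is.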

I’ve talked before about availability zones and regional redundancy, but it looks like there was no escape from this one. Unless you’re running a multi-cloud app. But that’s too much for me to even think about right now.

Can’t wait to read the COE and see what “actually” happened once it’s published.

For the record - I had nothing to do with it!

I’m still recovering from ACL surgery, and between PT, doctor visits, and my wife’s surgery, I’ve had my own kind of incident response to deal with.

That said, I did come across a great post from one of the folks at Cline about their new CLI. The author built it so they could run multiple agents directly from the terminal - pretty cool if you’re into automation or agent frameworks. You can read it here:

👉 Cline CLI: My Undying Love of Cline Core

Maybe one day I’ll have an agent that can handle my on-call rotation…

Cheers!

Evgeny Urubkov (@codevev)

