My recovery after the ACL surgery took longer than expected, so I didn’t return to work until December, and even then I never worked a full week.
But now it’s the new year, and it’s time to get back to the grind.
If you subscribed recently, welcome! I write about AWS and engineering work the way it actually feels, plus the occasional lesson learned the hard way.
The most notable thing my team did while I was out was build a package with a bunch of “Skills” for Claude.
These have been ridiculously useful for debugging (most of the time), writing design docs, and then implementing them. Combined with MCP servers, it’s a powerful weapon.
So powerful, in fact, that I’ve barely had to write a single line of code manually since I got back. Now I can just have minions working in the background while I direct them and check their work.
I almost fired one of these minions when it told me it was stuck and then said it was giving up. I ended up figuring it out myself and just killed it instead (with Ctrl + C, of course).
Here are a couple of other interesting things I did, or lessons I learned, since getting back.
Consistent reads don’t apply to cross-region replication
A few months ago I wrote about DynamoDB consistent reads.
In short, DynamoDB keeps copies of each partition in multiple AZs inside a region, and strongly consistent reads give you read-after-write there.
What you can’t use consistent reads for is “I wrote in one region, so I should be able to read it immediately in another.”
We have a service with an API library that randomly chooses which region to query or write to. I discovered the hard way that if you query too soon after a write, the read can land in a region the write hasn’t replicated to yet.
In my case, an E2E test created a report and then the website tried to display it right away. It always worked with a 2-second delay, but not immediately. It drove me nuts until I realized what was actually happening.
So if you’re building a multi-region service, don’t randomly choose regions for reads unless you’re fine with eventual consistency. If you need read-your-writes, read and write from the same region.
What I ended up doing was adding a fallback query from the website if the first query returned nothing. Users don’t get a broken experience, and I didn’t have to redesign anything or risk breaking prod.
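A minimal sketch of that kind of fallback, not the actual implementation: it assumes the fallback means "ask another region if the first one came back empty". The names `get_report_with_fallback` and `RegionReader`, and the in-memory dicts standing in for regional tables, are all made up for illustration.

```python
from typing import Callable, Optional, Sequence

Report = dict
# A reader takes a report ID and returns the report, or None if that
# region has not received the item yet. (Hypothetical interface.)
RegionReader = Callable[[str], Optional[Report]]


def get_report_with_fallback(
    report_id: str,
    readers: Sequence[RegionReader],
) -> Optional[Report]:
    """Try each region in order and return the first non-empty result."""
    for read in readers:
        report = read(report_id)
        if report is not None:
            return report
    return None


# Simulate replication lag: the write landed in region A, region B is behind.
region_a = {"r-1": {"id": "r-1", "status": "READY"}}
region_b: dict = {}

# The first (randomly chosen) region misses, the fallback region hits.
print(get_report_with_fallback("r-1", [region_b.get, region_a.get]))
```

This only papers over the race on the read path. A report that truly doesn't exist still comes back empty, just after one extra query per region.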
Measure Seven Times, Cut Once
That’s a Russian proverb. The idea is that it’s better to spend time preparing than trying random things until something works.
This service used to generate reports on the backend when an event completed. Users couldn’t request them manually.
While I was gone, that changed. Now there’s a website that lets users request reports. What nobody spent time answering was the obvious question:
Why do reports always take 3+ minutes to generate?
I had a theory, but decided to measure first so I wouldn’t waste time on optimizations with no ROI.
I spent about 2 days onboarding this service and its dependencies, mostly so I could emit proper logs and metrics. Then I asked Claude to track down the issue. (Yes, I asked it before spending 2 days too. It had theories. It needed data.)
The issue turned out to be somewhere I hadn’t even considered.
We had a 3-minute delay on the SQS queue for report generation requests.
That delay existed for a reason. It ensured artifacts had time to upload from the onboarded services.
But it was useless for website-initiated requests.
I needed a way to keep the delay for “fresh” report requests, but skip it for older requests, without changing a bunch of API request/response shapes.
The website already refuses to generate reports whose end time is less than 10 minutes in the past. So I used that.
If the request end time is more than 10 minutes in the past, skip the delay. Otherwise keep it.
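The rule is small enough to sketch as one function. This is an illustration under assumptions, not the service's code: `delay_seconds_for`, `QUEUE_DELAY`, and `FRESHNESS_WINDOW` are hypothetical names, and end times are assumed to be timezone-aware UTC datetimes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

QUEUE_DELAY = timedelta(minutes=3)        # the existing delay for fresh requests
FRESHNESS_WINDOW = timedelta(minutes=10)  # same cutoff the website enforces


def delay_seconds_for(
    request_end_time: datetime,
    now: Optional[datetime] = None,
) -> int:
    """Return how long to delay a report generation request, in seconds."""
    now = now or datetime.now(timezone.utc)
    # Older than the window: artifacts have had time to upload, so skip the delay.
    if now - request_end_time > FRESHNESS_WINDOW:
        return 0
    # Fresh request: keep the delay so uploads can finish.
    return int(QUEUE_DELAY.total_seconds())


now = datetime.now(timezone.utc)
print(delay_seconds_for(now - timedelta(hours=1), now))    # old request
print(delay_seconds_for(now - timedelta(minutes=2), now))  # fresh request
```

One way to apply the result, assuming a standard (non-FIFO) queue, is to pass it as the per-message `DelaySeconds` when sending to SQS, which overrides the queue-level default for that message.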
I felt really smart. Especially after the smart guy on the team said it was clever.
New toys to try
Here are a few things AWS released not long ago that look worth exploring. I haven’t had time yet, so let me know if you try them.
- cdk refactor can help you preserve resources when refactoring CDK code. If you’ve used CDK long enough, you know that no matter how hard you try, you eventually end up with messy stacks you want to refactor but can’t, because logical IDs change and resources get destroyed and recreated. (Unless you do ugly magic with CFN outputs/imports, which I hate.)
- IAM Policy Autopilot MCP server. The number of times I’ve forgotten to add permission to read a GSI is embarrassing. Hopefully this helps.
- There are a bunch of other AWS MCP servers that might be useful. Hit reply if you find something that actually saved you time. I want to try it too.
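On the GSI permission from the list above: a policy that names only the table ARN doesn't cover queries against its indexes, because indexes have their own ARNs under `table/<name>/index/`. A rough sketch of a policy document that covers both, where `query_policy`, the account ID, and the `Reports` table name are placeholders:

```python
import json


def query_policy(table_arn: str) -> dict:
    """Build an IAM policy allowing Query on a table and all of its indexes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:Query"],
                # The second entry is the one that is easy to forget.
                "Resource": [table_arn, f"{table_arn}/index/*"],
            }
        ],
    }


# Placeholder ARN, for illustration only.
policy = query_policy("arn:aws:dynamodb:us-east-1:123456789012:table/Reports")
print(json.dumps(policy, indent=2))
```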
What’s one thing you measured recently that saved you from wasting a week optimizing the wrong problem?
P.S. If you have an Android phone and would like to help me test my app before it can be published to the Google Play Store, please let me know. All you have to do is download and open it :)
Cheers!
Evgeny Urubkov (@codevev)
Usual disclaimer: opinions are mine.