Missed a few months. Here’s the catch-up.


My recovery after the ACL surgery took longer than expected, so I didn’t return to work until December, and even then I never worked a full week.

But now it’s the new year and we have to start on this grind again.

If you subscribed recently, welcome! I write about AWS and engineering work the way it actually feels, plus the occasional lesson learned the hard way.

The most notable thing my team did while I was out was build a package with a bunch of “Skills” for Claude.

This has been ridiculously useful for debugging (most of the time), creating design docs, and then implementing. Combined with MCP servers, it’s a powerful weapon.

So powerful, in fact, that I’ve barely had to write a single line of code manually since I got back. Now I can just have minions working in the background while I direct them and check their work.

I almost fired one of these minions when it told me it got stuck, then said it gives up. I ended up figuring it out myself and just killed him instead (with Ctrl + C, of course).

Here are a couple other interesting things I did, or lessons I learned, since getting back.


Consistent reads don’t apply to cross-region replication

A few months ago I wrote about DynamoDB consistent reads.

In short, DynamoDB partitions your data and replicates each partition across AZs inside a region, and consistent reads can give you read-after-write there.

What you can’t use consistent reads for is “I wrote in one region, so I should be able to read it immediately in another.”
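
If it helps to see it in code: a minimal sketch with the AWS SDK v3 for JavaScript, where the table name, key, and region are all made up.

```
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(
  new DynamoDBClient({ region: "us-east-1" })
);

// ConsistentRead: true gives read-after-write within us-east-1 only.
// If the write landed on a global table replica in another region,
// this flag does nothing for you: cross-region replication is still async.
const { Item } = await ddb.send(
  new GetCommand({
    TableName: "Reports",            // hypothetical table
    Key: { reportId: "report-123" }, // hypothetical key
    ConsistentRead: true,
  })
);
```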

We have a service with an API library that randomly chooses which region to query or write to. I discovered the hard way that if you query too fast after a write, you can lose a race with replication.

In my case, an E2E test created a report and then the website tried to display it right away. It always worked with a 2-second delay, but not immediately. It drove me nuts until I realized what was actually happening.

So if you’re building a multi-region service, don’t randomly choose regions for reads unless you’re fine with eventual consistency. If you need read-your-writes, read and write from the same region.

What I ended up doing was adding a fallback query from the website if the first query returned nothing. Users don’t get a broken experience, and I didn’t have to redesign anything or risk breaking prod.
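
For the curious, the fallback is nothing fancy. A rough sketch of the website-side logic, where fetchReports is a hypothetical stand-in for the API library call that picks a region at random:

```
type Report = { reportId: string; url: string };

async function getReportWithFallback(
  reportId: string,
  fetchReports: (id: string) => Promise<Report[]>
): Promise<Report[]> {
  const first = await fetchReports(reportId);
  if (first.length > 0) return first;

  // Empty result: we may have lost the race with cross-region replication,
  // so ask once more before telling the user the report doesn't exist.
  return fetchReports(reportId);
}
```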


Measure Seven Times, Cut Once

That’s a Russian proverb. The idea is that it’s better to spend time preparing than trying random things until something works.

This service used to generate reports on the backend when an event completed. Users couldn’t request them manually.

While I was gone, that changed. Now there’s a website that lets users request reports. What nobody spent time answering was the obvious question:

Why do reports always take 3+ minutes to generate?

I had a theory, but decided to measure first so I wouldn’t waste time on optimizations with no ROI.

I spent about 2 days onboarding this service and its dependencies, mostly so I could emit proper logs and metrics. Then I asked Claude to track down the issue. (Yes, I asked it before spending 2 days too. It had theories. It needed data.)

The issue ended up being in a totally different place from anywhere I had imagined.

We had a 3-minute delay on the SQS queue for report generation requests.

That delay existed for a reason. It ensured artifacts had time to upload from the onboarded services.

But it was useless for website-initiated requests.

I needed a way to keep the delay for “fresh” report requests, but skip it for older requests, without changing a bunch of API request/response shapes.

The website already doesn’t allow generating reports with an end date within the last 10 minutes. So I used that.

If the request end time is more than 10 minutes in the past, skip the delay. Otherwise keep it.
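
One way to wire that up, assuming a standard (non-FIFO) queue where a per-message DelaySeconds overrides the queue’s default delay; the queue URL and field names here are invented:

```
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const TEN_MINUTES_MS = 10 * 60 * 1000;

async function enqueueReportRequest(request: { endTime: number; body: string }) {
  // "Fresh" report: the window just closed, so keep the 3-minute delay to
  // give artifacts time to upload. Older window: no point in waiting.
  const isFresh = Date.now() - request.endTime < TEN_MINUTES_MS;

  await sqs.send(
    new SendMessageCommand({
      QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/report-requests", // hypothetical
      MessageBody: request.body,
      DelaySeconds: isFresh ? 180 : 0,
    })
  );
}
```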

I felt really smart. Especially after the smart guy on the team said it was clever.


New toys to try

Here are a few things AWS released not long ago that look worth exploring. I haven’t had time yet, so let me know if you try them.

  • cdk refactor can help you preserve resources when refactoring CDK code. If you’ve used CDK long enough, you know that no matter how hard you try, you eventually end up with messy stacks you want to refactor but can’t, because logical IDs change and resources get destroyed and recreated. (Unless you do ugly magic with CFN outputs/imports, which I hate.)
  • IAM Policy Autopilot MCP server. The number of times I’ve forgotten to add permission to query a GSI is embarrassing. Hopefully this helps. (There’s a quick sketch of the permission I mean right after this list.)
  • There are a bunch of other AWS MCP servers that might be useful. Hit reply if you find something that actually saved you time. I want to try it too.
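
Since I brought up the GSI thing: here’s the kind of statement I keep forgetting, sketched in CDK (TypeScript). Table and index names are made up; the point is that querying a GSI needs the index ARN, not just the table ARN.

```
import * as iam from "aws-cdk-lib/aws-iam";

const readReports = new iam.PolicyStatement({
  actions: ["dynamodb:Query"],
  resources: [
    "arn:aws:dynamodb:us-east-1:123456789012:table/Reports",
    "arn:aws:dynamodb:us-east-1:123456789012:table/Reports/index/byEndTime", // the part I forget
  ],
});
```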

What’s one thing you measured recently that saved you from wasting a week optimizing the wrong problem?

P.S. If you have an Android phone and would like to help me test my app before it can be published to the Google Play Store, please let me know. All you’ll have to do is download and open it :)

Cheers!

Evgeny Urubkov (@codevev)

Usual disclaimer: opinions are mine.
