We were deleting history to avoid a DynamoDB limit


A couple of weeks ago I wrote about making our reports take a couple of seconds instead of 3 minutes. What I discovered later is that we didn’t actually have access to historical reports, because all the DynamoDB entries that pointed to the S3 data behind those reports had a TTL of one day. After asking around, I learned the reason was simple: some partition keys were exceeding 10 GB, which is DynamoDB’s item collection size limit (an “item collection” being all items with the same partition key). So the “solution” was to delete everything fast enough that we never hit the limit. Great.

So I created a task for myself to fix it properly. I started by asking AI for the best approaches, and it basically gave me “sharding” plus a bunch of nonsense that didn’t apply. Which, to be fair, is also what humans do when they don’t want to think about DynamoDB.


Sharding (what it actually means)

Sharding in DynamoDB is basically admitting that your partition key is doing too much work, then splitting it into multiple keys so DynamoDB can spread the data across more partitions.

Instead of:

  • PK = customerId

You do:

  • PK = customerId#shard_00
  • PK = customerId#shard_01
  • PK = customerId#shard_N

And you pick the shard deterministically, usually based on something like a hash, a time bucket, or both. The goal is that you don’t end up with one key that holds infinite data and gets hammered while the rest of the table sits there bored.
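
Here’s a minimal sketch of what “pick the shard deterministically” can look like, in TypeScript. The helper name, the shard count, and the idea of hashing the sort key are my own illustration, not our exact code:

  import { createHash } from "node:crypto";

  // Assumption: a fixed, known shard count shared by readers and writers.
  const SHARD_COUNT = 8;

  // Hypothetical helper: hash something stable about the item (here, its
  // sort key) so the same item always maps to the same shard.
  function shardedPartitionKey(customerId: string, sortKey: string): string {
    const firstByte = createHash("md5").update(sortKey).digest().readUInt8(0);
    const shard = firstByte % SHARD_COUNT;
    return `${customerId}#shard_${String(shard).padStart(2, "0")}`;
  }

  // shardedPartitionKey("cust_123", "2025-05-01T12:00:00Z#report_42")
  // -> something like "cust_123#shard_03"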

I wasn’t a fan of it at first because reads become “scatter-gather.” If you need a month of data and you shard by day, that’s 30 queries. If you shard by hash, you might need N queries to cover all shards. Either way, you’re paying with complexity and more round trips.


The “recommended” time-series approach (and why I didn’t do it)

Then I found AWS’s recommended time-series design.

It does make sense: you bucket time so one partition key doesn’t grow forever. But once you implement it, you still end up in “multiple queries” land, plus more glue logic around it (how to pick buckets, how to query ranges, how to backfill, how to handle edge cases, etc.).
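
For reference, the simplest version of that bucketing is just folding a time period into the partition key. A rough sketch with monthly buckets and made-up names:

  // Hypothetical time-bucketed key: one partition per customer per month.
  function timeBucketedKey(customerId: string, timestamp: Date): string {
    const month = timestamp.toISOString().slice(0, 7); // e.g. "2025-05"
    return `${customerId}#${month}`;
  }

  // "Give me the last 3 months" now means one Query per bucket:
  // cust_123#2025-03, cust_123#2025-04, cust_123#2025-05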

Also, depending on which version of the pattern you choose, you may end up needing scheduled workflows or background jobs to manage the buckets (create new ones ahead of time, clean up old ones). And of course those can fail. I didn’t want to introduce more “scheduler + automation + permissions + deploy” complexity unless I had to.

So I decided against it.


What I really wanted to do (Glue + Athena)

What I really wanted was Glue + Athena for this. It just makes more sense for historical/reporting-ish queries, especially when the data already lives in S3 and you can use partitions to restrict the amount of data scanned instead of trying to force DynamoDB to be a data lake.
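
For context, the appeal is that Athena only scans the S3 partitions a query actually touches. A rough sketch of the kind of query I had in mind, using the AWS SDK for JavaScript; the database, table, and partition column names are all made up:

  import { AthenaClient, StartQueryExecutionCommand } from "@aws-sdk/client-athena";

  const athena = new AthenaClient({});

  // The WHERE clause on partition columns (year/month here) is what keeps the
  // amount of scanned data small; everything else about this query is hypothetical.
  async function startMonthlyReportQuery() {
    return athena.send(
      new StartQueryExecutionCommand({
        QueryString: `
          SELECT report_id, created_at, total
          FROM reports
          WHERE year = '2025' AND month = '05'  -- partition columns: only these S3 prefixes get scanned
            AND customer_id = 'cust_123'
        `,
        QueryExecutionContext: { Database: "analytics" },
        ResultConfiguration: { OutputLocation: "s3://my-athena-results/" },
      })
    );
  }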

The problem is that our access patterns weren’t just “by primary key.” We also rely on GSIs, and once you go down the route of “let’s rebuild this access pattern on top of S3,” you’re basically signing up for a lot more infrastructure. And I’ve already suffered enough setting up permissions for Glue jobs in CDK that I avoid creating new Glue-based pipelines unless it’s absolutely necessary.

So I didn’t do that either.


So… I ended up with sharding

Yes, doing N queries for each time range is annoying. But we can parallelize it, and we can also put reasonable limits on what users can query (last month, last 3 months, etc.). And the actual implementation didn’t take long.
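
A minimal sketch of what “parallelize it” looks like with @aws-sdk/lib-dynamodb, reusing the shard helper idea from above. Table and attribute names are made up, and pagination is skipped for brevity:

  import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
  import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

  const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));
  const SHARD_COUNT = 8; // must match whatever the write side uses

  // Fire one Query per shard in parallel and merge the results.
  async function queryAllShards(customerId: string, fromIso: string, toIso: string) {
    const queries = Array.from({ length: SHARD_COUNT }, (_, shard) =>
      doc.send(
        new QueryCommand({
          TableName: "reports", // hypothetical table name
          KeyConditionExpression: "pk = :pk AND sk BETWEEN :from AND :to",
          ExpressionAttributeValues: {
            ":pk": `${customerId}#shard_${String(shard).padStart(2, "0")}`,
            ":from": fromIso,
            ":to": toIso,
          },
        })
      )
    );

    const results = await Promise.all(queries);
    return results.flatMap((r) => r.Items ?? []);
  }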

The trickiest part was the transition. I didn’t want to break anything, especially with TTL involved.

So I pushed a change with dual reads first (read from the old non-sharded key and the new sharded keys). After that was deployed everywhere, I pushed the change to start writing to the sharded structure. I haven’t been paged yet, so that’s a success.
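
The dual-read step looked roughly like this (a hand-wavy sketch; queryLegacyKey and queryAllShards stand in for our actual read paths):

  // Transition period: read both shapes, then dedupe on the sort key,
  // preferring the sharded copy when an item exists in both places.
  async function dualRead(customerId: string, fromIso: string, toIso: string) {
    const [shardedItems, legacyItems] = await Promise.all([
      queryAllShards(customerId, fromIso, toIso), // new PK = customerId#shard_NN
      queryLegacyKey(customerId, fromIso, toIso), // old PK = customerId
    ]);

    const bySortKey = new Map<string, Record<string, unknown>>();
    for (const item of [...shardedItems, ...legacyItems]) {
      const sk = item.sk as string;
      if (!bySortKey.has(sk)) bySortKey.set(sk, item);
    }
    return [...bySortKey.values()];
  }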

Next step is deleting the non-sharded reads, but I’m waiting a few days to make sure TTL expires the old items and we’re not depending on them anywhere. The hard part is behind me.

Your turn

What DynamoDB limitations have you run into that forced you to redesign something?

Cheers!

Evgeny Urubkov (@codevev)
