It was my first on-call shift since coming back from surgery. I was also onboarding a new person to be on-call, which is always a fun combo: you’re trying to look calm while quietly hoping nothing explodes.
On Wednesday night I went to bed early, around 9pm, trying to catch up on sleep. Of course, my “favorite” sound came from my phone: the pager app.
I really didn’t want to get up, so I did the lazy thing: checked which alarm had fired through the terrible app we have to use, saw it wasn’t important (and the alarm threshold was probably not calibrated, or maybe the alarm shouldn’t exist in the first place), bumped the threshold up, and went back to bed.
I couldn’t fall back asleep, so I did something useless like playing chess or watching something. I don’t even remember.
Eventually I fell asleep again, but the pager had a different plan for me.
Midnight pages and self-healing canaries
Around midnight, it went off again.
I was hoping it was the same alarm so I could just remove it from the parent alarm and go back to sleep. Nope. This time it was canaries failing (synthetic prod transactions that constantly hit your service to make sure it still works).
By the time I got to my computer, the alarm was already green.
Which is almost worse, because now you’re stuck doing the “it recovered, but why?” investigation. If you ignore it and it happens again, you’ll hate yourself. If you chase it and it was a one-off blip, you’ll also hate yourself, just for different reasons.
The canary transactions weren’t even starting. Interesting.
So I started digging.
The first clue was a lie
The errors in the service Lambda logs looked like this:
java.lang.IllegalStateException: Connection pool shut down
A “connection pool” is just a set of reusable HTTP connections. Instead of opening a brand new connection to AWS for every request, the SDK keeps a pool and reuses them. Faster and cheaper. Until it isn’t.
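For the curious, wiring up that pool looks something like this. A minimal sketch assuming the AWS SDK for Java v2 with its Apache HTTP client; the numbers are made up for illustration, not our real settings:

```java
import java.time.Duration;

import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;

public class S3ClientFactory {

    // Build an S3 client backed by the Apache HTTP client and its connection pool.
    // Every request borrows a connection from the pool instead of opening a new one.
    public static S3Client build() {
        return S3Client.builder()
                .httpClientBuilder(ApacheHttpClient.builder()
                        .maxConnections(500)                        // pool size (illustrative)
                        .connectionTimeout(Duration.ofSeconds(5))   // wait for a pooled connection
                        .socketTimeout(Duration.ofSeconds(30)))     // wait for response data
                .build();
    }
}
```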
If you see that error, your first instinct is probably the same as mine: “connection pool issue.” Limits. Too many concurrent connections. Timeouts. Something about the Apache HTTP client.
So I started looking in all the wrong places.
The weird part was that once a Lambda container started throwing this error, it kept throwing it for subsequent requests that landed on that same container.
At that time of night, the only “customers” were the one customer causing the page, plus our canaries. So it looked like everything was broken, even though we weren’t actually serving other real users right then. If this happened during the day, it could have spread wider.
And then I noticed something: earlier in the same container logs, we had OOM errors (out-of-memory, meaning the Java heap filled up and the process couldn’t allocate more objects).
What actually happened
Here’s the timeline I saw in a single Lambda container:
- 23:42–23:57, OOM errors while processing a large job
- 00:03:40, first “Connection pool shut down” error appears
- 00:03+, subsequent requests fail with the same error
That “connection pool shut down” message is a red herring. It can be the result of an OOM.
According to the AWS SDK docs, a java.lang.Error (including OutOfMemoryError) can cause the Apache HTTP connection pool to shut down, which can later surface as “Connection pool shut down” when the SDK tries to use it.
So the actual story was:
- Customer A uploads a massive workload
- Our Lambda container runs out of heap memory
- Somewhere in the SDK / HTTP client, the connection pool gets shut down
- Lambda keeps the container warm (Lambda reuses the same runtime/container for multiple invocations to avoid cold starts)
- The SDK client is a singleton, so it persists across invocations (created once, reused for every request in that container)
- Canaries (and any other traffic that happens to land on that container) hit the dead connection pool
- Everything fails until Lambda eventually recycles that container (in our case, ~30 minutes)
The part that annoyed me the most is that Lambda didn’t “just kill the container.”
In our case, the OOM happened in a background thread, not the main handler thread. The handler returned (possibly with an error), but Lambda considered the container eligible to keep warm. So it stayed alive and kept accepting traffic like nothing had happened.
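To make the “singleton client on a warm container” part concrete, here’s the usual shape of that pattern. This is a hypothetical sketch, not our actual handler; the class name and the request/response types are placeholders, assuming the standard aws-lambda-java-core RequestHandler interface and the SDK v2 S3Client:

```java
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.s3.S3Client;

public class ReportHandler implements RequestHandler<Map<String, Object>, String> {

    // Created once per container and reused across invocations (the usual pattern,
    // because client creation is expensive). The flip side: if the client's
    // connection pool is shut down by an OutOfMemoryError, it stays dead for every
    // later request that lands on this warm container.
    private static final S3Client S3 = S3Client.create();

    @Override
    public String handleRequest(Map<String, Object> input, Context context) {
        // ... kick off the report job, download files via S3, etc. ...
        return "ok";
    }
}
```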
Root cause: we bounded connections, not memory
Now for the fun part: why did we OOM?
This Lambda downloads thousands of files from S3 and then merges them into a report. It uses virtual threads (lightweight threads that make it easy to run lots of concurrent work) and a concurrency limit, so at a glance it looks “safe.”
The original logic was basically:
Download from S3 (limited to 500 concurrent) -> queue in memory -> merge (up to 20,000 queued)
That queue was literally holding byte arrays.
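A simplified reconstruction of that shape, assuming Java 21 virtual threads (the class, method names, and queue are hypothetical; the 500 and 20,000 limits are the ones from the post):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

public class ReportBuilderBefore {

    private static final int MAX_CONCURRENT_DOWNLOADS = 500;
    private static final int MAX_QUEUED_FOR_MERGE = 20_000;

    private final Semaphore downloadPermits = new Semaphore(MAX_CONCURRENT_DOWNLOADS);

    // Drained by a separate merge stage running concurrently (not shown).
    private final BlockingQueue<byte[]> mergeQueue = new LinkedBlockingQueue<>(MAX_QUEUED_FOR_MERGE);

    public void build(List<String> keys) {
        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String key : keys) {
                vt.submit(() -> {
                    downloadPermits.acquire();
                    byte[] bytes;
                    try {
                        bytes = download(key);     // stand-in for the real S3 GetObject
                    } finally {
                        downloadPermits.release(); // permit released here, but the bytes
                    }                              // live on in the queue below
                    mergeQueue.put(bytes);         // up to 20,000 byte[] waiting in memory
                    return null;
                });
            }
        }
    }

    private byte[] download(String key) {
        return new byte[0];
    }
}
```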
Worst case memory math looked like this:
- Up to 20,000 byte arrays in the queue
- Average file size around 400KB
- That’s around 8GB sitting in memory (plus overhead)
- Lambda memory was set to ~10GB
So for jobs with 25k+ files, we’d eventually exhaust the heap and trigger an OOM.
And then the connection pool would shut down. And then the container would poison itself. And then canaries would fail for a while.
The fix: hold the semaphore longer
The fix was boring, which is usually a good sign.
Before, we released the “download semaphore” immediately after the download, and then let items pile up in memory while they waited to be merged.
A semaphore is basically a counter that limits how many things can happen at once. If the counter is 300, only 300 tasks can proceed. Everyone else waits.
After, we held the semaphore through the merge step, so “in-flight downloads” also represented “in-flight memory” (there’s a code sketch after the lists below).
Before:
- Acquire download semaphore (limits S3 connections)
- Download file
- Release download semaphore (queue can grow)
- Acquire merge semaphore (queue can grow up to 20k)
- Merge
After:
- Acquire download semaphore (limits S3 connections and memory)
- Download file
- Merge
- Release download semaphore
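In code, the change looks roughly like this. Again a sketch under the same assumptions as the earlier one, not the actual implementation; the ~300 limit is the value mentioned below:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class ReportBuilderAfter {

    private static final int MAX_CONCURRENT_DOWNLOADS = 300;

    private final Semaphore downloadPermits = new Semaphore(MAX_CONCURRENT_DOWNLOADS);

    public void build(List<String> keys) {
        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String key : keys) {
                vt.submit(() -> {
                    downloadPermits.acquire();      // now bounds S3 connections AND memory
                    try {
                        byte[] bytes = download(key);
                        merge(bytes);               // bytes are consumed before the permit
                    } finally {                     // is released, so at most ~300 byte[]
                        downloadPermits.release();  // exist at any moment
                    }
                    return null;
                });
            }
        }
    }

    private byte[] download(String key) {
        return new byte[0]; // stand-in for the real S3 GetObject
    }

    private synchronized void merge(byte[] bytes) {
        // stand-in for appending this chunk to the report
    }
}
```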
Memory impact:
- Before: up to 20,000 byte arrays queued (around 8GB)
- After: bounded by MAX_CONCURRENT_DOWNLOADS (we used around 300), so memory stays on the order of hundreds of MB, depending on file size and overhead
Yes, throughput is lower. No, I don’t care. I like sleeping.
What I learned (again)
- “Connection pool shut down” can be an OOM symptom, not a connection pool problem.
- If a Lambda container gets “poisoned,” warm reuse can spread that failure to unrelated requests that land on the same container.
- Singleton SDK clients are great until they aren’t. If they die, they die for everyone in that container.
- With virtual threads, it’s easy to make something “concurrent” without realizing you also made it “unbounded memory.”
- Semaphores only help if you hold them through the entire lifecycle of the resource you’re trying to protect, not just the I/O.
Your turn
What’s the most misleading error message you’ve chased during on-call?
Cheers!
Evgeny Urubkov (@codevev)