A handful of daily-plan & content queries hold a pooled DB connection for 5–16 seconds each. They bleed all day; last night, the midnight day-rollover stampede tipped the pool over the edge into a full collapse. Here is the evidence, every query explained, and a validated fix plan.
Root cause · what happened today
Two facts up front, both from production logs: nothing was deployed today (last release 06-17), and the 500s have been elevated for over a week. What changed last night was not new code — the day-rollover stampede crossed the pool's breaking point.
Every prior midnight rollover (06-17: 77, 06-18: 48) stayed just under the pool's usable ceiling. Last night it didn't. At IST midnight, every active user's daily plan rolls over; the first GET /daily-plan/today triggers an ~11s generation query — and there is no idempotency lock, so multi-tab / multi-device users each launch it. That burst pushed concurrent slow queries past the ~55 usable connections, everything queued, the 15s checkout timeout fired, and clients retried — amplifying the load into a 567-error collapse.
It is a tipping point, not a new bug: chronic load + slowly growing query latency + one concentrated rollover burst = nonlinear failure. The fix removes the stampede entirely (idempotency lock + async generation), so the rollover can never collapse the pool again.
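The idempotency lock mentioned above can be sketched as follows. This is a minimal illustration, not the production implementation: `LockStore`, `InMemoryLockStore`, `requestPlanGeneration`, the key format, and the 60-second TTL are all hypothetical. In production the store would be backed by Redis, whose `SET key value NX PX ttl` command has the same set-if-absent semantics.

```typescript
// Hypothetical lock store. In production this role would be played by Redis
// (`SET key value NX PX ttl`); the in-memory version only shows the semantics.
interface LockStore {
  setIfAbsent(key: string, ttlMs: number): Promise<boolean>;
}

class InMemoryLockStore implements LockStore {
  private expiry: Record<string, number> = {};

  async setIfAbsent(key: string, ttlMs: number): Promise<boolean> {
    const now = Date.now();
    const current = this.expiry[key];
    if (current !== undefined && current > now) {
      return false; // lock is still held by an earlier request
    }
    this.expiry[key] = now + ttlMs;
    return true;
  }
}

type PlanStatus = "started" | "already_running";

// Only the first caller per (user, day) enqueues the slow generation job.
// Every other tab or device gets "already_running" and can poll for the result.
async function requestPlanGeneration(
  store: LockStore,
  userId: string,
  planDate: string,
  enqueue: (userId: string, planDate: string) => Promise<void>,
): Promise<PlanStatus> {
  const key = "daily-plan:gen:" + userId + ":" + planDate; // hypothetical key format
  const acquired = await store.setIfAbsent(key, 60000); // assumed TTL
  if (!acquired) {
    return "already_running";
  }
  await enqueue(userId, planDate);
  return "started";
}
```

With this guard, three tabs opening at midnight produce one generation job instead of three slow queries that each hold a connection.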
The failure, step by step
- /today fires the generation query, unguarded.
- /badges, /lessons, /streak, flashcards all fail with the same ECHECKOUTTIMEOUT, even though those queries are fast. They're collateral, not cause.

The core misconception
They don't — by volume. A transaction pooler assumes each query borrows a connection for ~10ms and returns it, so a few connections serve thousands of users. The problem is hold time, not request count.
One 10-second query occupies a connection as long as ~1,000 healthy requests would. So ~6 users hitting plan-generation in the same window is enough to drain the app's slots — DAU is irrelevant; p99 query latency is the lever.
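The arithmetic behind this can be written out with Little's law (average connections in use = arrival rate × average hold time). The request rates below are illustrative assumptions, not measurements; only the 10 ms, 10 s, and ~11 s hold times come from the text above.

```typescript
// Little's law: average connections in use = arrival rate x average hold time.
function connectionsInUse(requestsPerSecond: number, holdMs: number): number {
  return (requestsPerSecond * holdMs) / 1000;
}

// How many fast requests one slow query "costs" in connection time.
function equivalentFastRequests(slowHoldMs: number, fastHoldMs: number): number {
  return slowHoldMs / fastHoldMs;
}

// Assumed rates, chosen only to illustrate the scale difference.
const healthy = connectionsInUse(1000, 10);       // 1,000 req/s at 10 ms each -> 10 connections
const rollover = connectionsInUse(5, 11000);      // 5 req/s at ~11 s each     -> 55 connections
const ratio = equivalentFastRequests(10000, 10);  // one 10 s query            -> 1000 fast requests
```

Under these assumed rates, five plan generations per second need as many connections as a thousand healthy requests per second need five times over, which is why hold time rather than request count decides whether the pool survives.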
Evidence-validated · EXPLAIN + hypopg on prod
9 query families ranked by total DB time. Every latency and index was validated against production with real EXPLAIN (ANALYZE) plans and hypopg hypothetical-index testing.
staleTime: 0 + refetch-on-focus, so every screen focus re-fires them, turning cache misses into repeat cold queries. Adding staleTime + refetchOnWindowFocus:false is high-leverage, near-zero risk.
The hidden class you flagged
execute_sql RPC
Profiling the whole execute_sql class turned up a second, larger failure mode, and the reason “deck timeouts” never appeared in the API 500 logs. It also surfaced a security finding.
- execute_sql RPC sites across 6 live repos; every one builds SQL by string interpolation, none parameterized
- execute_sql entry; found only by reading code
- anon (3s timeout → cancelled, user sees a timeout) vs postgres (no timeout → holds a pooler connection 5–28s)

execute_sql RPC is SECURITY DEFINER, reached as the anon role, and every site interpolates raw values into the query string with no binding. Confirmed live vectors from unvalidated user input on POST /question/filter, GET /flashcards/search, POST /snippet/note/, GET /user/:id/past-tests. RLS + the 3s timeout are the only barriers; cross-table exfiltration and boolean/timing inference remain open. This should be triaged independently of performance.
execute_sql offenders

| Query / endpoint | Now | Failure mode | Primary fix |
|---|---|---|---|
| getPopulatedTestsForUser · GET /user/:id/past-tests | 27.8s | 4M-row cartesian fan-out, no pagination · pooler hold | page ids in a CTE before the joins; tests(user_id, created_at DESC) |
| getQuestionsWithSemanticSearch · Rezzy quiz-gen tool | 12s | seq scan on question_tags.tag_id | idx_question_tags_tag_id (highest-leverage index) |
| getFlashcardsForScheduleDeckConfig · cron (= Q3) | 12.7s | lossy BitmapAnd · pooler hold | covered by Q3 flashcard indexes + ANALYZE |
| fetchRelated{Pyqs,Questions}WithEmbedding · smartNote (bg) | ~10s | tag seq scan + embedding scan | idx_question_tags_tag_id; guard null filters |
| getFlashcardsBySearchCount · /flashcards/search/count | 8s | OR ILIKE '%term%' on 41k rows · anon 3s → timeout | drop ILIKE → use FTS, or trigram GIN |
| getSubjectsForPremadeDecksByUserId · /decks/premade/subjects | 6.5s | validation_status not indexed → 77k heap fetches · anon → deck timeout | idx_flashcards_owner_premade_valid (index-only) |
| getFlashcardsFromDeck · GET /deck/:id/flashcards | 5s | no LIMIT — returns the entire 13k-card deck · anon → deck timeout | paginate; covering index |
| topperDecksByExamIdsSql · /decks/topper, /decks/premade | 4.9s | per-card heap probe on system flashcards | idx_flashcards_sys_validated_id (partial) |
| getDeckStats · /decks/topper (per deck) | 120ms ×N | N+1 — one RPC per user deck via Promise.all → N concurrent pooler conns | batch: one WHERE deck_id = ANY($ids) GROUP BY |
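
The N+1 fix in the last row can be sketched like this. It is an illustration under stated assumptions: the `Db` interface, both function names, and the `count(*)::int AS cards` aggregate are hypothetical stand-ins, since the real getDeckStats query is not shown here. Only the `deck_id = ANY($ids) GROUP BY` shape comes from the table.

```typescript
// Hypothetical minimal DB interface; `query` stands in for whatever client the app uses.
interface DeckStatRow {
  deck_id: string;
  cards: number;
}

interface Db {
  query(text: string, values: unknown[]): Promise<DeckStatRow[]>;
}

// Before: one round trip, and one pooled connection, per deck.
async function deckStatsPerDeck(db: Db, deckIds: string[]): Promise<DeckStatRow[][]> {
  return Promise.all(
    deckIds.map((id) =>
      db.query(
        "SELECT deck_id, count(*)::int AS cards FROM deck_flashcards WHERE deck_id = $1 GROUP BY deck_id",
        [id],
      ),
    ),
  );
}

// After: a single round trip that covers every deck, with the ids bound as one array parameter.
async function deckStatsBatched(db: Db, deckIds: string[]): Promise<DeckStatRow[]> {
  return db.query(
    "SELECT deck_id, count(*)::int AS cards FROM deck_flashcards WHERE deck_id = ANY($1) GROUP BY deck_id",
    [deckIds],
  );
}
```

For a user with N decks, the batched form borrows one connection instead of N concurrent ones, so the cost no longer scales with deck count.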
- question_tags(tag_id): highest leverage; table has only a composite PK, so every tag filter seq-scans 119k rows (helps the 12s + two ~10s queries)
- flashcards(user_id, card_type, validation_status, subject_id, topic_id): index-only deck/flashcard browse aggregations
- flashcards(id) WHERE system+validated: partial; kills per-card heap probes (topper/exam decks)
- tests(user_id, created_at DESC): the page-first rewrite for all 3 past-test queries
- notes_meta(topic_id, note_id), notes(user_id, subject_id)

- execute_sql({query}) with $queryRaw + $N binds. One change fixes injection, restores pg_stat_statements visibility, and enables plan caching
- LIMIT to an id-only inner query before json_agg/GROUP BY (past-tests, question-filter, notes)
- questions.repository.v1.ts), VACUUM/ANALYZE stale tables

Consolidated remediation
Two indexes are shared across queries. All validated with hypopg. CREATE INDEX CONCURRENTLY holds no blocking lock but must run out-of-band (not via prisma migrate, which is transactional).
| Index | Table | Serves | Effect |
|---|---|---|---|
| idx_questions_topic_val_covering | questions | Q2 | index-only; −42% alone / −67% w/ rewrite |
| idx_question_tags_tag_question | question_tags | Q2 | kills 42MB external sort |
| idx_flashcards_user_topic_valid_unsusp | flashcards | Q3 | removes lossy BitmapAnd (~2.6s) |
| idx_deck_flashcards_flashcard_created | deck_flashcards | Q3 | MAX nested loop → index-only (~5.2s) |
| idx_test_questions_test_correct_created | test_questions | | index-only probe |
| idx_questions_id_incl_topic | questions | | enables Q4 rewrite index-only |
| idx_questions_topic_id_id | questions | Q4 | lessons→questions index-only |
| idx_lessons_subject_completion | lessons | Q8 | both variants index-only, −72% |
| idx_exercises_lesson_id_id | exercises | Q9 | removes cold heap reads |
| idx_test_templates_created_by_created_at | test_templates | Q5 | recent-template exclusion |
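
Because CREATE INDEX CONCURRENTLY cannot run inside a transaction block, the builds need a small out-of-band runner. The sketch below is one way to structure it. `Exec`, `IndexSpec`, `concurrentIndexSql`, and `buildIndexes` are hypothetical names, and the second index name is invented for illustration. Only `idx_question_tags_tag_id` and the two column lists appear in this report.

```typescript
// Hypothetical executor: runs ONE statement in autocommit mode (no surrounding BEGIN/COMMIT).
type Exec = (statement: string) => Promise<void>;

interface IndexSpec {
  name: string;
  table: string;
  columns: string; // column list exactly as it should appear inside the parentheses
}

function concurrentIndexSql(spec: IndexSpec): string {
  return (
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS " +
    spec.name + " ON " + spec.table + " (" + spec.columns + ")"
  );
}

// Builds indexes one statement at a time, never wrapped in a transaction.
async function buildIndexes(exec: Exec, specs: IndexSpec[]): Promise<string[]> {
  const done: string[] = [];
  for (const spec of specs) {
    // Sequential on purpose: a concurrent build scans the table twice,
    // so running them one by one keeps the extra load predictable.
    await exec(concurrentIndexSql(spec));
    done.push(spec.name);
  }
  return done;
}

const indexSpecs: IndexSpec[] = [
  { name: "idx_question_tags_tag_id", table: "question_tags", columns: "tag_id" },
  // Hypothetical index name; the column list is the one given for the past-tests rewrite.
  { name: "idx_tests_user_created", table: "tests", columns: "user_id, created_at DESC" },
];
```

One caveat worth planning for: a concurrent build that fails leaves an invalid index behind, and IF NOT EXISTS will skip over that leftover on a retry, so the invalid index has to be dropped first.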
Sequenced for safety
Ship tonight — no app release
- SET LOCAL per heavy query.
- ANALYZE flashcards; ANALYZE deck_flashcards; instantly fixes Q3's bad plan; VACUUM exercise_to_assets helps Q1. Zero risk.
- ON CONFLICT + Redis SET NX on /regenerate: directly kills last night's rollover stampede.
- staleTime + refetchOnWindowFocus:false: cuts cache-miss request volume across 6 queries.

Follow-up: needs review & tests
- execute_sql fixes.
- execute_sql + close injection. Move it off the anon 3s-timeout role, restore pg_stat_statements visibility, eliminate string interpolation.
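
To make the difference between interpolation and binding concrete, here is a minimal sketch. The `sql` tag and `BoundQuery` type are hypothetical helpers written for this illustration. They mimic what tagged-template raw-query APIs do (user values travel separately from the SQL text) but are not the Prisma implementation, and the query itself is an assumed shape for the past-tests lookup.

```typescript
// Hypothetical helper: turns a tagged template into SQL text with $N placeholders
// plus a separate values array, so user input never becomes part of the SQL string.
interface BoundQuery {
  text: string;
  values: unknown[];
}

function sql(strings: TemplateStringsArray, ...values: unknown[]): BoundQuery {
  let text = "";
  for (let i = 0; i < strings.length; i++) {
    text += strings[i];
    if (i < values.length) {
      text += "$" + (i + 1);
    }
  }
  return { text, values };
}

// What the current call sites effectively do: the input is spliced into the SQL itself.
function interpolated(userId: string): string {
  return "SELECT id FROM tests WHERE user_id = " + userId + " ORDER BY created_at DESC LIMIT 20";
}

const hostileInput = "1 OR 1=1; --";

// Unsafe: the attacker-controlled text is now part of the statement.
const unsafeText = interpolated(hostileInput);

// Bound: the statement text is constant; the input is carried as data.
const bound = sql`SELECT id FROM tests WHERE user_id = ${hostileInput} ORDER BY created_at DESC LIMIT ${20}`;
```

Because the bound statement text is identical for every caller, it also groups under a single pg_stat_statements entry and can reuse a cached plan, which is the visibility and plan-caching benefit noted above.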