INCIDENT · ACTIVE / oncourse-api · production / ap-south-1 / RCA v2 · 2026-06-20

The 500s are a database connection-pool collapse — not traffic.

A handful of daily-plan & content queries hold a pooled DB connection for 5–16 seconds each. They bleed all day; last night, the midnight day-rollover stampede tipped the pool over the edge into a full collapse. Here is the evidence, every query explained, and a validated fix plan.

  • 500s in the spike hour: 567
  • vs hourly baseline: ~17×
  • Worst query hold time: 5–16s
  • Slow queries found & fixed: 38

Root cause · what happened today

Chronic for 10 days — then last night it collapsed

Two facts up front, both from production logs: nothing was deployed today (last release 06-17), and the 500s have been elevated for over a week. What changed last night was not new code — the day-rollover stampede crossed the pool's breaking point.

  • Last deploy: 2026-06-17 · 3 days ago
  • Daily-plan/timezone fix: #907 · 06-16
  • Spike hour: 06-19 18:00 UTC ≈ midnight IST
  • 99% of that hour's requests were 500s
[Chart: 500s per day, chronic & rising. Source: Loki · production · count_over_time[1d]]
[Chart: Hourly 500s, "the cardiac event". 06-17 → 06-19 · per hour · IST midnight = 18:00–19:00 UTC. Series: hourly 500s. Annotations: day-rollover stampede; prior rollovers (48, 77)]

Verdict — why last night specifically

Every prior midnight rollover (06-17: 77, 06-18: 48) stayed just under the pool's usable ceiling. Last night it didn't. At IST midnight, every active user's daily plan rolls over; the first GET /daily-plan/today triggers an ~11s generation query — and there is no idempotency lock, so multi-tab / multi-device users each launch it. That burst pushed concurrent slow queries past the ~55 usable connections, everything queued, the 15s checkout timeout fired, and clients retried — amplifying the load into a 567-error collapse.

It is a tipping point, not a new bug: chronic load + slowly growing query latency + one concentrated rollover burst = nonlinear failure. The fix removes the stampede entirely (idempotency lock + async generation), so the rollover can never collapse the pool again.
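
As a rough illustration of the idempotency lock mentioned above, the sketch below uses an in-memory map as a stand-in for a Redis-style "set if absent, with expiry" operation. The class and function names, the key format, and the 30-second TTL are all assumptions made for this example, not the service's actual code. The point is only that every tab and device of one user maps to the same key, so a single caller wins the right to run the ~11s generation query.

```typescript
// In-memory stand-in for a Redis-style "SET key value NX PX ttl" lock.
// Hypothetical names: InMemoryPlanLockStore, tryStartGeneration.
class InMemoryPlanLockStore {
  private expiresAt = new Map<string, number>();

  // Returns true only for the first caller per key within the TTL window.
  setIfAbsent(key: string, ttlMs: number, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(key);
    if (expiry !== undefined && expiry > now) return false; // lock still held
    this.expiresAt.set(key, now + ttlMs);
    return true;
  }
}

type GenerationDecision = "start-generation" | "already-in-progress";

function tryStartGeneration(
  store: InMemoryPlanLockStore,
  userId: string,
  planDate: string, // the user's local day, e.g. "2026-06-20"
  now: number = Date.now(),
): GenerationDecision {
  // One lock per (user, day): all tabs and devices share the same key.
  const key = `daily-plan:${userId}:${planDate}`;
  // Assumed TTL: longer than the ~11s query, short enough to self-heal after a crash.
  const ttlMs = 30_000;
  return store.setIfAbsent(key, ttlMs, now) ? "start-generation" : "already-in-progress";
}
```

Callers that receive "already-in-progress" would wait or poll rather than launch a second generation query. A production version would need the lock to live in shared storage, since the in-memory map only deduplicates within one process.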

The failure, step by step

How one slow query takes down every endpoint

  1. 00:00 IST · Rollover: all users' daily plans expire at once.
  2. No lock · N × 11s: each first /today fires the generation query, unguarded.
  3. Pool · ~55 → 0: slow holds drain every usable connection.
  4. 15s · ECHECKOUTTIMEOUT: unrelated requests can't get a connection → 500.
  5. Client · retry: the app retries failed calls → more demand.
  6. Result · 567 / hr: amplifying feedback loop = collapse.

Why it hits everything. The pool is shared. Once the daily-plan queries drain it, /badges, /lessons, /streak, and flashcards all fail with the same ECHECKOUTTIMEOUT, even though those queries are fast. They're collateral, not cause.

The core misconception

“How do 150 DAU exhaust 90 connections?”

They don't — by volume. A transaction pooler assumes each query borrows a connection for ~10ms and returns it, so a few connections serve thousands of users. The problem is hold time, not request count.

[Chart: Connection occupancy, 1 query = how many “healthy” requests' worth. Healthy query ~10ms; Q1 ~11,000ms ≈ 1,100×]
[Chart: The 90-connection budget (Small instance · shared). System (~34); app usable (~55); ~6 slow queries drain it]

One 10-second query occupies a connection as long as ~1,000 healthy requests would. So ~6 users hitting plan-generation in the same window is enough to drain the app's slots — DAU is irrelevant; p99 query latency is the lever.
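
The hold-time arithmetic can be checked with a few lines. This is a simplified steady-state model (Little's law: connections in use = arrival rate × hold time) using the figures quoted in this report. It ignores per-request fan-out and retries, and the function names are illustrative only.

```typescript
// Figures taken from the report.
const HEALTHY_QUERY_MS = 10;      // ~10ms per healthy query
const SLOW_QUERY_MS = 11_000;     // Q1 holds a connection for ~11,000ms
const USABLE_CONNECTIONS = 55;    // ~55 app-usable pool slots

// How many healthy requests one slow query is "worth" in connection time.
function healthyEquivalents(slowMs: number, healthyMs: number): number {
  return slowMs / healthyMs;
}

// Little's law: average connections in use = arrivals per second × hold time in seconds.
function connectionsInUse(arrivalsPerSecond: number, holdMs: number): number {
  return (arrivalsPerSecond * holdMs) / 1000;
}

// Arrival rate (queries per second) at which a given hold time fills the pool.
function saturatingArrivalRate(connections: number, holdMs: number): number {
  return (connections * 1000) / holdMs;
}

console.log(healthyEquivalents(SLOW_QUERY_MS, HEALTHY_QUERY_MS));         // 1100
console.log(saturatingArrivalRate(USABLE_CONNECTIONS, SLOW_QUERY_MS));    // 5
console.log(saturatingArrivalRate(USABLE_CONNECTIONS, HEALTHY_QUERY_MS)); // 5500
```

Under this model the same ~55 slots absorb about 5,500 healthy queries per second but only about 5 slow-query starts per second, which is why latency rather than user count decides whether the pool survives.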

Evidence-validated · EXPLAIN + hypopg on prod

Every query, root cause & fix

9 query families ranked by total DB time. Each bar shows the measured hold time today vs after the fix. Every latency and index was validated against production with real EXPLAIN (ANALYZE) plans and hypopg hypothetical-index testing.

Cross-cutting amplifier. Q2, Q5, Q6, Q7, Q8, Q9 are called from the mobile app with staleTime: 0 + refetch-on-focus — every screen focus re-fires them, turning cache misses into repeat cold queries. Adding staleTime + refetchOnWindowFocus:false is high-leverage, near-zero risk.
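
A minimal sketch of the caching change, assuming the app's data layer accepts the two option names quoted above (they match TanStack Query's option names, but the library is not named in this report). The five-minute freshness window and the helper name are assumed values for illustration and would need tuning per endpoint.

```typescript
// Local type mirroring the two option names quoted in the report.
type CachedQueryOptions = {
  staleTime: number;             // ms during which cached data counts as fresh
  refetchOnWindowFocus: boolean; // whether regaining focus triggers a refetch
};

// Assumed value: five minutes of freshness for slow-changing content.
const FIVE_MINUTES_MS = 5 * 60 * 1000;

const cachedQueryDefaults: CachedQueryOptions = {
  staleTime: FIVE_MINUTES_MS,
  refetchOnWindowFocus: false,
};

// Per-query override for data that changes faster (hypothetical helper).
function withStaleTime(staleTimeMs: number): CachedQueryOptions {
  return { ...cachedQueryDefaults, staleTime: staleTimeMs };
}
```

If the app does use TanStack Query, these defaults would typically be passed as `new QueryClient({ defaultOptions: { queries: cachedQueryDefaults } })`, so that a screen focus inside the freshness window serves cached data instead of re-firing Q2 and Q5–Q9.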

The hidden class you flagged

Deck & flashcard queries via the execute_sql RPC

Profiling the whole execute_sql class turned up a second, larger failure mode — and the reason “deck timeouts” never appeared in the API 500 logs. It also surfaced a security finding.

  • 40+ execute_sql RPC sites across 6 live repos: every one builds SQL by string interpolation, none parameterized
  • 29 are slow (>150ms or 3s-timeout risk): invisible to slow-query dashboards because literals collapse under one execute_sql entry; found only by reading code
  • 2 roles: anon (3s timeout → cancelled, user sees a timeout) vs postgres (no timeout → holds a pooler connection 5–28s)

⚠ Security — SQL injection (HIGH/CRITICAL). The execute_sql RPC is SECURITY DEFINER, reached as the anon role, and every site interpolates raw values into the query string with no binding. Confirmed live vectors from unvalidated user input on POST /question/filter, GET /flashcards/search, POST /snippet/note/, GET /user/:id/past-tests. RLS + the 3s timeout are the only barriers; cross-table exfiltration and boolean/timing inference remain open. This should be triaged independently of performance.
Worst execute_sql offenders (prod EXPLAIN · once-off · off-peak)

| Query / endpoint | Now | Failure mode | Primary fix |
| --- | --- | --- | --- |
| getPopulatedTestsForUser · GET /user/:id/past-tests | 27.8s | 4M-row cartesian fan-out, no pagination · pooler hold | page ids in a CTE before the joins; tests(user_id, created_at DESC) |
| getQuestionsWithSemanticSearch · Rezzy quiz-gen tool | 12s | seq scan on question_tags.tag_id | idx_question_tags_tag_id (highest-leverage index) |
| getFlashcardsForScheduleDeckConfig · cron (= Q3) | 12.7s | lossy BitmapAnd · pooler hold | covered by Q3 flashcard indexes + ANALYZE |
| fetchRelated{Pyqs,Questions}WithEmbedding · smartNote (bg) | ~10s | tag seq scan + embedding scan | idx_question_tags_tag_id; guard null filters |
| getFlashcardsBySearchCount · /flashcards/search/count | 8s | OR ILIKE '%term%' on 41k rows · anon 3s → timeout | drop ILIKE → use FTS, or trigram GIN |
| getSubjectsForPremadeDecksByUserId · /decks/premade/subjects | 6.5s | validation_status not indexed → 77k heap fetches · anon → deck timeout | idx_flashcards_owner_premade_valid (index-only) |
| getFlashcardsFromDeck · GET /deck/:id/flashcards | 5s | no LIMIT: returns the entire 13k-card deck · anon → deck timeout | paginate; covering index |
| topperDecksByExamIdsSql · /decks/topper, /decks/premade | 4.9s | per-card heap probe on system flashcards | idx_flashcards_sys_validated_id (partial) |
| getDeckStats · /decks/topper (per deck) | 120ms ×N | N+1: one RPC per user deck via Promise.all → N concurrent pooler conns | batch: one WHERE deck_id = ANY($ids) GROUP BY |

…plus high_yield list (3.1s), flashcard search (3.9s), embedding prompt (4.5s), question-filter count (1.45s), paused-tests (1.27s), notes count (0.95s), and more: 29 in total.
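
The getDeckStats batching fix could look roughly like the sketch below: one round trip for all decks instead of one RPC per deck. The `RunQuery` type is a hypothetical stand-in for whatever database client the service uses, and the `deck_flashcards.deck_id` column and `card_count` alias are assumptions for illustration. Depending on the driver, the array parameter may also need an explicit cast.

```typescript
type DeckStatsRow = { deck_id: string; card_count: number };

// Hypothetical query-runner shape; not the service's real client.
type RunQuery = (text: string, values: unknown[]) => Promise<DeckStatsRow[]>;

async function getDeckStatsBatched(
  runQuery: RunQuery,
  deckIds: string[],
): Promise<Map<string, number>> {
  const stats = new Map<string, number>();
  // Decks with no rows in the result still get an explicit zero.
  for (const id of deckIds) stats.set(id, 0);
  if (deckIds.length === 0) return stats; // nothing to ask the database

  // One bound array parameter replaces N separate queries.
  const rows = await runQuery(
    `SELECT deck_id, COUNT(*)::int AS card_count
       FROM deck_flashcards
      WHERE deck_id = ANY($1)
      GROUP BY deck_id`,
    [deckIds],
  );
  for (const row of rows) stats.set(row.deck_id, row.card_count);
  return stats;
}
```

The batched form holds one pooler connection for one query, where the Promise.all version holds N connections at once.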

New indexes (distinct from the 11)

  • question_tags(tag_id): highest leverage; table has only a composite PK, so every tag filter seq-scans 119k rows (helps the 12s + two ~10s queries)
  • flashcards(user_id, card_type, validation_status, subject_id, topic_id) — index-only deck/flashcard browse aggregations
  • flashcards(id) WHERE system+validated — partial; kills per-card heap probes (topper/exam decks)
  • tests(user_id, created_at DESC) — the page-first rewrite for all 3 past-test queries
  • notes_meta(topic_id, note_id), notes(user_id, subject_id)
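
For the out-of-band index build, a small generator like the one below can render the statements so that each is run on its own, outside any transaction. The index name for `tests` is hypothetical, and only the two indexes whose columns are spelled out in this list are shown. This is a sketch, not the migration tooling actually in use.

```typescript
type IndexSpec = { name: string; table: string; columns: string };

// Conservative identifier check so names cannot smuggle in extra SQL.
const SQL_IDENTIFIER = /^[a-z_][a-z0-9_]*$/;

function createIndexConcurrently(spec: IndexSpec): string {
  if (!SQL_IDENTIFIER.test(spec.name) || !SQL_IDENTIFIER.test(spec.table)) {
    throw new Error(`invalid identifier in index spec: ${spec.name}`);
  }
  // CONCURRENTLY cannot run inside a transaction block, so each
  // statement must be executed individually on a direct connection.
  return `CREATE INDEX CONCURRENTLY IF NOT EXISTS ${spec.name} ON ${spec.table} (${spec.columns});`;
}

const newIndexSpecs: IndexSpec[] = [
  { name: "idx_question_tags_tag_id", table: "question_tags", columns: "tag_id" },
  // Hypothetical name; the report only gives the table and columns.
  { name: "idx_tests_user_created_at", table: "tests", columns: "user_id, created_at DESC" },
];

for (const spec of newIndexSpecs) console.log(createIndexConcurrently(spec));
```

One caveat worth checking before relying on IF NOT EXISTS: a concurrent build that fails leaves an invalid index behind under the same name, which has to be dropped before retrying.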

Systemic fixes

  • Parameterize the whole class — replace execute_sql({query}) with $queryRaw + $N binds. One change fixes injection, restores pg_stat_statements visibility, and enables plan caching
  • Input allowlists — UUID-validate ids; enum/column allowlists for ORDER BY, cardType, status, etc.
  • Kill aggregate-then-paginate — push LIMIT to an id-only inner query before json_agg/GROUP BY (past-tests, question-filter, notes)
  • Eliminate N+1s — batch per-deck/per-row RPCs into one query
  • Delete dead code (questions.repository.v1.ts), VACUUM/ANALYZE stale tables
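
To make the first three fixes concrete, here is a sketch of a parameterized, page-first query builder for the past-tests endpoint. Values are bound as `$1`/`$2`, the id is UUID-validated, and ORDER BY goes through an allowlist because identifiers cannot be bound as parameters. The function name, sort keys, page-size cap, and the `id` column are assumptions made for illustration.

```typescript
type ParameterizedQuery = { text: string; values: unknown[] };

const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Allowlist: user input selects a key, never the SQL fragment itself.
// A Map is used so inherited object properties cannot match.
const PAST_TESTS_SORTS = new Map<string, string>([
  ["newest", "created_at DESC"],
  ["oldest", "created_at ASC"],
]);

function buildPastTestsPageQuery(
  userId: string,
  sort: string,
  pageSize: number,
): ParameterizedQuery {
  if (!UUID_PATTERN.test(userId)) throw new Error("userId must be a UUID");
  const orderBy = PAST_TESTS_SORTS.get(sort);
  if (orderBy === undefined) throw new Error(`unsupported sort: ${sort}`);
  if (!Number.isFinite(pageSize)) throw new Error("pageSize must be a number");
  // Assumed cap of 100 rows per page.
  const limit = Math.min(Math.max(Math.trunc(pageSize), 1), 100);

  // Id-only page first; joins and json_agg would run against this small set.
  return {
    text: `SELECT id FROM tests WHERE user_id = $1 ORDER BY ${orderBy} LIMIT $2`,
    values: [userId, limit],
  };
}
```

With Prisma, a text/values pair like this could be passed as `prisma.$queryRawUnsafe(text, ...values)`, which binds positional parameters. Because the query text no longer contains literals, every call shares one pg_stat_statements entry per query shape.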

Consolidated remediation

11 indexes (deduped), rewrites & guardrails

Two indexes are shared across queries. All validated with hypopg. CREATE INDEX CONCURRENTLY does not block writes to the table, but it cannot run inside a transaction block, so it must run out-of-band (not via prisma migrate, which is transactional).

| Index | Table | Serves | Effect |
| --- | --- | --- | --- |
| idx_questions_topic_val_covering | questions | Q2 | index-only; −42% alone / −67% w/ rewrite |
| idx_question_tags_tag_question | question_tags | Q2 | kills 42MB external sort |
| idx_flashcards_user_topic_valid_unsusp | flashcards | Q3 | removes lossy BitmapAnd (~2.6s) |
| idx_deck_flashcards_flashcard_created | deck_flashcards | Q3 | MAX nested loop → index-only (~5.2s) |
| idx_test_questions_test_correct_created | test_questions | Q7+Q4 | index-only probe |
| idx_questions_id_incl_topic | questions | Q1+Q4 | enables Q4 rewrite index-only |
| idx_questions_topic_id_id | questions | Q4 | lessons→questions index-only |
| idx_lessons_subject_completion | lessons | Q8 | both variants index-only, −72% |
| idx_exercises_lesson_id_id | exercises | Q9 | removes cold heap reads |
| idx_test_templates_created_by_created_at | test_templates | Q5 | recent-template exclusion |

Sequenced for safety

Stop the bleeding, then fix the structure

Ship tonight — no app release

  1. statement_timeout circuit-breaker. Cap the app role at ~8s so a 12–16s query can't pin a connection. Must verify it's honored through the :6543 pooler; if not, set it app-side via SET LOCAL per heavy query.
  2. ANALYZE + VACUUM. Running ANALYZE flashcards; and ANALYZE deck_flashcards; refreshes the planner statistics and immediately fixes Q3's bad plan. VACUUM exercise_to_assets helps Q1. Zero risk.
  3. Idempotency lock on plan generation. Advisory lock / ON CONFLICT + Redis SET NX on /regenerate — directly kills last night's rollover stampede.
  4. Build the indexes CONCURRENTLY, off-peak, on the direct connection — CRITICAL/HIGH first.
  5. Frontend caching (OTA). staleTime + refetchOnWindowFocus:false — cuts cache-miss request volume across 6 queries.
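
If the role-level statement_timeout from step 1 turns out not to be honored through the pooler, the app-side fallback could be sketched as below. `SqlSession` is a hypothetical minimal interface standing in for a connection already checked out for one transaction. SET LOCAL only lasts until the end of the surrounding transaction, which is why the wrapper opens one.

```typescript
// Hypothetical minimal client shape; not the service's real driver API.
interface SqlSession {
  query(text: string): Promise<unknown>;
}

async function withStatementTimeout<T>(
  session: SqlSession,
  timeoutMs: number,
  work: (session: SqlSession) => Promise<T>,
): Promise<T> {
  if (!Number.isInteger(timeoutMs) || timeoutMs <= 0) {
    throw new Error("timeoutMs must be a positive integer");
  }
  await session.query("BEGIN");
  try {
    // SET does not accept bind parameters, so the validated integer is inlined.
    // A bare integer is interpreted as milliseconds.
    await session.query(`SET LOCAL statement_timeout = ${timeoutMs}`);
    const result = await work(session);
    await session.query("COMMIT");
    return result;
  } catch (err) {
    // Covers both query failures and the timeout cancelling a statement.
    await session.query("ROLLBACK");
    throw err;
  }
}
```

A heavy query would then be wrapped as `withStatementTimeout(session, 8000, (s) => s.query(...))`, matching the ~8s cap proposed in step 1.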

Follow-up — needs review & tests

  1. Behavior-preserving rewrites. Q2 PYQ join-filter, Q6 conditional joins (~100× faster), Q7 drop redundant join, Q5 anti-join CTEs (−95%), Q8/Q9 cleanup + the deck execute_sql fixes.
  2. The big win: async generation. Move daily-plan + content generation off the request path into a Trigger.dev job with a “generating” state. Removes Q1, Q3, Q5 from user requests entirely — and makes the rollover stampede structurally impossible.
  3. Parameterize execute_sql + close injection. Move it off the anon 3s-timeout role, restore pg_stat_statements visibility, eliminate string interpolation.
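
The "generating" state from item 2 can be sketched as a small request-path service. The maps below are in-memory stand-ins for the plans table and the job queue's dedupe key, and the `enqueue` callback stands in for dispatching a background job (a Trigger.dev job in the proposal above). All names are hypothetical.

```typescript
type DailyPlanResponse =
  | { status: "ready"; plan: string }
  | { status: "generating" };

class DailyPlanService {
  private plans = new Map<string, string>(); // stand-in for the plans table
  private inFlight = new Set<string>();      // stand-in for a job dedupe key
  private enqueue: (key: string) => void;    // stand-in for the job dispatcher

  constructor(enqueue: (key: string) => void) {
    this.enqueue = enqueue;
  }

  // Request path: never runs the slow generation query itself.
  getToday(userId: string, planDate: string): DailyPlanResponse {
    const key = `${userId}:${planDate}`;
    const plan = this.plans.get(key);
    if (plan !== undefined) return { status: "ready", plan };
    if (!this.inFlight.has(key)) {
      this.inFlight.add(key);
      this.enqueue(key); // at most one job per user per day
    }
    return { status: "generating" };
  }

  // Called by the background worker when generation finishes.
  complete(key: string, plan: string): void {
    this.plans.set(key, plan);
    this.inFlight.delete(key);
  }
}
```

The client would poll or subscribe while it sees "generating". Every request returns in constant time and holds no connection for the duration of generation, so the rollover burst no longer reaches the pool.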