OnCourse · Production 500s — Incident RCA & Fix Plan

Root cause · what happened today

Chronic for 10 days — then last night it collapsed

Two facts up front, both from production logs: nothing was deployed today (last release 06-17), and the 500s have been elevated for over a week. What changed last night was not new code — the day-rollover stampede crossed the pool's breaking point.

last deploy: 2026-06-17 · 3 days ago
daily-plan/timezone fix: #907 · 06-16
spike hour: 06-19 18:00 UTC ≈ midnight IST
99% of that hour's requests were 500s

[Chart] 500s per day, chronic & rising (Loki · production · count_over_time[1d])

[Chart] Hourly 500s, the cardiac event (06-17 → 06-19 · per hour · IST midnight = 18:00–19:00 UTC). Legend: hourly 500s · day-rollover stampede · prior rollovers (48, 77)

Verdict — why last night specifically

Every prior midnight rollover (06-17: 77, 06-18: 48) stayed just under the pool's usable ceiling. Last night it didn't. At IST midnight, every active user's daily plan rolls over; the first GET /daily-plan/today triggers an ~11s generation query — and there is no idempotency lock, so multi-tab / multi-device users each launch it. That burst pushed concurrent slow queries past the ~55 usable connections, everything queued, the 15s checkout timeout fired, and clients retried — amplifying the load into a 567-error collapse.

It is a tipping point, not a new bug: chronic load + slowly growing query latency + one concentrated rollover burst = nonlinear failure. The fix removes the stampede entirely (idempotency lock + async generation), so the rollover can never collapse the pool again.

The failure, step by step

How one slow query takes down every endpoint

1. 00:00 IST · Rollover: all users' daily plans expire at once.
2. No lock · N × 11s: each first /today fires the generation query, unguarded.
3. Pool · ~55 → 0: slow holds drain every usable connection.
4. 15s · ECHECKOUTTIMEOUT: unrelated requests can't get a connection → 500.
5. Client · retry: the app retries failed calls → more demand ↻
6. Result · 567 / hr: amplifying feedback loop = collapse.

Why it hits everything. The pool is shared. Once the daily-plan queries drain it, /badges, /lessons, /streak, flashcards — all fail with the same ECHECKOUTTIMEOUT, even though those queries are fast. They're collateral, not cause.

The core misconception

“How do 150 DAU exhaust 90 connections?”

They don't — by volume. A transaction pooler assumes each query borrows a connection for ~10ms and returns it, so a few connections serve thousands of users. The problem is hold time, not request count.

[Chart] Connection occupancy (1 query = how many “healthy” requests' worth): healthy query ~10ms · Q1 ~11,000ms ≈ 1,100×

[Chart] The 90-connection budget (Small instance · shared): system (~34) · app usable (~55) · ~6 slow queries drain it

One 10-second query occupies a connection as long as ~1,000 healthy requests would. So ~6 users hitting plan-generation in the same window is enough to drain the app's slots — DAU is irrelevant; p99 query latency is the lever.
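The hold-time argument above can be checked with a few lines of arithmetic. This is a minimal sketch: the ~10ms and ~11,000ms figures and the ~55 usable connections come from the text, while the request rates passed to `connectionsInUse` are illustrative assumptions, not measurements, and both function names are made up for this example. It applies Little's law (average connections in use = arrival rate × hold time).

```typescript
const HEALTHY_HOLD_MS = 10;   // from the text: a healthy query holds ~10ms
const Q1_HOLD_MS = 11_000;    // from the text: Q1 holds ~11,000ms

/** How many healthy requests' worth of connection time one query consumes. */
function occupancyRatio(holdMs: number, healthyMs: number = HEALTHY_HOLD_MS): number {
  return holdMs / healthyMs;
}

/** Little's law: average connections in use = arrival rate × hold time. */
function connectionsInUse(requestsPerSecond: number, holdMs: number): number {
  return (requestsPerSecond * holdMs) / 1000;
}

console.log(occupancyRatio(Q1_HOLD_MS));               // 1100
// Assumed rates, for illustration only:
console.log(connectionsInUse(100, HEALTHY_HOLD_MS));   // 1
console.log(connectionsInUse(5, Q1_HOLD_MS));          // 55
```

Under these assumed rates, 100 healthy requests per second keep about one connection busy, while a sustained 5 plan generations per second would occupy all ~55 usable slots.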

Evidence-validated · EXPLAIN + hypopg on prod

Every query, root cause & fix

9 query families ranked by total DB time. Each bar shows the measured hold time today vs after the fix. Every latency and index was validated against production with real EXPLAIN (ANALYZE) plans and hypopg hypothetical-index testing.

Cross-cutting amplifier. Q2, Q5, Q6, Q7, Q8, Q9 are called from the mobile app with staleTime: 0 + refetch-on-focus — every screen focus re-fires them, turning cache misses into repeat cold queries. Adding staleTime + refetchOnWindowFocus:false is high-leverage, near-zero risk.
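To make the amplifier concrete, here is a simplified model of the focus-refetch decision. The option names are the ones quoted above, which match TanStack Query's; the 5-minute `staleTime` value and the helper `refetchesOnFocus` are assumptions for illustration, and the helper is a model of the behavior rather than the library's own logic.

```typescript
interface QueryCachingOptions {
  staleTime: number;             // ms during which cached data counts as fresh
  refetchOnWindowFocus: boolean; // whether regaining focus may trigger a refetch
}

// Behavior described in the report: always stale, refetch on every focus.
const currentBehaviour: QueryCachingOptions = { staleTime: 0, refetchOnWindowFocus: true };

// Proposed defaults; the 5-minute window is an assumed value to tune per query.
const proposed: QueryCachingOptions = { staleTime: 5 * 60_000, refetchOnWindowFocus: false };

/** Simplified model: does a screen focus hit the API for data of this age? */
function refetchesOnFocus(opts: QueryCachingOptions, dataAgeMs: number): boolean {
  const isStale = dataAgeMs >= opts.staleTime;
  return opts.refetchOnWindowFocus && isStale;
}

console.log(refetchesOnFocus(currentBehaviour, 2_000)); // true
console.log(refetchesOnFocus(proposed, 2_000));         // false
```

In this model, data fetched two seconds ago is re-requested on focus today and served from cache under the proposed settings.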

The hidden class you flagged

Deck & flashcard queries via the `execute_sql` RPC

Profiling the whole execute_sql class turned up a second, larger failure mode — and the reason “deck timeouts” never appeared in the API 500 logs. It also surfaced a security finding.

40+ execute_sql RPC sites across 6 live repos; every one builds SQL by string interpolation, none parameterized

Slow sites (>150ms or 3s-timeout risk) are invisible to slow-query dashboards because literals collapse under one execute_sql entry; they were found only by reading code

2 roles: anon (3s timeout → cancelled, user sees a timeout) vs postgres (no timeout → holds a pooler connection 5–28s)

⚠ Security — SQL injection (HIGH/CRITICAL). The execute_sql RPC is SECURITY DEFINER, reached as the anon role, and every site interpolates raw values into the query string with no binding. Confirmed live vectors from unvalidated user input on POST /question/filter, GET /flashcards/search, POST /snippet/note/, GET /user/:id/past-tests. RLS + the 3s timeout are the only barriers; cross-table exfiltration and boolean/timing inference remain open. This should be triaged independently of performance.

Worst execute_sql offenders (prod EXPLAIN · once-off · off-peak)

Query / endpoint	Latency now	Failure mode	Primary fix
getPopulatedTestsForUser · GET /user/:id/past-tests	27.8s	4M-row cartesian fan-out, no pagination · pooler hold	page ids in a CTE before the joins; `tests(user_id, created_at DESC)`
getQuestionsWithSemanticSearch · Rezzy quiz-gen tool	12s	seq scan on `question_tags.tag_id`	`idx_question_tags_tag_id` (highest-leverage index)
getFlashcardsForScheduleDeckConfig · cron (= Q3)	12.7s	lossy BitmapAnd · pooler hold	covered by Q3 flashcard indexes + ANALYZE
fetchRelated{Pyqs,Questions}WithEmbedding · smartNote (bg)	~10s	tag seq scan + embedding scan	`idx_question_tags_tag_id`; guard null filters
getFlashcardsBySearchCount · /flashcards/search/count	8s	`OR ILIKE '%term%'` on 41k rows · anon 3s → timeout	drop ILIKE → use FTS, or trigram GIN
getSubjectsForPremadeDecksByUserId · /decks/premade/subjects	6.5s	`validation_status` not indexed → 77k heap fetches · anon → deck timeout	`idx_flashcards_owner_premade_valid` (index-only)
getFlashcardsFromDeck · GET /deck/:id/flashcards	5s	no LIMIT — returns the entire 13k-card deck · anon → deck timeout	paginate; covering index
topperDecksByExamIdsSql · /decks/topper, /decks/premade	4.9s	per-card heap probe on system flashcards	`idx_flashcards_sys_validated_id` (partial)
getDeckStats · /decks/topper (per deck)	120ms ×N	N+1 — one RPC per user deck via `Promise.all` → N concurrent pooler conns	batch: one `WHERE deck_id = ANY($ids) GROUP BY`

…plus high_yield list (3.1s), flashcard search (3.9s), embedding prompt (4.5s), question-filter count (1.45s), paused-tests (1.27s), notes count (0.95s), and more — 29 in total.
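The getDeckStats row in the table above suggests replacing one RPC per deck with a single batched query. A minimal sketch of that shape follows, assuming the stat is a per-deck card count read from `deck_flashcards` (the report does not say what getDeckStats actually computes); `buildDeckStatsQuery`, `indexByDeck`, and `card_count` are hypothetical names.

```typescript
interface DeckStatsRow {
  deck_id: string;
  card_count: number; // hypothetical stat; stands in for whatever getDeckStats returns
}

/** One parameterized statement for all decks, instead of N concurrent RPCs. */
function buildDeckStatsQuery(deckIds: string[]): { text: string; values: [string[]] } {
  return {
    text:
      "SELECT deck_id, COUNT(*)::int AS card_count " +
      "FROM deck_flashcards WHERE deck_id = ANY($1) GROUP BY deck_id",
    values: [deckIds], // the whole id list is bound as a single array parameter
  };
}

/** Re-keys batched rows so callers keep a per-deck lookup; decks with no rows get 0. */
function indexByDeck(deckIds: string[], rows: DeckStatsRow[]): Map<string, number> {
  const byDeck = new Map<string, number>(
    deckIds.map((id): [string, number] => [id, 0]),
  );
  for (const row of rows) {
    byDeck.set(row.deck_id, row.card_count);
  }
  return byDeck;
}
```

The batched form holds one pooler connection for one query regardless of how many decks the user has, which is the point of the fix.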

New indexes (distinct from the 11)

question_tags(tag_id) — highest leverage; table has only a composite PK, so every tag filter seq-scans 119k rows (helps the 12s + two ~10s queries)
flashcards(user_id, card_type, validation_status, subject_id, topic_id) — index-only deck/flashcard browse aggregations
flashcards(id) WHERE system+validated — partial; kills per-card heap probes (topper/exam decks)
tests(user_id, created_at DESC) — the page-first rewrite for all 3 past-test queries
notes_meta(topic_id, note_id), notes(user_id, subject_id)

Systemic fixes

Parameterize the whole class — replace execute_sql({query}) with $queryRaw + $N binds. One change fixes injection, restores pg_stat_statements visibility, and enables plan caching
Input allowlists — UUID-validate ids; enum/column allowlists for ORDER BY, cardType, status, etc.
Kill aggregate-then-paginate — push LIMIT to an id-only inner query before json_agg/GROUP BY (past-tests, question-filter, notes)
Eliminate N+1s — batch per-deck/per-row RPCs into one query
Delete dead code (questions.repository.v1.ts), VACUUM/ANALYZE stale tables
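The first two fixes in the list above work together: values travel as bind parameters, and anything that cannot be bound (such as an ORDER BY column) is checked against an allowlist. The sketch below illustrates that split under stated assumptions: `buildPastTestsQuery`, the `ORDER_COLUMNS` allowlist, the selected columns, and the limit cap of 100 are hypothetical, and the returned text/values pair is meant for whatever binding mechanism replaces `execute_sql`.

```typescript
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical allowlist; identifiers cannot be bound, so they must be enumerated.
const ORDER_COLUMNS = new Set<string>(["created_at", "id"]);

interface BoundQuery {
  text: string;      // SQL with $N placeholders only
  values: unknown[]; // user-supplied values, never spliced into the text
}

function buildPastTestsQuery(userId: string, orderBy: string, limit: number): BoundQuery {
  if (!UUID_RE.test(userId)) {
    throw new Error("userId must be a UUID");
  }
  if (!ORDER_COLUMNS.has(orderBy)) {
    throw new Error("orderBy is not an allowed column");
  }
  if (!Number.isInteger(limit) || limit < 1 || limit > 100) {
    throw new Error("limit must be an integer between 1 and 100");
  }
  return {
    // orderBy is safe to interpolate only because it passed the allowlist above.
    text: `SELECT id, created_at FROM tests WHERE user_id = $1 ORDER BY ${orderBy} DESC LIMIT $2`,
    values: [userId, limit],
  };
}
```

Because the statement text no longer varies with user input, each call site also shows up as one stable entry in query statistics rather than disappearing behind a single `execute_sql` call.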

Consolidated remediation

11 indexes (deduped), rewrites & guardrails

Two indexes are shared across queries. All validated with hypopg. CREATE INDEX CONCURRENTLY holds no blocking lock but must run out-of-band (not via prisma migrate, which is transactional).

Index	Table	Serves	Effect
idx_questions_topic_val_covering	questions	Q2	index-only; −42% alone / −67% w/ rewrite
idx_question_tags_tag_question	question_tags	Q2	kills 42MB external sort
idx_flashcards_user_topic_valid_unsusp	flashcards	Q3	removes lossy BitmapAnd (~2.6s)
idx_deck_flashcards_flashcard_created	deck_flashcards	Q3	MAX nested loop → index-only (~5.2s)
idx_test_questions_test_correct_created	test_questions	Q7+Q4	index-only probe
idx_questions_id_incl_topic	questions	Q1+Q4	enables Q4 rewrite index-only
idx_questions_topic_id_id	questions	Q4	lessons→questions index-only
idx_lessons_subject_completion	lessons	Q8	both variants index-only, −72%
idx_exercises_lesson_id_id	exercises	Q9	removes cold heap reads
idx_test_templates_created_by_created_at	test_templates	Q5	recent-template exclusion
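PostgreSQL rejects CREATE INDEX CONCURRENTLY inside a transaction block, which is why these builds have to be issued one statement at a time outside a transactional migration. The sketch below generates such statements. The column lists for the indexes in the table above are not given in this report, so the example uses two indexes whose columns appear in the "New indexes" list; `idx_tests_user_created_at` is a hypothetical name, as are `IndexSpec` and `createIndexConcurrently`.

```typescript
interface IndexSpec {
  name: string;
  table: string;
  columns: string[]; // a column may carry a sort direction, e.g. "created_at DESC"
}

const IDENT = /^[a-z_][a-z0-9_]*$/;
const COLUMN = /^[a-z_][a-z0-9_]*( (ASC|DESC))?$/;

/** Renders one standalone DDL statement; run each outside BEGIN/COMMIT. */
function createIndexConcurrently(spec: IndexSpec): string {
  if (!IDENT.test(spec.name) || !IDENT.test(spec.table)) {
    throw new Error("unexpected identifier");
  }
  for (const col of spec.columns) {
    if (!COLUMN.test(col)) throw new Error("unexpected column: " + col);
  }
  return `CREATE INDEX CONCURRENTLY IF NOT EXISTS ${spec.name} ON ${spec.table} (${spec.columns.join(", ")});`;
}

const specs: IndexSpec[] = [
  { name: "idx_question_tags_tag_id", table: "question_tags", columns: ["tag_id"] },
  // Index name below is hypothetical; the columns come from the report.
  { name: "idx_tests_user_created_at", table: "tests", columns: ["user_id", "created_at DESC"] },
];

const statements = specs.map(createIndexConcurrently);
for (const sql of statements) console.log(sql);
```

One caveat with this form: a concurrent build that fails leaves an invalid index behind, and IF NOT EXISTS will then skip it on a rerun, so a failed build should be dropped before retrying.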

Sequenced for safety

Stop the bleeding, then fix the structure

Ship tonight — no app release

statement_timeout circuit-breaker. Cap the app role at ~8s so a 12–16s query can't pin a connection. Must verify it's honored through the :6543 pooler; if not, set it app-side via SET LOCAL per heavy query.
ANALYZE + VACUUM. `ANALYZE flashcards; ANALYZE deck_flashcards;` instantly fixes Q3's bad plan, and `VACUUM exercise_to_assets` helps Q1. Zero risk.
Idempotency lock on plan generation. Advisory lock / ON CONFLICT + Redis SET NX on /regenerate — directly kills last night's rollover stampede.
Build the indexes CONCURRENTLY, off-peak, on the direct connection — CRITICAL/HIGH first.
Frontend caching (OTA). staleTime + refetchOnWindowFocus:false — cuts cache-miss request volume across 6 queries.
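The idempotency lock in the list above has a simple core idea: concurrent requests for the same user and day should share one generation run. The sketch below shows that idea as an in-process single-flight map. It is a swapped-in simplification, not the advisory-lock or Redis SET NX mechanism named above: it only deduplicates within one server process, so a multi-instance deployment would still need the shared lock. `singleFlight` and the key format are hypothetical.

```typescript
const inFlight = new Map<string, Promise<unknown>>();

/**
 * Runs `work` at most once per key at a time. Callers that arrive while a run
 * is in progress receive the same promise instead of starting another run.
 */
function singleFlight<T>(key: string, work: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) {
    return existing as Promise<T>;
  }
  const run = work();
  inFlight.set(key, run);
  // Clear the entry once the run settles, whether it succeeded or failed.
  const clear = (): void => {
    if (inFlight.get(key) === run) inFlight.delete(key);
  };
  run.then(clear, clear);
  return run;
}

// Usage: two tabs for the same user and date trigger one generation.
let generationRuns = 0;
const generate = (): Promise<string> => {
  generationRuns++;
  return Promise.resolve("plan");
};
const tabA = singleFlight("user-1:2026-06-20", generate);
const tabB = singleFlight("user-1:2026-06-20", generate);
console.log(tabA === tabB, generationRuns); // true 1
```

The key combines user and plan date so that different users, or the same user on a new day, are never blocked by each other.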

Follow-up — needs review & tests

Behavior-preserving rewrites. Q2 PYQ join-filter, Q6 conditional joins (−100×), Q7 drop redundant join, Q5 anti-join CTEs (−95%), Q8/Q9 cleanup + the deck execute_sql fixes.
The big win: async generation. Move daily-plan + content generation off the request path into a Trigger.dev job with a “generating” state. Removes Q1, Q3, Q5 from user requests entirely — and makes the rollover stampede structurally impossible.
Parameterize execute_sql + close injection. Move it off the anon 3s-timeout role, restore pg_stat_statements visibility, eliminate string interpolation.
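The async-generation item above changes what the request path does: it never runs the slow query, it only reports state and enqueues work once. A minimal in-memory sketch of that state machine follows. `DailyPlanStore`, its methods, and the `enqueue` callback are hypothetical stand-ins; a real version would persist state in the database and hand the job to the background runner.

```typescript
type PlanState = { status: "generating" } | { status: "ready"; plan: string };

class DailyPlanStore {
  private states = new Map<string, PlanState>();
  private enqueue: (userId: string, date: string) => void;

  constructor(enqueue: (userId: string, date: string) => void) {
    this.enqueue = enqueue; // hypothetical hook that schedules the background job
  }

  /** Request path: returns immediately; enqueues generation only on first sight. */
  getToday(userId: string, date: string): PlanState {
    const key = `${userId}:${date}`;
    const current = this.states.get(key);
    if (current) return current;
    const generating: PlanState = { status: "generating" };
    this.states.set(key, generating);
    this.enqueue(userId, date);
    return generating;
  }

  /** Called by the background job when the plan has been built. */
  complete(userId: string, date: string, plan: string): void {
    this.states.set(`${userId}:${date}`, { status: "ready", plan });
  }
}
```

With this shape, a rollover burst produces one enqueued job per user and a series of fast "generating" responses, rather than one ~11s query per request.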
