OnCourse · The Permanent Fix — Scalable Architecture

One inequality governs everything

The whole failure class, in one line

Every symptom — daily-plan 500s, ECHECKOUTTIMEOUT at rollover, the 27.8s past-tests endpoint, the silent 10-day regression — is the same thing: dataset-proportional work run synchronously on an unbounded, unobservable pool. The pressure on Postgres is one product, and today both terms are uncapped. The fix bounds both.

DB pressure ≈ uncapped → boundedactive_requests × uncapped → boundedquery_seconds

click → bound both terms · today: both uncapped — peak crosses ~55 usable connections and the pool seizes

Three axes do it: cap query_seconds (indexes + page-shaped queries + enforced statement_timeout), cap & partition active_requests (one pooler chokepoint with per-tier budgets), and remove heavy work from the request path (async generation + amortized caches). The seven pillars below, each with the before→after that makes it concrete.

Daily-plan generation → async & decoupled

Structural

Stop generating the plan on a web connection: pre-compute ahead of rollover, fall back to a bounded async worker, and let an idempotent INSERT…ON CONFLICT — not a request-path lock — guarantee correctness.

Problem today

When no plan exists, /today runs the ~11s 6-CTE generation synchronously on the inbound connection, with a bare INSERT (no ON CONFLICT). At local midnight every plan expires at once → multi-tab/device first-requests each run 11s and pin a connection → the rollover stampede that drained the pool.

Permanent fix

Move generation into a Trigger.dev job on a bounded queue (concurrencyLimit 8); /today returns 202 generating in <100ms and the client polls. A scheduled pre-compute warms plans per-timezone before rollover, so the common case is a sub-100ms existing-plan read. INSERT…ON CONFLICT (user_id, plan_date) DO NOTHING + a job idempotencyKey make duplicates impossible — no request-path lock needed (a lock would itself hold a connection).

Why it scales

Request latency is decoupled from work latency; expensive DB sessions are fixed at ≤8 regardless of how many users roll over. The herd becomes a bounded queue, not a connection storm.

split getOrGenerateDailyPlan → cheap read + enqueue; extract generateDailyPlanForUser()
new generate-daily-plan.trigger.ts (queue concurrency 8, SET LOCAL statement_timeout 8s)
new hourly precomputeDailyPlans scheduled task (per-tz cohorts)
createDailyPlan → INSERT…ON CONFLICT DO NOTHING RETURNING

day-rollover · ~30 requestsconnection pool · 90

⏱ 00:00

before · synchronous · holds 11s

ECHECKOUTTIMEOUT · 90/90

after · 202 in <100ms + bounded queue 202 generating 202

worker queue · concurrencyLimit 8

pre-warmed: plans ready

Content-derived stats → amortized cache

Structural

Stop re-grinding the entire content bank on every request — compute the user-independent aggregates once, read them in microseconds forever after.

Problem today

~80% of the 11s generation is recomputing aggregates identical for every user: a 77,231-row system-flashcard scan, a 113,900-row validated-question COUNT(DISTINCT) + disk sort, 1,738 random exercise-asset heap reads. Re-ground from scratch on every plan.

Permanent fix

A precomputed subject_content_stats / topic_content_stats layer (plain UPSERT-able tables, refreshed on content change + short TTL). The hot path then computes only the small per-user delta; COUNT(DISTINCT)→COUNT (PK-unique keys) drops three external-merge sorts.

Why it scales

DB work goes from O(requests × content) to O(content-changes) + O(tiny per-user delta). Adding users no longer multiplies the heavy scan — it's already done.

new tables + content-change UPSERT hooks / scheduled refresh
Q1 rewritten to JOIN the stats tables, compute only user-specific CTEs

compute per requestcompute once · read forever

11s / request

113,900 questions

77,231 flashcards

1.18M assets

<100ms / request

computed once →

content
stats

+ per-user delta

execute_sql RPC → parameterized repository

Structural

Separate the data from the SQL: one repository-layer change seals injection, makes every slow query visible again, and lets Postgres reuse plans.

Problem today

52 sites build SQL by string interpolation and run it as the anon role via a SECURITY DEFINER RPC. Four user-facing endpoints take raw input straight into the SQL text — HIGH/CRITICAL injection. And because literals embed in a dynamic EXECUTE, the whole class collapses under one pg_stat_statements id → invisible to slow-query ranking.

Permanent fix

Replace every site with Prisma $queryRaw + $N binds behind a typed repo and a single safeSql allowlist (UUID/enum/ORDER-BY validation). Then drop the RPC and revoke the anon EXECUTE grant.

Why it scales · three wins, one change

① injection becomes structurally impossible (data can only be a value, never structure); ② one normalized queryid per query → ranking works; ③ Postgres caches one generic plan per shape instead of re-parsing every call.

new src/repository/_sql/safeSql.ts (only sanctioned builder)
migrate 52 sites; delete dead questions.repository.v1.ts; drop execute_sql

string-interpolated (anon)$N bind params

input: all' OR '1'='1

WHERE card_type = 'all' OR '1'='1'

card #1

card #2 — leaked

… all rows

shield: BREACHED

slow-query rank: invisible

input → bound value

WHERE card_type = $1 ⟵ "all' OR '1'='1"

0 rows — treated as a literal string

shield: sealed 🛡

queryid: ranked · plan cached

Query shape → page-before-aggregate

Near-term

Make every read cost the size of the page you return, not the size of the history the user owns.

Problem today

/past-tests joins two one-to-many children to one parent and de-dups over the product — 4M rows, two disk sorts, 27.8s, no LIMIT. Others GROUP BY + json_agg before the LIMIT, so page 100 costs as much as page 1. And getDeckStats is an N+1 (one RPC per deck in Promise.all).

Permanent fix

Three canonical shapes: page-before-aggregate (id-only WHERE+ORDER+LIMIT CTE first, then LEFT JOIN LATERAL json_agg on the ~20 survivors); no cartesians (trusted denormalized counts + per-child LATERAL); no N+1 (one WHERE x = ANY($ids) GROUP BY).

Why it scales

O(tests×topics×questions) → O(page_size). Page 1 and page 50, a new user and a 2,000-test user, all cost the same. p99 stops tracking the largest account.

aggregate-then-paginate · cartesianpage-first · O(page)

scan 4,000,341 → return 20

json_agg(DISTINCT) ⟶ 💾 spill · 27.8s

scan 20 → return 20

id-CTE → LATERAL · <50ms

Indexing → hot paths become index-only

Immediate

Put WHERE + ORDER BY + projected columns inside the index, so the planner answers from a narrow B-tree and never random-reads the 2.1 GB heap.

Problem today

question_tags has only a composite PK, so every tag_id filter seq-scans 119k rows (the 12s quiz-gen + ~10s RAG queries). validation_status is absent from the flashcards index, forcing 20k–77k random heap fetches on the 2.1 GB table (deck-subjects 6.5s).

Permanent fix

One rule: per hot path, a covering/partial index holding the filter + sort columns (INCLUDE the projected ones) → Index-Only Scan, heap untouched. Highest-leverage: question_tags(tag_id, question_id); the flashcards covering/partial set; tests(user_id, created_at).

Why it scales

Total-table-size work → result-set-size work. The cache-resident narrow index grows like log(n); the 2.1 GB heap that doesn't fit in cache is never touched.

index → random heap readsindex-only · 0 heap

heap fetches → 24,916 · 6.5s flashcards heap · 2.1 GB

heap fetches → 0 · <20ms covering index · in-cache

Connection & concurrency architecture

Immediate

Turn the connection pool into a fixed-size, partitioned, self-draining resource that no single query or request herd can ever monopolize.

Problem today

A bare new PrismaClient() (no connection_limit), 57 Trigger.dev processes, and the anon PostgREST path all open uncoordinated pools against one 90-connection ceiling, with no enforced statement_timeout through the pooler. Backends and hold-time are both unbounded.

Permanent fix

Supavisor (6543) is the single chokepoint (5432 reserved for migrations). Per-tier pool budgets (web ~30, worker ~15, + system) sum under 90 with headroom; statement_timeout enforced on every path (verified through the pooler, or SET LOCAL); excess load becomes bounded queue wait, never more backends.

Why it scales

Server-side backends become O(1) — fixed at the sum of pool sizes (≤55 by construction) no matter how many replicas, workers, or users. Capacity scales by turning one dial.

shared bank · no timeoutpartitioned · self-draining

unbounded → 90/90 seize

POOL EXHAUSTED

fixed budget · green headroom

web 30

wk 15

sys 34

free

every connection self-frees within statement_timeout → excess load = queue wait, not new backends

Observability & guardrails

Structural

Make every query visible, every regression alarmed, and every blind spot un-mergeable — so the next slow query is caught on Day 1, not at collapse on Day 10.

Problem today

The heaviest class was structurally invisible (collapsed under one queryid); HTTP buckets top out at 5s so an 11–16s request looks like 5.1s; no pool / 500-rate / slow-query alert; 41 console.error bypass Loki. A chronic regression bled for ~10 days with no signal until the pool hit 90.

Permanent fix

Parameterization restores per-query queryids; a Prisma $extends emits a db_query_duration histogram + pool gauge; widen HTTP buckets to 30s; multi-window SLO burn-rate alerts on pool saturation, 500-rate, slow queries; a CI ratchet drives string-SQL 52→0 and forbids regrowth.

Why it scales

Detection flips from O(engineer-hours) hand-EXPLAINing to O(1) automated. MTTD: ~10 days → minutes, independent of traffic or query count.

blind · regression hidden until collapsevisible · alert fires Day 1

Day 10 · COLLAPSE

🔔 alert · Day 1

The durable principles

What to remember when you build the next feature

Decouple request latency from work latency. Heavy, variable-cost work belongs on a bounded worker tier — never on a web connection. The request holds a connection <100ms and returns a 202.

Make cost proportional to the response, not the dataset. Page-before-aggregate + index-only reads make page 1 and page 50, a new user and a 2,000-test user, all cost the same.

Amortize shared work: compute once, read forever. Aggregates identical for every user are precomputed; DB work goes from O(requests × content) to O(content-changes).

Treat the pool as a fixed, partitioned budget. Per-tier sizes sum under max_connections; every connection self-frees within a timeout; excess load is queue wait, not new backends.

Keep user data out of the SQL string. $N binds make injection impossible, cache one plan per shape, and restore a stable queryid — three wins from one mechanism.

Put correctness in the database, not request-path locks. INSERT…ON CONFLICT + job idempotency keys collapse races to one row and one run — no lock that itself holds a connection.

Sequenced, structural — not band-aids

Phased rollout

Phase 1 — Bound the blast radius · foundation, mostly no-code

Supavisor = single chokepoint; Prisma connection_limit (web ~8, worker ~3) + pool_timeout; per-tier pool sizes under 90
Enforce statement_timeout (verify through pooler, else SET LOCAL)

Phase 2 — Make reads page-proportional

page-before-aggregate on list endpoints; kill the /past-tests cartesian; batch the getDeckStats N+1

Phase 3 — Seal & illuminate the data layer

safeSql allowlist; migrate all 52 execute_sql sites to $queryRaw; drop the RPC + revoke anon EXECUTE

Phase 4 — Decouple heavy generation

daily-plan async (queue concurrency 8, idempotency, ON CONFLICT, 202); content-stats cache; pre-compute job

Phase 5 — Close the loop

query-duration histogram + pool gauge; widen HTTP buckets to 30s; SLO burn-rate alerts; CI ratchet string-SQL 52→0

Target state · measurable

How we know it's done

✓/daily-plan/today p99 <100ms · 100% of generation off the web tier on a queue of 8 · zero 23505/500 races

✓Warm SLOs at 10× DAU: quiz <300ms, lessons <50ms, past-tests <50ms (from 27.8s) — no unbounded full-history endpoint

✓Pool a fixed budget: backends ≤~55/90 by construction · web/worker isolated · alert 0.7 warn / 0.9 page · zero ECHECKOUTTIMEOUT at rollover

✓Zero string-built SQL: 52→0 · anon EXECUTE revoked · injection surface = 0 · all SQL via parameterized $queryRaw

✓Amortized stats: Q1 from ~11s to sub-100ms residual · 3 disk sorts gone · staleness gauge alarmed

✓MTTD for a slow-query regression: ~10 days → minutes · every query one queryid + one histogram series