System Design Framework
REF: https://youtu.be/i7twT3x5yv8?si=i-_8lAJDNhOzgWNL
Understand the problem and establish design scope (5m)
- Define the problem, rephrase it
- Discuss on the direction of the problem
- Clarify constraints
- Rough numbers & SLOs
Propose high-level design and get buy-in (20m)
- Flow:
- User -> API -> Service -> MQ/Stream -> DB -> Cache -> observability (logs, metrics, tracing) ...
- Data models (core)
- API design (core)
- Components & services
- Storage
- Security? audit logs?
Design deep dive (15m)
- Options for the tools mentioned above, deep dive into each one
- Discuss on the trade-offs, risk, and constraints
- SQL vs NoSQL
- horizontal vs vertical scaling
- Caching? invalidation triggers, TTL/jitter, single-flight
- GEO?
- secruity, auth -> audit logs
- testing, rollout, rollback, etc
Wrap up (5m)
- recap the design
- rediscuss on the trade-offs, risk, and constraints
- shard keys?
- cache TTL?
- monitor metrics, logs, tracing
- next steps? (bonus) - talk about what was not asked to design but open up the discussion to show understanding and seniority
- closing
. . .
System-Design One-Pager (print this)
Trigger mantra: SPEC → NUMBERS → FLOW → FAILURE → TRADE-OFFS → ROADMAP.
90-second opening script (memorise):
- “Goal: X delivers Y to Z users. Non-goals: A/B. Success = p95 latency ≤ L, freshness ≤ F, cost ≤ C.”
- “Scale assumptions: QPS, payload size S, daily writes/reads, growth G.”
- “Critical path: request → validate → write/read → persist → respond (happy path first).”
- “State model + data model at a glance (entities, keys, hot partitions).”
- “We’ll design for 1 failure now (idempotency + retries), and evolve here (roadmap).”
Back-of-envelope (fill in fast):
- Throughput:
req/s × payload
→ network/CPU budget. - Storage:
daily writes × avg size × retention
(+ replication factor). - Latency budget: front-end
≤50ms
, edge≤20ms
, core read/write≤120ms
, queue/async≤1–5s
(MVP). - Costs: biggest 2 line items (storage + egress/compute); name them.
Data model checklist:
- Entities & keys (avoid hot keys; include secondary indexes).
- Partitioning/sharding rule (what’s the shard key and why).
- State machine for long-running work (e.g.,
PENDING → RUNNING → RETRYING → SUCCEEDED|FAILED(DLQ)
).
Data-flow narrative (say out loud, in order):
- Ingress format & size → validation → write path → derived data (indexes/vectors) → read path & ranking.
- Idempotency key + retry policy on every write/async step.
- Exactly which store each read hits (and why).
Failure inventory (pick 2 now, note the rest):
- Duplicate requests; partial writes; queue poison message; downstream timeout; hot partition.
- Controls: idempotency keys, exponential backoff, DLQ with alert, circuit breaker, TTLs, SLO + error budget.
Trade-off playbook (2 pros/2 cons, commit):
- SQL vs NoSQL; stream vs batch; cache vs no cache; single region vs multi.
- Name what you give up and how you’d reverse the decision later.
Roadmap ladder (close strong):
- MVP (90 days): 1 region, 1 DB, 1 queue, observability baseline.
- Next 10×: read replicas/ANN index, cache for N hot queries, backfill job, multi-AZ/region, privacy/PII pipeline.
Security & compliance in one breath: AuthN (JWT) → AuthZ (tenant/resource checks) → data segregation → PII handling (hash/tokenise, deletion path) → audit logs.
Why this beats your Synthesis attempt
- You start with the critical path and state machine, not gateways/caches/ES.
- You limit primitives (one DB + one queue) until reads prove otherwise.
- Every box exposes input, output, SLO, and failure behaviour—no black boxes.
- You anchor with numbers so choices feel intentional, not ornamental.
Deep keys (mental models you can reuse)
- Two-Store Rule (MVP): one blob store + one OLTP/OLAP (or OLTP+pgvector). Add ES/Redis only when a measured query needs it.
- One-Queue Rule: one durable queue for long tasks; state machine per job; retries + DLQ from day 1.
- Happy-Path First: prove upload → process → search (or analogous) end-to-end before any embellishment.
- Talk in Deltas: each evolution step should change exactly one constraint (throughput, latency, durability, cost).
- Hot-spot Radar: say where hot keys and fan-outs will occur and how you’ll detect them (metrics & sample alerts).
- Observability trinity: logs with correlation IDs, RED/USE metrics, 1–2 SLOs with error budgets.
Reflection prompts (turn this sting into skill)
Answer these in writing—short bullets:
- Priority drift: When under pressure, what made me reach for “shiny” components before the core flow? How will I catch that impulse (e.g., say the 90-sec script before touching the marker)?
- Narrative gap: Could I explain the entire data path in 60 seconds without diagrams? If not, tighten it.
- State clarity: Where did I fail to name states and transitions? Rewrite them as a table.
- Numbers poverty: Which number did I avoid stating (traffic, size, cost)? Why? Add a default baseline toolkit (below).
- Failure courage: Which failure mode do I habitually ignore? Design one control and one alert for it.
- Life parallel: In your HRBP project, scope crept and you optimised mid-UAT. What’s the SD analogue? → Lock MVP spine; create a “parking lot” for nice-to-haves; only pull from it after the spine stands.
Baseline numbers toolkit (memorise reasonable defaults)
- Payloads: JSON request 1–5 KB; doc text page 2–4 KB; image 200–500 KB.
- Throughput tiers: prototype 10 rps; MVP 100 rps; scale target 1k rps.
- Storage: plain text ~1 byte/char; embed vector 768 floats ≈ 3 KB.
- Queue: Pub/Sub/Kafka consumer stable at O(100–1k msg/s) per worker with I/O-bound tasks.
- Latency: network hop 5–20 ms in-region; disk read 1–10 ms SSD; ANN/vector search 10–80 ms per 10k vectors (ballpark).
- Cost: egress is often the silent killer; cache prevents egress, not just latency.
Red-team yourself (10 interrupts an interviewer might use)
- “Walk the happy path, no optimisations.”
- “Point to each arrow and tell me payload & size.”
- “What’s the shard key? Why won’t it hot-spot?”
- “If this downstream is down for 10 minutes?”
- “How do I retry safely without dup effects?”
- “Which read hits which index?”
- “Show me the job state machine.”
- “Where do you log correlation IDs?”
- “What’s your single biggest cost line?”
- “How do you roll back a bad deploy?”
Practice protocol (repeat weekly)
- 5-min warm-up: recite the 90-sec opening, draw the state machine.
- 15-min drill: pick any common SD prompt (feed, rate limiter, search, chat). Do: numbers → flow → failures.
- 10-min red-team: have a friend interrupt with the 10 questions above.
- Post-mortem note (3 bullets): what I skipped, what I over-engineered, one number I’ll bring next time.
Your cheat sheet, tuned (copy/paste)
System-Design Mistakes — Pocket Checklist
- Vague spec → Write goal, non-goals, SLOs.
- No numbers → Quick calc: QPS, size, storage, latency, cost.
- Messy diagram → Boxes, numbered arrows, legend; pass-1 < 2 min.
- Silent schema → Entities, keys, indexes, hot-key risk.
- No state machine → Jobs: PENDING→RUNNING→RETRYING→SUCCEEDED|FAILED(DLQ).
- No data-flow story → Trace one request with format/size/retry/idempotency.
- Ignoring failures → Choose 2, add backoff, DLQ, circuit breaker, runbook trigger.
- Zero trade-offs → For each big choice, 2 pros/2 cons + reversal path.
- No roadmap → MVP spine now; next 10× bottlenecks later.
- Security & cost blindspot → AuthN/Z, PII path, top 2 cost drivers.
Mantra: SPEC → NUMBERS → FLOW → FAILURE → TRADE-OFFS → ROADMAP.
I have built a repo as a preparation hub: - Interview Preparation Materials - ByteByteGo - System Design Interview