advanced · session · latency · availability · consistency · sharding · 14 min

Hash vs Tree: Designing a mixed-read global store

TL;DR
When your workload is asymmetric — heavy point reads AND non-trivial range queries — a single data structure can't win on both axes. Hash partitioning nails point reads; ordered structures nail ranges. The right move is usually a hybrid: hash-partitioned primary + asynchronously-maintained ordered secondary, with a freshness contract the product can live with.
Question
Which architecture best matches the workload — millions of point reads/sec, frequent range/top-N reads, acceptable (not strict) freshness for leaderboards, horizontal scale, and low point-read tail latency?
Look for a rubric mismatch: which options sacrifice one requirement to optimize another?

Why this works — the underlying principle

The root insight: a data structure optimized for one query shape is usually pessimized for the other. Hash tables and tree-like structures sit on opposite ends of a locality trade-off.

  • Hash: uniform distribution, O(1) point lookups, no locality between neighboring keys. Range queries require scatter-gather.
  • Tree/ordered: preserves locality, O(log n) point lookups, O(log n + k) range queries. Writes serialize at the owning node, and hot ranges become hot nodes.
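To make the locality trade-off concrete, here is a toy in-process sketch (illustrative only, not any real store's API) comparing how many partitions a small range query must touch under each scheme:

```python
import bisect

# Hypothetical 4-way cluster: same keys, two partitioning schemes.
KEYS = [f"user{i:04d}" for i in range(1000)]
N_PARTS = 4

# Hash partitioning: partition = hash(key) % N. O(1) point lookup,
# but neighboring keys land on unrelated partitions.
hash_parts = [set() for _ in range(N_PARTS)]
for k in KEYS:
    hash_parts[hash(k) % N_PARTS].add(k)

# Range partitioning: keys sorted, split into contiguous slices, so a
# range query only touches the partitions covering [lo, hi).
bounds = ["user0250", "user0500", "user0750"]  # assumed split points
range_parts = [sorted(k for k in KEYS
                      if (i == 0 or k >= bounds[i - 1])
                      and (i == len(bounds) or k < bounds[i]))
               for i in range(N_PARTS)]

def hash_range_query(lo, hi):
    """Scatter-gather: every partition must be asked."""
    hits, touched = [], 0
    for part in hash_parts:
        touched += 1
        hits += [k for k in part if lo <= k < hi]
    return sorted(hits), touched

def range_partition_query(lo, hi):
    """Only partitions whose slice overlaps [lo, hi) are asked."""
    lo_p = bisect.bisect_right(bounds, lo)
    hi_p = bisect.bisect_left(bounds, hi)
    hits, touched = [], 0
    for part in range_parts[lo_p:hi_p + 1]:
        touched += 1
        i, j = bisect.bisect_left(part, lo), bisect.bisect_left(part, hi)
        hits += part[i:j]
    return hits, touched

h_hits, h_touched = hash_range_query("user0100", "user0150")
r_hits, r_touched = range_partition_query("user0100", "user0150")
assert h_hits == r_hits            # same answer either way...
print(h_touched, r_touched)        # ...but hash asks all 4 partitions,
                                   # range partitioning asks only 1 here
```

Both schemes return the same 50 keys; the difference is fan-out, which is exactly the tail-latency and cost story from the bullets above.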

When the workload mixes both shapes (heavy point reads AND non-trivial range reads), the correct move is an asymmetric system — don't try to make one structure do both jobs. This is why the hybrid wins.
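The hybrid shape can be sketched in a few lines. This is a toy in-process model under assumed names (`HybridStore`, its CDC queue), not a real system: the point is that the primary is always fresh while top-N is only as fresh as the last CDC apply.

```python
import heapq
from collections import deque

class HybridStore:
    """Toy hybrid: hash-style primary for point reads, plus an ordered
    secondary for top-N, refreshed asynchronously via a CDC queue."""

    def __init__(self):
        self.primary = {}      # key -> score (authoritative)
        self.secondary = {}    # key -> score (possibly stale)
        self.cdc = deque()     # change log: (key, score)

    def write(self, key, score):
        self.primary[key] = score      # synchronous point-write path
        self.cdc.append((key, score))  # change captured for later apply

    def point_read(self, key):
        return self.primary[key]       # always fresh, O(1)

    def top_n(self, n):
        # Served from the secondary: fast, but only as fresh as the
        # last CDC apply -- the 'acceptable freshness' contract.
        return heapq.nlargest(n, self.secondary.items(),
                              key=lambda kv: kv[1])

    def apply_cdc(self):
        while self.cdc:
            key, score = self.cdc.popleft()
            self.secondary[key] = score

store = HybridStore()
store.write("alice", 90)
store.write("bob", 120)
print(store.point_read("bob"))  # 120 -- fresh immediately
print(store.top_n(1))           # [] -- secondary hasn't caught up yet
store.apply_cdc()
print(store.top_n(1))           # [('bob', 120)]
```

The gap between the second and third `top_n` calls is the freshness contract made visible: reads against the secondary lag the primary until the CDC apply runs.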

The consistency conversation

The magic phrase in the scenario is 'acceptable freshness.' Interviewers use that phrase deliberately — it tells you the product can live with eventual consistency on the leaderboard path. That permission is what unlocks the async-secondary architecture.

What if the scenario demanded strict consistency?

Suppose the interviewer instead said: "the leaderboard must always show the current true ranking — no stale reads." Option C breaks immediately. The hybrid architecture is built on the assumption that a few seconds of async lag are fine; that lag is the reason it scales. You cannot bolt strong consistency onto a CDC pipeline without giving up the independence that made the design attractive in the first place.

In that world the ranking of options flips:

  • Option C drops out — the freshness contract it depends on is no longer available to you.
  • Option B moves up — range-partitioned with leader-based strong consistency is designed for exactly this workload. You are now paying the coordination cost on purpose, for a guarantee the product actually needs.
  • Option D becomes the textbook answer if the question also demands that all regions agree on the ranking at the same moment (linearizability + global ordering). Spanner, CockroachDB, FoundationDB, and TiDB target this shape — consensus-backed ordered indexes where both point and range queries hit one strongly-consistent source of truth.
  • Option A still loses — hash partitioning destroys the locality a strongly-ordered leaderboard needs, regardless of consistency.

What you accept when you flip to B or D:

  • Lower per-shard throughput — consensus rounds and quorum reads aren't free.
  • Higher tail latency on point reads — every read goes through a leader or a quorum check, not an any-replica local hop.
  • A throughput ceiling measured in tens of thousands of ops/sec per shard, instead of the millions of ops/sec a hash-partitioned cluster delivers in aggregate. Strongly-consistent systems don't match the raw scale of hash-partitioned KV stores — that's the trade you're making.
  • Much higher operational complexity — running a globally-consistent database is a specialty practice, not something you'd adopt unless the product needs it.

A useful exercise: re-read the original scenario and ask yourself what would change about your answer if each requirement were dropped or strengthened — 'millions of lookups per second' → 'thousands', 'acceptable freshness' → 'strict', 'survive node failures' → 'survive region failures'. Every lever changes the optimal design; interviewers test whether you can reason about those levers, not whether you memorized one answer.

Operational concerns you should raise

  • CDC lag monitoring: instrument end-to-end lag from primary write to secondary visibility. Set p99 and p999 SLOs on it.
  • Dual-write avoidance: do NOT dual-write to primary and secondary from the application — that path creates silent consistency bugs. Always go primary → CDC → secondary.
  • Index rebuild: have a story for rebuilding the secondary from the primary (full resync from a snapshot + catch-up from the log).
  • Capacity planning: the secondary often needs more memory than expected because ordered indices hold auxiliary structures (skip lists, inverted indices).
  • Failure injection: test what happens when CDC is paused for 30 minutes — what does the UX look like when leaderboards are stale?
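The first bullet — end-to-end CDC lag SLOs — can be sketched with a nearest-rank percentile over lag samples. The sample values and SLO thresholds below are illustrative assumptions; a production system would use streaming histograms rather than sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over lag samples (seconds)."""
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))
    return xs[max(rank - 1, 0)]

# Hypothetical end-to-end lag samples: primary commit time ->
# secondary visibility time, in seconds. Note the stalled-apply tail.
lags = [0.4] * 950 + [2.0] * 45 + [30.0] * 5

p99 = percentile(lags, 99)
p999 = percentile(lags, 99.9)
print(f"p99={p99}s p999={p999}s")

SLO_P99, SLO_P999 = 5.0, 60.0   # assumed freshness contract
assert p99 <= SLO_P99           # alert/page when these fail --
assert p999 <= SLO_P999         # stale leaderboards are a UX incident
```

The point of tracking p999 separately is the tail: the median lag above looks healthy while a handful of stalled applies sit at 30s, which is exactly what the failure-injection bullet is probing for.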

Interviewer follow-ups

  • What if the scenario said 'must never show stale leaderboard'? → You move toward B (range-partitioned with strong reads) or D (consensus). Cost goes up; scale ceiling goes down.
  • How would you handle hot keys in the primary? → Request coalescing, in-memory L1 in front of the KV, possibly dedicated replicas for known hot keys.
  • How do you paginate top-N stably when scores change? → Snapshot the secondary with a version/epoch and read within that epoch for the paginated session.
  • How do you handle deletes? → Tombstones in the primary, tombstone propagation via CDC, TTL-based cleanup in the secondary. Critically, make sure the secondary applies the tombstone before serving reads that would otherwise return the deleted key.
  • What does observability look like? → RED metrics on both tiers, CDC lag histogram, secondary freshness heatmap, read-path fallback rate.
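The epoch-pinned pagination answer above can be sketched as follows. The class and method names are hypothetical; the idea is simply that each CDC apply publishes an immutable snapshot, and a paginating session keeps reading the epoch it started on.

```python
class EpochedLeaderboard:
    """Toy epoch-pinned pagination: scores can change between pages,
    but a session pinned to one epoch sees a stable ranking."""

    def __init__(self):
        self.epoch = 0
        self.snapshots = {0: []}   # epoch -> ranking, highest first

    def publish(self, scores):
        """Called after each CDC apply batch."""
        self.epoch += 1
        ranking = sorted(scores.items(), key=lambda kv: -kv[1])
        self.snapshots[self.epoch] = ranking
        # a real system would garbage-collect epochs with no live sessions

    def page(self, epoch, offset, limit):
        return self.snapshots[epoch][offset:offset + limit]

lb = EpochedLeaderboard()
lb.publish({"alice": 90, "bob": 120, "carol": 100})
session_epoch = lb.epoch                   # session pins epoch 1
print(lb.page(session_epoch, 0, 2))        # [('bob', 120), ('carol', 100)]
lb.publish({"alice": 300, "bob": 120, "carol": 100})  # scores move
print(lb.page(session_epoch, 2, 2))        # [('alice', 90)] -- page 2 is
                                           # still consistent with page 1
```

Without the epoch pin, alice's jump to 300 between the two page reads would make her appear on page 1 after the session had already read it, so she'd either show twice or vanish from the paginated view.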