advanced · session · latency · availability · consistency · sharding · 14 min

Hash vs Tree: Designing a mixed-read global store

TL;DR
When your workload is asymmetric — heavy point reads AND non-trivial range queries — a single data structure can't win on both axes. Hash partitioning nails point reads; ordered structures nail ranges. The right move is usually a hybrid: hash-partitioned primary + asynchronously-maintained ordered secondary, with a freshness contract the product can live with.
Question
Which architecture best matches the workload — millions of point reads/sec, frequent range/top-N reads, acceptable (not strict) freshness for leaderboards, horizontal scale, and low point-read tail latency?
Look for a rubric mismatch: which options sacrifice one requirement to optimize another?

Why this works — the underlying principle

The root insight: a data structure optimized for one query shape is usually pessimized for the other. Hash tables and tree-like structures sit on opposite ends of a locality trade-off.

  • Hash: uniform distribution, O(1) point lookups, no locality between neighboring keys. Range queries require scatter-gather.
  • Tree/ordered: preserves locality, O(log n) point lookups, O(log n + k) range queries. Writes serialize at the owning node, and hot ranges become hot nodes.
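To make the locality trade-off concrete, here is a toy in-process sketch (illustrative only, not any real store's API) comparing how many partitions a small range query must touch under each scheme:

```python
import bisect

# Hypothetical 4-way cluster: same keys, two partitioning schemes.
KEYS = [f"user{i:04d}" for i in range(1000)]
N_PARTS = 4

# Hash partitioning: partition = hash(key) % N. O(1) point lookup,
# but neighboring keys land on unrelated partitions.
hash_parts = [set() for _ in range(N_PARTS)]
for k in KEYS:
    hash_parts[hash(k) % N_PARTS].add(k)

# Range partitioning: keys sorted, split into contiguous slices, so a
# range query only touches the partitions covering [lo, hi).
bounds = ["user0250", "user0500", "user0750"]  # assumed split points
range_parts = [sorted(k for k in KEYS
                      if (i == 0 or k >= bounds[i - 1])
                      and (i == len(bounds) or k < bounds[i]))
               for i in range(N_PARTS)]

def hash_range_query(lo, hi):
    """Scatter-gather: every partition must be asked."""
    hits, touched = [], 0
    for part in hash_parts:
        touched += 1
        hits += [k for k in part if lo <= k < hi]
    return sorted(hits), touched

def range_partition_query(lo, hi):
    """Only partitions whose slice overlaps [lo, hi) are asked."""
    lo_p = bisect.bisect_right(bounds, lo)
    hi_p = bisect.bisect_left(bounds, hi)
    hits, touched = [], 0
    for part in range_parts[lo_p:hi_p + 1]:
        touched += 1
        i, j = bisect.bisect_left(part, lo), bisect.bisect_left(part, hi)
        hits += part[i:j]
    return hits, touched

h_hits, h_touched = hash_range_query("user0100", "user0150")
r_hits, r_touched = range_partition_query("user0100", "user0150")
assert h_hits == r_hits            # same answer either way...
print(h_touched, r_touched)        # ...but hash asks all 4 partitions,
                                   # range partitioning asks only 1 here
```

Both schemes return the same 50 keys; the difference is fan-out, which is exactly the tail-latency and cost story from the bullets above.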

When the workload mixes both shapes (heavy point reads AND non-trivial range reads), the correct move is an asymmetric system — don't try to make one structure do both jobs. This is why the hybrid wins.
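The hybrid shape can be sketched in a few lines. This is a toy in-process model under assumed names (`HybridStore`, its CDC queue), not a real system: the point is that the primary is always fresh while top-N is only as fresh as the last CDC apply.

```python
import heapq
from collections import deque

class HybridStore:
    """Toy hybrid: hash-style primary for point reads, plus an ordered
    secondary for top-N, refreshed asynchronously via a CDC queue."""

    def __init__(self):
        self.primary = {}      # key -> score (authoritative)
        self.secondary = {}    # key -> score (possibly stale)
        self.cdc = deque()     # change log: (key, score)

    def write(self, key, score):
        self.primary[key] = score      # synchronous point-write path
        self.cdc.append((key, score))  # change captured for later apply

    def point_read(self, key):
        return self.primary[key]       # always fresh, O(1)

    def top_n(self, n):
        # Served from the secondary: fast, but only as fresh as the
        # last CDC apply -- the 'acceptable freshness' contract.
        return heapq.nlargest(n, self.secondary.items(),
                              key=lambda kv: kv[1])

    def apply_cdc(self):
        while self.cdc:
            key, score = self.cdc.popleft()
            self.secondary[key] = score

store = HybridStore()
store.write("alice", 90)
store.write("bob", 120)
print(store.point_read("bob"))  # 120 -- fresh immediately
print(store.top_n(1))           # [] -- secondary hasn't caught up yet
store.apply_cdc()
print(store.top_n(1))           # [('bob', 120)]
```

The gap between the second and third `top_n` calls is the freshness contract made visible: reads against the secondary lag the primary until the CDC apply runs.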

The consistency conversation

The magic phrase in the scenario is 'acceptable freshness.' Interviewers use that phrase deliberately — it tells you the product can live with eventual consistency on the leaderboard path. That permission is what unlocks the async-secondary architecture.

What if the scenario demanded strict consistency?

Suppose the interviewer instead said: "the leaderboard must always show the current true ranking — no stale reads." Option C breaks immediately. The hybrid architecture is built on the assumption that a few seconds of async lag are fine; that lag is the reason it scales. You cannot bolt strong consistency onto a CDC pipeline without giving up the independence that made the design attractive in the first place.

In that world the ranking of options flips:

  • Option C drops out — the freshness contract it depends on is no longer available to you.
  • Option B moves up — range-partitioned with leader-based strong consistency is designed for exactly this workload. You are now paying the coordination cost on purpose, for a guarantee the product actually needs.
  • Option D becomes the textbook answer if the question also demands that all regions agree on the ranking at the same moment (linearizability + global ordering). Spanner, CockroachDB, FoundationDB, and TiDB target this shape — consensus-backed ordered indexes where both point and range queries hit one strongly-consistent source of truth.
  • Option A still loses — hash partitioning destroys the locality a strongly-ordered leaderboard needs, regardless of consistency.

What you accept when you flip to B or D:

  • Lower per-shard throughput — consensus rounds and quorum reads aren't free.
  • Higher tail latency on point reads — every read goes through a leader or a quorum check, not an any-replica local hop.
  • A throughput ceiling measured in tens of thousands of ops/sec per shard, instead of the millions of ops/sec a hash-partitioned cluster delivers in aggregate. Strongly-consistent systems don't match the raw scale of hash-partitioned KV stores — that's the trade you're making.
  • Much higher operational complexity — running a globally-consistent database is a specialty practice, not something you'd adopt unless the product needs it.

A useful exercise: re-read the original scenario and ask yourself what would change about your answer if each requirement were dropped or strengthened — 'millions of lookups per second' → 'thousands', 'acceptable freshness' → 'strict', 'survive node failures' → 'survive region failures'. Every lever changes the optimal design; interviewers test whether you can reason about those levers, not whether you memorized one answer.

Operational concerns you should raise

  • CDC lag monitoring: instrument end-to-end lag from primary write to secondary visibility. Set p99 and p999 SLOs on it.
  • Dual-write avoidance: do NOT dual-write to primary and secondary from the application — that path creates silent consistency bugs. Always go primary → CDC → secondary.
  • Index rebuild: have a story for rebuilding the secondary from the primary (full resync from a snapshot + catch-up from the log).
  • Capacity planning: the secondary often needs more memory than expected because ordered indices hold auxiliary structures (skip lists, inverted indices).
  • Failure injection: test what happens when CDC is paused for 30 minutes — what does the UX look like when leaderboards are stale?
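The first bullet — end-to-end CDC lag SLOs — can be sketched with a nearest-rank percentile over lag samples. The sample values and SLO thresholds below are illustrative assumptions; a production system would use streaming histograms rather than sorting raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over lag samples (seconds)."""
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))
    return xs[max(rank - 1, 0)]

# Hypothetical end-to-end lag samples: primary commit time ->
# secondary visibility time, in seconds. Note the stalled-apply tail.
lags = [0.4] * 950 + [2.0] * 45 + [30.0] * 5

p99 = percentile(lags, 99)
p999 = percentile(lags, 99.9)
print(f"p99={p99}s p999={p999}s")

SLO_P99, SLO_P999 = 5.0, 60.0   # assumed freshness contract
assert p99 <= SLO_P99           # alert/page when these fail --
assert p999 <= SLO_P999         # stale leaderboards are a UX incident
```

The point of tracking p999 separately is the tail: the median lag above looks healthy while a handful of stalled applies sit at 30s, which is exactly what the failure-injection bullet is probing for.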

Interviewer follow-ups

  • What if the scenario said 'must never show stale leaderboard'? → You move toward B (range-partitioned with strong reads) or D (consensus). Cost goes up; scale ceiling goes down.
  • How would you handle hot keys in the primary? → Request coalescing, in-memory L1 in front of the KV, possibly dedicated replicas for known hot keys.
  • How do you paginate top-N stably when scores change? → Snapshot the secondary with a version/epoch and read within that epoch for the paginated session.
  • How do you handle deletes? → Tombstones in the primary, tombstone propagation via CDC, TTL-based cleanup in the secondary. Critically, make sure the secondary applies the tombstone before serving reads that would otherwise return the deleted key.
  • What does observability look like? → RED metrics on both tiers, CDC lag histogram, secondary freshness heatmap, read-path fallback rate.
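The epoch-pinned pagination answer above can be sketched as follows. The class and method names are hypothetical; the idea is simply that each CDC apply publishes an immutable snapshot, and a paginating session keeps reading the epoch it started on.

```python
class EpochedLeaderboard:
    """Toy epoch-pinned pagination: scores can change between pages,
    but a session pinned to one epoch sees a stable ranking."""

    def __init__(self):
        self.epoch = 0
        self.snapshots = {0: []}   # epoch -> ranking, highest first

    def publish(self, scores):
        """Called after each CDC apply batch."""
        self.epoch += 1
        ranking = sorted(scores.items(), key=lambda kv: -kv[1])
        self.snapshots[self.epoch] = ranking
        # a real system would garbage-collect epochs with no live sessions

    def page(self, epoch, offset, limit):
        return self.snapshots[epoch][offset:offset + limit]

lb = EpochedLeaderboard()
lb.publish({"alice": 90, "bob": 120, "carol": 100})
session_epoch = lb.epoch                   # session pins epoch 1
print(lb.page(session_epoch, 0, 2))        # [('bob', 120), ('carol', 100)]
lb.publish({"alice": 300, "bob": 120, "carol": 100})  # scores move
print(lb.page(session_epoch, 2, 2))        # [('alice', 90)] -- page 2 is
                                           # still consistent with page 1
```

Without the epoch pin, alice's jump to 300 between the two page reads would make her appear on page 1 after the session had already read it, so she'd either show twice or vanish from the paginated view.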