How trees work — a novice's field guide
From binary search trees to B-trees and LSM trees. Why sorted data structures exist and where each one shines.
Start with a phone book
Imagine you have a printed phone book with a million names, sorted alphabetically. You want to find 'Priya Shah'.
You don't scan from the first page. You open roughly in the middle — say you land on 'Mendoza'. Priya comes after. You jump to the middle of the second half — land on 'Ramirez'. Still past it. Keep halving.
After about 20 jumps, you've found her. Out of a million names.
If the phone book were shuffled randomly, you'd have to read every single page to find Priya. That's 'O(n)' — tens of thousands of times slower at this size. Sorted order is a superpower.
A tree is a data structure that gives you this for free
A binary search tree (BST) is the simplest tree. Each node holds a value and has up to two children.
[10]
/ \
[5] [15]
/ \ / \
[2] [8] [12] [20]The rule: everything in the LEFT subtree is smaller than the node, everything in the RIGHT subtree is larger. That rule is enforced every time you insert — and it's what makes searching fast.
To find 12: start at 10. 12 > 10, so go right. You're now at 15. 12 < 15, so go left. You've found 12. Three steps instead of checking all seven nodes.
In code
function insert(node, value) {
if (!node) return { value, left: null, right: null };
if (value < node.value) {
node.left = insert(node.left, value);
} else if (value > node.value) {
node.right = insert(node.right, value);
}
return node;
}
function find(node, value) {
if (!node) return null;
if (value === node.value) return node;
if (value < node.value) return find(node.left, value);
return find(node.right, value);
}The magic: range queries
Here's where trees beat hash tables badly. Say you want 'everything between 8 and 20'.
On a hash table, you'd have to check EVERY SINGLE KEY. There's no concept of 'nearby' — keys are scattered randomly by design.
On a tree, you just walk the tree in sorted order starting at 8 and stop as soon as you pass 20. Because sorted order is baked into the structure.
function range(node, low, high, results = []) {
if (!node) return results;
// Only recurse left if something there could be ≥ low
if (low < node.value) range(node.left, low, high, results);
// Collect if we're inside the range
if (low <= node.value && node.value <= high) results.push(node.value);
// Only recurse right if something there could be ≤ high
if (high > node.value) range(node.right, low, high, results);
return results;
}Range query on a binary search tree
Pick a low and high bound and watch the traversal. Nodes in range turn green; whole branches that can't contain any answer are pruned in gray. This pruning is the whole reason range queries are fast on trees.
What else could I use for range queries?
Trees aren't the ONLY way to answer 'give me everything between X and Y'. Here's the full menu, roughly ordered from 'naive' to 'production-grade':
- Linear scan of unsorted data — O(n). Check every item. Fine for small lists, catastrophic at scale.
- Hash table + filter — also O(n). Hashing buys you nothing here; hash tables don't preserve order, so you still have to inspect every bucket. Common novice trap: 'but my hash table is fast!' — it is, for point lookups. Not for ranges.
- Sort on demand — take your unsorted data, sort it, then binary-search into the sorted array. Great for one-off batch jobs. Terrible if data changes often, because you re-sort on every insert.
- Pre-sorted array — keep data sorted at all times. Reads are O(log n + k). But inserts become O(n) because you have to shift elements. Fine for read-mostly workloads.
- Tree (BST, AVL, red-black, skip list) — O(log n) insert AND O(log n + k) range reads. This is the sweet spot. Both operations stay cheap as the data grows.
- B-tree / B+ tree — a tree optimized for disk. Same algorithmic properties as a balanced BST but each node is a disk page with hundreds of keys, so random I/O is minimized. This is what relational databases use.
- LSM tree — sorted, but writes are optimized. Reads cost a little more (multiple levels to check). Good when writes dominate.
- Specialized ordered indexes — Redis sorted sets (skip list internally), Elasticsearch (Lucene's BKD tree for numeric ranges + inverted index for text), ClickHouse's sparse indexes on sorted data. All tree-flavored under the hood.
Why 'tree' specifically wins
The reason trees come up over and over is that they hit a rare sweet spot: they keep BOTH operations cheap. Inserts stay O(log n). Range reads stay O(log n + k). Nothing else gives you both.
- Linear scans are cheap to insert into but slow to query.
- Pre-sorted arrays are cheap to query but slow to insert into.
- Hash tables are cheap for everything EXCEPT range queries.
- Trees maintain order without paying the array-insert cost. That's the magic.
Balance matters (a lot)
A binary tree works great IF it stays balanced — roughly the same depth on both sides. If it gets lopsided, your 'log n' performance collapses.
Watch what happens if you insert [1, 2, 3, 4, 5] in order into a plain BST:
[1]
\
[2]
\
[3]
\
[4]
\
[5]This is no better than scanning a shuffled list. The solution is self-balancing trees — data structures that rearrange themselves as you insert to keep both sides roughly equal.
- AVL trees — strict balance, slightly slower inserts, faster reads.
- Red-black trees — looser balance, what Java's TreeMap and C++'s std::map use.
- Skip lists — not technically a tree, but achieves similar log-n behavior using probability. Redis uses these for sorted sets.
But real databases don't use binary trees
Binary trees are perfect for teaching but terrible for databases. Here's why.
When a database has a billion rows, a balanced binary tree is ~30 levels deep. Looking up a row means potentially 30 disk reads. Disks (even SSDs) are WAY slower at random reads than sequential reads — each level of the tree could be a different disk page. That's ~30 random I/Os for ONE lookup.
The fix: make each tree node BIGGER. Instead of 2 children per node, have hundreds.
Enter the B-tree
A B-tree (the 'B' is just a letter — don't read into it) is a tree where each node holds MANY keys and has MANY children. A typical B-tree node in a database might hold 100-500 keys.
[10 | 20 | 30]
/ | | \
[2,5,8] [12,15] [22,25] [32,35,40]Finding 25 in this tree: look at root → 25 is between 20 and 30 → follow that pointer → now at [22,25] → found. Two hops instead of many.
With a branching factor of 500, finding ONE record in a BILLION items takes only about 4 hops (log base 500 of a billion ≈ 3.4). Four disk reads instead of thirty.
Why the branching factor matters so much
- Fewer levels = fewer disk reads per lookup.
- Each node matches a disk page, so one I/O brings in hundreds of keys.
- Range scans walk sideways through leaf nodes (in B+ trees, the leaves are linked) — very cache-friendly.
- Trade-off: inserts occasionally need to 'split' a full node, which means writing two pages. Real engines buffer this.
When writes are crazier than reads: LSM trees
B-trees handle balanced read/write workloads beautifully. But what if you have insane write traffic — a logging system, a metrics database, a write-heavy IoT pipeline taking a million writes per second?
Every B-tree write might need to read a page off disk, modify it, and write it back. If the page is full, split it. That's a lot of I/O per write.
LSM (Log-Structured Merge) trees take a completely different approach:
- All writes go into a small in-memory buffer (called a 'memtable' — often a skip list internally).
- When the memtable fills up, it's sorted and flushed to disk as a single immutable file, called an SSTable ('Sorted String Table').
- New SSTables keep stacking up. A background process periodically merges smaller older SSTables into larger ones — this is called compaction.
- Reads check the memtable first, then each level of SSTables in turn. Bloom filters tell the engine to skip files that definitely don't contain the key.
The upside: writes are almost always just a memory write plus an occasional sequential disk flush. Sequential I/O is WAY faster than random I/O. Writes go brrr.
The downside: reads might have to check multiple files (memtable + several SSTable levels). Compaction uses CPU and disk in the background. Read amplification is the main cost you pay.
Cheat sheet: when to use what
- Point lookup only ('give me user 42') — hash table wins. O(1) is unbeatable.
- Range queries ('users 100-200') — tree. Sortedness is the whole game.
- Top-N queries ('top 100 by score') — tree / sorted index.
- Sorted iteration ('iterate alphabetically') — tree. Hash tables lose ALL ordering info.
- Write-heavy, ok with read amplification — LSM tree.
- Balanced read/write, ok with write amplification — B-tree.
- In-memory, smallest possible lookup — hash table beats both.
How this connects to the 'Hash vs Tree' question
Now the architectural trade-offs in the system-design question should feel concrete:
- 'Hash partitioning is fast for point reads' — because hash → partition is arithmetic, no tree traversal or sorted lookup involved.
- 'Hash partitioning destroys locality' — because the whole point of hashing is that adjacent keys go to different places. You can't do a range scan without talking to every shard.
- 'Range-partitioned tree needs coordination' — because to find which node owns key K, you have to walk (or look up) a metadata tree. And if the design is strongly consistent, the owner node serializes reads through its leader.
- 'Redis sorted sets', 'Elasticsearch', 'LSM tree' in the hybrid option — these are all ordered structures that specialize in range/top-N queries. The hybrid design pairs a hash-based primary (fast point reads) with one of these ordered secondaries (fast range queries).
Further reading
- 'Designing Data-Intensive Applications' by Martin Kleppmann — Chapter 3 covers B-trees and LSM trees in accessible depth. The best single book on this material.
- CMU 15-445 (Database Systems) — free video lectures from Carnegie Mellon, very approachable.
- 'Database Internals' by Alex Petrov — deeper if you want to implement one.
- Redis sorted sets use skip lists internally. The Redis source is surprisingly readable if you want to see an ordered structure in C.