The Data Wing: Six Short Films on How Distributed Data Behaves

Six short films on what happens to data the moment it stops fitting on one machine: the forced choice, the dial, taste, the cache, the branch library, the shard. The first wing of Learning Maps, the durable principle under each rented AWS label.

A storm takes down the one phone line between two libraries. A reader is standing at the desk wanting to change a record. You, the librarian, have exactly two options and no third one. Serve her now and let the two buildings disagree for a while, or refuse until the line is back and the buildings can agree first. There is no version where you get both. That is the whole of distributed systems in one scene, and it is where the Learning Maps Data Wing opens.

The Data Wing is six short films, the first arc of the series. Each one takes a single idea about data that lives in more than one place and tells it as a story you can follow on a walk with your eyes closed. The hand-drawn napkin map is there if you want to watch, but the lesson lives in the words and the silence, not the picture.

I built this while studying for the AWS Solutions Architect exam, after I got tired of the way the material is usually taught: a pile of service names to memorize and forget. The book that flipped my perspective was O'Reilly's System Design on AWS. It made me want the systems underneath the exam, not just the service names, and to carry that thinking straight into my day job in platform engineering and the systems-analysis mapping I already do.

The exam is this year's vocabulary. The thing underneath it has not changed since the first time two computers had to agree on a number, and it will still be true when AWS has renamed half its products. So every lesson does one specific thing. It splits the durable principle, the physics that does not change, from the rented label, the marketing name you are paying for this year. You learn the principle as the spine. The buzzword is a sticker you peel off when the vendor rebrands.

Here is the arc, room by room.

1. The Forced Choice

Watch The Forced Choice on YouTube (9:05).

When a network splits a system in two, you can keep every copy in agreement or you can keep answering, but not both at once. This is not a design flaw you can engineer around. It is a law. The label is CAP, and then PACELC for the quieter version that holds even when nothing is broken: strong agreement always costs a trip across the network, which costs milliseconds on every read. The principle is simpler than the acronym. When a replicated system cannot talk to itself, it has to trade agreement for answers.
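The two options at the desk can be written down as a toy decision function. This is a minimal sketch, not any real database's behavior: the Mode names, handle_write, and the returned strings are all made up for illustration.

```python
from enum import Enum


class Mode(Enum):
    PREFER_AGREEMENT = "CP"  # refuse until the line is back
    PREFER_ANSWERS = "AP"    # serve now, reconcile later


def handle_write(mode: Mode, line_is_up: bool) -> str:
    """Decide what one library does with a write request."""
    if line_is_up:
        # No partition: the change can reach the other building first.
        return "accepted, both buildings agree"
    if mode is Mode.PREFER_AGREEMENT:
        # Partitioned and insisting on agreement: give up answering.
        return "refused until the line is back"
    # Partitioned and insisting on answering: give up agreement for a while.
    return "accepted locally, buildings disagree until they sync"


# The storm scene: the line is down, and each mode picks its one option.
for mode in Mode:
    print(mode.value, "->", handle_write(mode, line_is_up=False))
```

Notice that there is no third branch to add once the line is down. That missing branch is the forced choice.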

2. The Dial

Watch The Dial on YouTube (6:22).

Agreement is not a switch, it is a dial. Spread your data across three buildings, then decide how many have to sign off before a write counts and how many you check before you trust a read. If those two numbers add up to more than the number of buildings, every read has to touch at least one building that took the latest write, so the read is guaranteed to see it. The label is quorum, R plus W greater than N. The principle is that you get to place the trade exactly where you want it instead of accepting someone's default. On AWS that whole dial is often hidden behind one checkbox and a price: in DynamoDB, for example, a strongly consistent read consumes twice the read capacity of an eventually consistent one.
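The quorum rule is small enough to check by brute force. The sketch below, with hypothetical helper names, compares the R plus W greater than N shortcut against literally trying every possible set of buildings a write and a read could land on.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The shortcut: read and write sets must share a replica when R + W > N."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return r + w > n


def always_intersect(n: int, w: int, r: int) -> bool:
    """The long way: try every write set against every read set."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# Three buildings, every position of the dial.
N = 3
for w in range(1, N + 1):
    for r in range(1, N + 1):
        verdict = "overlap" if quorums_overlap(N, w, r) else "may miss the latest write"
        print(f"N={N} W={w} R={r}: {verdict}")
```

With three buildings, W=2 and R=2 is the familiar middle of the dial. W=3 with R=1 makes reads cheap and writes slow, and W=1 with R=3 does the reverse. Overlap is only the first half of the story: a real store still needs version numbers or timestamps so the reader can tell which of the answers it collected is the newest.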

3. Just Enough Agreement

Watch Just Enough Agreement on YouTube (6:16).

Strong consistency everywhere is the expensive default you almost never actually need. A reader updates her address in one building, drives across town, walks into another, and sees the old one. She has lost nothing; it is syncing. But she feels lied to. The lazy fix is to make every read everywhere strong forever, which is building a cathedral to fix a doorbell. The taste move is to send her next read to the building that took her write, and leave everyone else on the cheap path. You do not buy correctness. You buy the specific guarantee the story needs, and the cheapest one that holds.
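Here is one way the taste move might look as routing logic, usually called read-your-writes. It is a sketch under assumptions: ReadRouter, the building names, and the five-second window are invented for the example, and the window stands in for however long syncing really takes in your system.

```python
import time


class ReadRouter:
    """Sends a reader back to the building that took her write, for a short window."""

    def __init__(self, window_s=5.0, clock=time.monotonic):
        self._window_s = window_s  # assumed upper bound on sync time
        self._clock = clock
        self._last_write = {}      # reader_id -> (building, written_at)

    def record_write(self, reader_id, building):
        self._last_write[reader_id] = (building, self._clock())

    def route_read(self, reader_id, nearest):
        last = self._last_write.get(reader_id)
        if last is not None:
            building, written_at = last
            if self._clock() - written_at < self._window_s:
                return building  # pay for freshness, but only for her
            del self._last_write[reader_id]  # window passed, forget the write
        return nearest  # everyone else stays on the cheap path


now = [0.0]  # fake clock so the example is deterministic
router = ReadRouter(window_s=5.0, clock=lambda: now[0])
router.record_write("ana", "north")
print(router.route_read("ana", nearest="south"))  # north: she just wrote there
print(router.route_read("ben", nearest="south"))  # south: he wrote nothing
now[0] = 10.0
print(router.route_read("ana", nearest="south"))  # south: the window has passed
```

Only the one reader who just wrote pays the extra trip, and only for a few seconds.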

4. The Cache

Watch The Cache on YouTube (5:32).

A cache is a bet that the recent past predicts the near future. You keep a copy of the answer right next to the reader, knowing it might be a few seconds stale, because the walk you save is worth the risk. It only pays off when people ask for the same things over and over. If every request is unique, your hit rate is zero and you have added cost for nothing. The label is ElastiCache, hit, miss, TTL. The principle is that you pay for speed in staleness, and the only thing that makes the bet smart is repetition.
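To see why repetition is what makes the bet pay, here is a minimal in-memory sketch. It is not the ElastiCache API: TTLCache, slow_lookup, and the 30-second TTL are stand-ins chosen for the example.

```python
import time


class TTLCache:
    """Keeps answers for ttl_s seconds and counts hits and misses."""

    def __init__(self, ttl_s, loader, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._loader = loader  # the slow walk to the real source
        self._clock = clock
        self._entries = {}     # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            self.hits += 1
            return entry[0]  # possibly stale, but close by
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl_s)
        return value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def slow_lookup(key):
    return f"record for {key}"


repeated = TTLCache(ttl_s=30, loader=slow_lookup)
for key in ["a", "b", "a", "a", "b", "a"]:
    repeated.get(key)
print(repeated.hits, repeated.misses)  # 4 hits, 2 misses

unique = TTLCache(ttl_s=30, loader=slow_lookup)
for key in ["a", "b", "c", "d", "e", "f"]:
    unique.get(key)
print(unique.hit_rate)  # 0.0: every request was new, the cache bought nothing
```

Same cache, same TTL, same number of requests. The only thing that changed between the two runs is how often readers asked for something they had asked for before.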

5. The Branch Library

Watch The Branch Library on YouTube (7:23).

One copy is not enough, for two opposite reasons: to survive a fire, and to serve more readers. So you replicate. One building leads and takes the writes; the others follow and serve reads. The toll is lag, the gap between a change at the leader and the same change at a follower. Wait for the followers to confirm and you are safe but slow. Fire and forget and you are fast but you can lose the tail if the leader burns first. The architecture picks itself by the fear you name first. Afraid of losing data, you choose Multi-AZ. Afraid of slow reads under load, you choose read replicas. Naming the goal first is the entire trick the exam keeps testing.
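The safe-but-slow versus fast-but-lossy trade can be shown with a toy leader and two followers. This is a sketch, not how RDS or any real engine replicates: the Leader class and its method names are hypothetical, and lists stand in for the replicated logs.

```python
class Leader:
    """A leader with followers, able to replicate in either style."""

    def __init__(self, follower_count):
        self.log = []
        self.followers = [[] for _ in range(follower_count)]
        self._pending = []  # acknowledged writes not yet shipped to followers

    def write_sync(self, record):
        # Safe but slow: acknowledge only after every follower has the record.
        self.log.append(record)
        for follower in self.followers:
            follower.append(record)

    def write_async(self, record):
        # Fast but lossy: acknowledge at once, ship to followers later.
        self.log.append(record)
        self._pending.append(record)

    def ship_pending(self):
        for record in self._pending:
            for follower in self.followers:
                follower.append(record)
        self._pending.clear()

    def lag(self):
        # How many records the slowest follower is behind the leader.
        return len(self.log) - min(len(f) for f in self.followers)


leader = Leader(follower_count=2)
leader.write_sync("a")
leader.write_async("b")
leader.write_async("c")
print(leader.lag())           # 2: the tail exists only on the leader
print(leader.followers[0])    # ['a']: all that survives if the leader burns now
leader.ship_pending()
print(leader.lag())           # 0: the followers have caught up
```

The records "b" and "c" are the tail. Until ship_pending runs, they are the price of the faster write.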

6. The Shard

Watch The Shard on YouTube (7:27).

Replication puts a whole copy everywhere. But what if the data is too big for any one machine? Then you stop copying and start cutting. Authors A through F in one building, G through M in the next, N through Z in the third. No building holds the whole library; each holds a slice. The beautiful part is that the rule that cuts the data is the same rule that finds it. That rule is the shard key, and it decides everything: whether load spreads evenly or piles onto one hot building, whether your everyday queries stay in one slice or have to scatter across all of them. Pick it badly and one shard throttles while the rest sit idle.
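The A through F, G through M, N through Z split above can be sketched as a range-based shard key. The building names, helper functions, and author list are made up for the example. The point is that put and get both call the same shard_for rule.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bound of each range: A-F, G-M, N-Z.
UPPER_BOUNDS = ["F", "M", "Z"]
BUILDINGS = ["building-1", "building-2", "building-3"]


def shard_for(author):
    """The shard key rule: the author's first letter picks the building."""
    first = author.strip()[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"cannot place author {author!r}")
    return BUILDINGS[bisect_left(UPPER_BOUNDS, first)]


shelves = {name: {} for name in BUILDINGS}


def put(author, record):
    shelves[shard_for(author)][author] = record  # the rule that cuts...


def get(author):
    return shelves[shard_for(author)].get(author)  # ...is the rule that finds


authors = ["Austen", "Borges", "Calvino", "Dickens", "Eco", "Faulkner",
           "Morrison", "Woolf"]
for author in authors:
    put(author, f"card for {author}")

print(get("Morrison"))
print(Counter(shard_for(a) for a in authors))
```

With this particular list, six of the eight authors land in building-1. That is the hot building: the key is valid, but the load it produces is lopsided. A different key, such as a hash of the name, would spread the load more evenly at the cost of making "every author from A to C" a scatter across all three buildings.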

What the wing adds up to

By the end of these six you have the spine of how distributed data behaves. The storm forces the choice, the dial sizes it, taste is how you spend it, the cache buys speed with staleness, the branch library spreads the copies, the shard cuts the data. After that, the AWS services stop being a mystery and become a lookup. You are not memorizing which product to pick. You are naming the trade, and letting the service pick itself.

Consistency is never free. Availability, latency, freshness, coverage: you are always trading one for another. Learn to name the trade and most of system design stops being intimidating.

Watch the Data Wing on Learning Maps, where the rest of the map is laid out wing by wing: the systems-talking wing, the security and cost wings, and the capstones that design whole systems end to end. If you want the frame the whole thing sits on, I wrote about how AWS is math and Kubernetes is physics: two ways of reasoning about infrastructure, and why the bill and the latency are the only honest feedback you get.