January 31, 2026
Cache me if you can
Sparse File LRU Cache
Big files, tiny rent: “ghost bytes” speed-ups spark “just compress it!” brawl
TLDR: Amplitude uses “sparse files” to cache only the parts of big data files people actually read, saving pricey SSD space and cloud fetches. Comments clash over whether this clever setup beats the simpler answer—compress everything—with debates on speed, cost, and complexity making it the week’s storage soap opera.
Amplitude rolled out a crafty storage move: sparse files—think “pretend-empty” parts of a file that don’t take space until they’re actually used—so they can cache only the popular chunks of data from S3 onto speedy (and pricey) local SSDs. It’s paired with an LRU (least‑recently‑used) eviction policy tracked in a tiny on-box database (RocksDB), so old chunks get tossed when space runs tight. The pitch: keep whole files logically, but only store the columns people actually read. Faster queries, fewer cloud requests, less disk waste.
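For readers who want the trick in miniature, here's a hedged sketch (not Amplitude's code) of the sparse-file idea: truncate a local cache file to the full object's size, then write only the byte ranges actually fetched from S3, so the file is logically huge but allocates almost no disk. The path, object size, and offsets below are invented, and it assumes a filesystem with sparse-file support (ext4, XFS, APFS, and similar).

```python
import os

# Hypothetical cache path and object size, for illustration only.
CACHE_PATH = "/nvme-cache/part-0001.bin"
OBJECT_SIZE = 1 << 30  # pretend the S3 object is 1 GiB

def cache_range(path: str, offset: int, data: bytes) -> None:
    """Store one byte range fetched from S3 at its original offset.

    Truncating to the full object size creates a file that is logically
    1 GiB but contains only holes; pwrite then allocates real disk blocks
    just for the range we actually fetched.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.truncate(fd, OBJECT_SIZE)   # logical size: whole object, zero data blocks
        os.pwrite(fd, data, offset)    # physical allocation: only this range
    finally:
        os.close(fd)

if __name__ == "__main__":
    # Cache a 4 KiB "column chunk" that lives 256 MiB into the object.
    cache_range(CACHE_PATH, offset=256 * 1024 * 1024, data=b"\x01" * 4096)

    st = os.stat(CACHE_PATH)
    print("logical size  :", st.st_size)           # ~1 GiB
    print("disk allocated:", st.st_blocks * 512)   # a few KiB on a sparse-capable FS
```

One catch worth noting: reading an un-cached range of a sparse file just returns zeros, so the cache still needs its own record of which blocks are really present, which is where the RocksDB bookkeeping in the post comes in.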
Then the comments lit up. The loudest chorus? “Why not just compress it?” One snarky zinger, from avmich, basically boiled it down to: why invent a clever cache when filesystem-wide compression exists—cue the “use ArchWiki” crowd flexing receipts. Defenders fired back that compression and sparsity do different jobs: one shrinks what you have; the other avoids storing what you don’t need. Pragmatists cheered the reduced cloud bill and simpler reads. Purists cried “complexity creep” and “just buy more SSDs.”
Meanwhile, the meme machine went wild: “ghost bytes,” “diet files,” and “Marie Kondo storage—only keep what sparks joy.” Tech nuance met comment-section chaos, and somehow everyone’s right—and very online.
Key Points
- Amplitude caches columnar analytics data from Amazon S3 onto local NVMe SSDs to reduce S3 fetch costs and latency.
- Caching entire files wastes space; per-column files reduce waste but create excessive file counts and metadata overhead, especially with many small columns.
- Sparse files provide a middle ground: logically cache full files while physically storing only the logical blocks containing requested columns.
- A local RocksDB tracks which logical blocks are present and their last-read times to approximate an LRU eviction policy (sketched after this list).
- Logical block sizes are variable, with smaller blocks near metadata headers (similar to Parquet) and larger blocks elsewhere to align with access patterns (see the block-size sketch below).
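To make the LRU side concrete, here's a hedged sketch of the bookkeeping the post describes: a per-block last-read table (a plain dict stands in for the on-box RocksDB) plus eviction that "deletes" cold blocks by punching holes, so files keep their logical size while handing disk back. The block size, disk budget, and the fallocate-via-ctypes plumbing are all assumptions, and hole punching as written is Linux-only.

```python
import ctypes, ctypes.util, os, time

# Linux hole punching via fallocate(2); flag values from <linux/falloc.h>.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_int64, ctypes.c_int64]
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02

BLOCK_SIZE = 4 * 1024 * 1024     # hypothetical logical block size
DISK_BUDGET = 64 * 1024 * 1024   # hypothetical cache budget, in bytes

# A plain dict stands in for the local RocksDB:
# (cache file path, block offset) -> last read time.
block_last_read: dict[tuple[str, int], float] = {}

def record_read(path: str, offset: int) -> None:
    """Called on every cache hit so eviction can approximate LRU."""
    block_last_read[(path, offset)] = time.time()

def punch_hole(path: str, offset: int, length: int) -> None:
    """Deallocate one block's disk space without shrinking the file."""
    fd = os.open(path, os.O_RDWR)
    try:
        ret = libc.fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             offset, length)
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)

def evict_until_under_budget(used_bytes: int) -> int:
    """Punch holes in the least-recently-read blocks until we fit the budget."""
    for (path, offset), _ in sorted(block_last_read.items(), key=lambda kv: kv[1]):
        if used_bytes <= DISK_BUDGET:
            break
        punch_hole(path, offset, BLOCK_SIZE)
        del block_last_read[(path, offset)]
        used_bytes -= BLOCK_SIZE
    return used_bytes
```

A real deployment would persist the table in RocksDB so it survives restarts; the dict is just the smallest thing that shows the shape of the bookkeeping.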
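The variable block sizes from the last bullet fit in a few lines too. Here's a hedged sketch of one possible offset-to-block mapping: small blocks over the first stretch of the file where metadata tends to live, larger blocks for the column data after it. All three size constants are invented for illustration; the post doesn't publish Amplitude's actual numbers.

```python
# Hypothetical block-size schedule: fine-grained blocks near the metadata
# region at the front of the file, coarse blocks for bulk column data.
METADATA_REGION = 1 * 1024 * 1024   # first 1 MiB (invented cutoff)
SMALL_BLOCK = 64 * 1024             # 64 KiB blocks for metadata-ish reads
LARGE_BLOCK = 8 * 1024 * 1024       # 8 MiB blocks for column scans

def block_range(offset: int) -> tuple[int, int]:
    """Map a byte offset to (block_start, block_size) under the schedule above."""
    if offset < METADATA_REGION:
        start = (offset // SMALL_BLOCK) * SMALL_BLOCK
        return start, SMALL_BLOCK
    rel = offset - METADATA_REGION
    start = METADATA_REGION + (rel // LARGE_BLOCK) * LARGE_BLOCK
    return start, LARGE_BLOCK

print(block_range(4_000))               # a 64 KiB block at the front of the file
print(block_range(300 * 1024 * 1024))   # an 8 MiB block deep in the column data
```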