March 5, 2026
Popcorn for the CPU vs GPU wars
Optimizing Recommendation Systems with JDK's Vector API
Netflix squeezes its 'surprise me' scores as commenters yell 'just use GPUs'
TLDR: Netflix cut CPU use by batching its “surprise me” scoring into big matrix math, shrinking its server fleet. Comments split between jokes about binge maniacs driving half the work and a loud “just use GPUs” chorus, with many noting memory speed, not math, is the real bottleneck.
Netflix engineers say their “serendipity” score, the part that decides how different a new show is from what you’ve been watching, was hogging about 7.5% of CPU time. So they batched the work, rearranged memory, and turned tons of small comparisons into one big matrix multiply, cutting CPU per request and shrinking the server fleet. Sounds neat, right? The comments erupted. Big mood: “Why optimize for binge monsters?” One user mocked the stat that just 2% of requests are huge batches but somehow account for around half the workload, joking that the most surprising recommendation would be to cancel those power-watchers altogether. Meanwhile, the GPU gang rolled in yelling “just use graphics cards!”, with one bragging about a 100x speed-up before admitting the real villain is feeding data into memory fast enough. Classic internet: a CPU vs GPU cage match, with Java nerds flexing clever batching and CUDA cowboys insisting brute force wins. The peanut gallery turned “cosine similarity” (comparing show vibes) and “embeddings” (show fingerprints) into meme fodder: batch bros versus single-video stans, cache locality versus “throw money at it.” In short: Netflix saved compute, the crowd brought the drama, and everyone learned that memory is the final boss. (Source: Netflix Tech Blog)
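For anyone who wants the meme terms made concrete: an “embedding” is just a vector of numbers (the show’s fingerprint), and cosine similarity is the dot product of two embeddings divided by their lengths. A minimal plain-Java sketch, with toy vectors invented for illustration (real embeddings are much larger, and this is not Netflix’s code):

```java
public class CosineDemo {
    // Cosine similarity: dot(a, b) / (|a| * |b|). Near 1 = very similar vibes,
    // near 0 = unrelated. A per-pair call like this is what gets computed
    // M x N times in the naive nested-loop approach.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            na  += a[k] * a[k];
            nb  += b[k] * b[k];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Made-up 3-number "show fingerprints" for illustration only.
        double[] crimeDoc  = {0.9, 0.1, 0.0};
        double[] trueCrime = {0.8, 0.2, 0.1};
        double[] sitcom    = {0.0, 0.1, 0.9};
        System.out.printf("crimeDoc vs trueCrime: %.3f%n", cosine(crimeDoc, trueCrime));
        System.out.printf("crimeDoc vs sitcom:    %.3f%n", cosine(crimeDoc, sitcom));
    }
}
```

Two crime shows score close to 1 (similar), while the sitcom scores near 0 (high “serendipity” relative to a crime-heavy history).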
Key Points
- Serendipity scoring in Netflix’s Ranker consumed about 7.5% of per-node CPU, dominated by Java dot products.
- The original approach used nested loops computing cosine similarity for M candidates against N history items (O(M×N)).
- Traffic shape showed 98% single-video requests, but the 2% large batches accounted for ~50% of total videos processed.
- Optimization Step 1 batched the work by packing embeddings into matrices and computing C = A × B^T after row normalization.
- The batched, matrix-based approach produced the same scores with lower CPU per request and a reduced cluster footprint.
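The batching step above can be sketched in plain Java: row-normalize the M×d candidate matrix A and the N×d history matrix B, then one multiply C = A × Bᵀ produces all M×N cosine scores at once. This is a minimal illustration of the technique under made-up data and names, not Netflix’s actual implementation (which, per the title, leans on the JDK Vector API):

```java
public class BatchedSerendipityDemo {

    // L2-normalize every row so that plain dot products become cosine scores.
    public static double[][] rowNormalize(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            double norm = 0;
            for (double v : x[i]) norm += v * v;
            norm = Math.sqrt(norm);
            out[i] = new double[x[i].length];
            for (int k = 0; k < x[i].length; k++) out[i][k] = x[i][k] / norm;
        }
        return out;
    }

    // C = A x B^T: c[i][j] is the dot product of row i of A with row j of B.
    // One dense, cache-friendly pass instead of M x N scattered small calls.
    public static double[][] matmulT(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b.length];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++) {
                double dot = 0;
                for (int k = 0; k < a[i].length; k++) dot += a[i][k] * b[j][k];
                c[i][j] = dot;
            }
        return c;
    }

    public static void main(String[] args) {
        // Toy data: M=2 candidate embeddings vs N=3 history embeddings.
        double[][] candidates = {{1, 2, 3}, {4, 5, 6}};
        double[][] history = {{1, 0, 0}, {0, 1, 0}, {1, 1, 1}};

        double[][] scores = matmulT(rowNormalize(candidates), rowNormalize(history));

        // Cross-check one entry against the per-pair formula the old nested
        // loops would have used: same score, just computed pair by pair.
        double dot = 0, na = 0, nb = 0;
        for (int k = 0; k < 3; k++) {
            dot += candidates[0][k] * history[2][k];
            na  += candidates[0][k] * candidates[0][k];
            nb  += history[2][k] * history[2][k];
        }
        double pairwise = dot / (Math.sqrt(na) * Math.sqrt(nb));
        if (Math.abs(scores[0][2] - pairwise) > 1e-12)
            throw new AssertionError("batched and pairwise scores differ");
        System.out.println("batched matrix multiply reproduces pairwise cosine scores");
    }
}
```

The payoff claimed in the post isn’t fewer arithmetic operations (the flop count is the same); it’s memory layout: the matrix form walks embeddings sequentially and reuses them from cache, which is exactly why commenters kept pointing at memory bandwidth as the real bottleneck.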