Computing Sharding with Einsum

Dev’s math hack to split giant models ignites comment war

TLDR: A developer argues for using einsum, a compact math notation, to reason about splitting big AI models across machines and to derive the gradient (backward) computations. Comments erupted: some praised the clarity, others called it unreadable and said tools should handle it, highlighting a broader fight over math-first vs tool-first engineering approaches.

A researcher says there’s a faster way to think about splitting huge AI models across machines: use einsum, a compact math shorthand, to reason about sharding and even the backward training step without memorizing tricky transposes. The crowd immediately split, ironically, into camps. Fans cheered the elegance: write something like “bi,oi->bo” and just swap the index sets to get gradients, no guesswork. Skeptics rolled their eyes: those symbols look like alien runes, and real engineers just draw boxes or lean on high-level tools. One commenter joked they now “dream in ‘ij,jk->ik’,” while another posted a meme of a decoder ring next to the einsum docs. Gatekeeping allegations flew, with newbies pleading for pizza-slice explanations of sharding (“it’s just dividing the data pie”) and purists insisting math literacy beats clicky diagrams. The pragmatic crew chimed in: let the compiler and frameworks handle it, you just write code; no need to cosplay as Einstein. Still, the stakes are big: getting sharding right means fewer bugs and faster training on monster models, whether you sketch shards or type tiny letters. For deep-learning tinkerers working with DTensor, this post became the latest battleground of clarity vs. cleverness, with memes, math, and mild chaos.
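To make that “just swap the index sets” line concrete, here’s a minimal PyTorch sketch (ours, not the article’s; the shapes are made up) that writes nn.Linear’s forward pass as the einsum “bi,oi->bo”, derives both gradients by exchanging one input’s index string with the output’s, and cross-checks the result against autograd:

    import torch

    # nn.Linear forward as an einsum: x is (batch, in), w is (out, in).
    x = torch.randn(4, 8, requires_grad=True)
    w = torch.randn(16, 8, requires_grad=True)
    y = torch.einsum("bi,oi->bo", x, w)

    # Backward "swap" rule: to get the gradient w.r.t. one input, exchange that
    # input's index string with the output's in the same einsum.
    grad_y = torch.randn_like(y)
    grad_x = torch.einsum("bo,oi->bi", grad_y, w)  # "bi" swapped with "bo"
    grad_w = torch.einsum("bi,bo->oi", x, grad_y)  # "oi" swapped with "bo"

    # Cross-check the hand-derived gradients against autograd.
    y.backward(grad_y)
    assert torch.allclose(grad_x, x.grad, atol=1e-5)
    assert torch.allclose(grad_w, w.grad, atol=1e-5)

No memorized transpose rules anywhere; the letters do the bookkeeping.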

Key Points

  • The article advocates using einsum notation to efficiently reason about sharding in distributed tensor computations.
  • It defines contraction and free dimensions within einsum and shows how to map common operations such as matrix multiply and nn.Linear onto einsum expressions.
  • A simple backward rule is provided: swap the target input’s index set with the output’s index set to obtain the einsum that computes that input’s gradient.
  • Sharding rules illustrated include replication propagating to a replicated output, batch-dimension sharding propagating to the output, and free-dimension sharding requiring the other inputs to be replicated (broadcast); see the single-process sketch after this list.
  • The discussion of contraction-dimension sharding begins but the content is truncated, indicating further rules were intended.
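For the sharding bullets above, here’s a hedged single-process sketch (again ours, not the article’s): it fakes sharding by chunking plain tensors instead of using the DTensor API, and checks how each placement propagates through the same “bi,oi->bo” einsum. The contraction-dimension case is filled in from standard distributed-matmul reasoning, since the article cuts off there.

    import torch

    x = torch.randn(4, 8)    # (batch, in), replicated by default
    w = torch.randn(16, 8)   # (out, in), replicated by default
    full = torch.einsum("bi,oi->bo", x, w)  # replicated inputs -> replicated output

    # Shard the batch dimension "b" of x, keep w replicated:
    # each shard computes its own slice of the output, so the output is sharded on "b" too.
    out_b = torch.cat(
        [torch.einsum("bi,oi->bo", xs, w) for xs in x.chunk(2, dim=0)], dim=0)
    assert torch.allclose(out_b, full, atol=1e-5)

    # Shard the free dimension "o" of w, keep x replicated (broadcast):
    # the output comes out sharded on "o".
    out_o = torch.cat(
        [torch.einsum("bi,oi->bo", x, ws) for ws in w.chunk(2, dim=0)], dim=1)
    assert torch.allclose(out_o, full, atol=1e-5)

    # Shard the contraction dimension "i" of both inputs:
    # each shard holds only a partial sum, so a reduction (all-reduce) is needed.
    out_i = sum(torch.einsum("bi,oi->bo", xi, wi)
                for xi, wi in zip(x.chunk(2, dim=1), w.chunk(2, dim=1)))
    assert torch.allclose(out_i, full, atol=1e-5)

Batch and free dimensions just flow through to the output; the truncated contraction case would presumably end with that reduction step.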

Hottest takes

"I, too, dream in 'ij,jk->ik'—finally someone said it out loud" — tensorFan
"If I need a decoder ring to read your code, your 'shortcut' isn't a shortcut" — boxDrawer
"Real engineers let the compiler handle sharding; this is cosplay math" — prodOpsGuy