Computing Sharding with Einsum

Dev’s math hack to split giant models ignites comment war

TLDR: A developer argues for using einsum, a compact math notation, to reason about splitting big AI models across machines and to derive the gradient (backward) computations. Comments erupted: some praised the clarity, others called it unreadable and said tools should handle it, highlighting a broader fight over math-first vs tool-first engineering approaches.

A researcher says there’s a faster way to think about splitting huge AI models across machines: use einsum, a compact math shorthand, to reason about sharding and even the backward training step without memorizing tricky transposes. The crowd immediately split, ironically, into camps. Fans cheered the elegance: write something like “bi,oi->bo” and just swap the index sets to get gradients, no guesswork. Skeptics rolled their eyes: those symbols look like alien runes, and real engineers just draw boxes or lean on high-level tools. One commenter joked they now “dream in ‘ij,jk->ik’,” while another posted a meme of a decoder ring next to the einsum docs. Gatekeeping allegations flew, with newbies pleading for pizza-slice explanations of sharding (“it’s just dividing the data pie”) and purists insisting math literacy beats clicky diagrams. The pragmatic crew chimed in: let the compiler and frameworks handle it, you just write code; no need to cosplay as Einstein. Still, the stakes are big: getting sharding right means fewer bugs and faster training on monster models, whether you sketch shards or type tiny letters. For deep-learning tinkerers working with DTensor, this post became the latest battleground of clarity vs. cleverness, with memes, math, and mild chaos.
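To make that “just swap the index sets” line concrete, here’s a minimal PyTorch sketch (ours, not the article’s; the shapes are made up) that writes nn.Linear’s forward pass as the einsum “bi,oi->bo”, derives both gradients by exchanging one input’s index string with the output’s, and cross-checks the result against autograd:

    import torch

    # nn.Linear forward as an einsum: x is (batch, in), w is (out, in).
    x = torch.randn(4, 8, requires_grad=True)
    w = torch.randn(16, 8, requires_grad=True)
    y = torch.einsum("bi,oi->bo", x, w)

    # Backward "swap" rule: to get the gradient w.r.t. one input, exchange that
    # input's index string with the output's in the same einsum.
    grad_y = torch.randn_like(y)
    grad_x = torch.einsum("bo,oi->bi", grad_y, w)  # "bi" swapped with "bo"
    grad_w = torch.einsum("bi,bo->oi", x, grad_y)  # "oi" swapped with "bo"

    # Cross-check the hand-derived gradients against autograd.
    y.backward(grad_y)
    assert torch.allclose(grad_x, x.grad, atol=1e-5)
    assert torch.allclose(grad_w, w.grad, atol=1e-5)

No memorized transpose rules anywhere; the letters do the bookkeeping.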

Key Points

  • The article advocates using einsum notation to efficiently reason about sharding in distributed tensor computations.
  • It defines contraction and free dimensions within einsum and shows how to map common operations such as matrix multiply and nn.Linear onto einsum expressions.
  • A simple backward rule is provided: swap the target input’s index set with the output’s index set to obtain the einsum that computes that input’s gradient.
  • Sharding rules illustrated include replication propagating to a replicated output, batch-dimension sharding propagating to the output, and free-dimension sharding requiring the other inputs to be replicated (broadcast); see the single-process sketch after this list.
  • The discussion of contraction-dimension sharding begins but the content is truncated, indicating further rules were intended.
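For the sharding bullets above, here’s a hedged single-process sketch (again ours, not the article’s): it fakes sharding by chunking plain tensors instead of using the DTensor API, and checks how each placement propagates through the same “bi,oi->bo” einsum. The contraction-dimension case is filled in from standard distributed-matmul reasoning, since the article cuts off there.

    import torch

    x = torch.randn(4, 8)    # (batch, in), replicated by default
    w = torch.randn(16, 8)   # (out, in), replicated by default
    full = torch.einsum("bi,oi->bo", x, w)  # replicated inputs -> replicated output

    # Shard the batch dimension "b" of x, keep w replicated:
    # each shard computes its own slice of the output, so the output is sharded on "b" too.
    out_b = torch.cat(
        [torch.einsum("bi,oi->bo", xs, w) for xs in x.chunk(2, dim=0)], dim=0)
    assert torch.allclose(out_b, full, atol=1e-5)

    # Shard the free dimension "o" of w, keep x replicated (broadcast):
    # the output comes out sharded on "o".
    out_o = torch.cat(
        [torch.einsum("bi,oi->bo", x, ws) for ws in w.chunk(2, dim=0)], dim=1)
    assert torch.allclose(out_o, full, atol=1e-5)

    # Shard the contraction dimension "i" of both inputs:
    # each shard holds only a partial sum, so a reduction (all-reduce) is needed.
    out_i = sum(torch.einsum("bi,oi->bo", xi, wi)
                for xi, wi in zip(x.chunk(2, dim=1), w.chunk(2, dim=1)))
    assert torch.allclose(out_i, full, atol=1e-5)

Batch and free dimensions just flow through to the output; the truncated contraction case would presumably end with that reduction step.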

Hottest takes

"I, too, dream in 'ij,jk->ik'—finally someone said it out loud" — tensorFan
"If I need a decoder ring to read your code, your 'shortcut' isn't a shortcut" — boxDrawer
"Real engineers let the compiler handle sharding; this is cosplay math" — prodOpsGuy