January 29, 2026
When symbols hit the fan
Computing Sharding with Einsum
Dev’s math hack to split giant models ignites comment war
TLDR: A developer pushes using einsum, a compact math notation, to reason about splitting big AI models across machines and deriving their gradients. Comments erupt: some praise the clarity, others call it unreadable and say tools should handle it, highlighting a broader fight over math-first vs tool-first engineering approaches.
A researcher says there’s a faster way to think about splitting huge AI models across machines: use einsum, a compact math shorthand, to reason about sharding and even the “backwards” training step without memorizing tricky flips. The crowd immediately split—ironically—into camps. Fans cheered the elegance: write something like “bi,oi->bo” and just swap the index sets to get gradients, no guesswork. Skeptics rolled their eyes: those symbols look like alien runes, and real engineers just draw boxes or lean on high‑level tools. One commenter joked they now “dream in ‘ij,jk->ik’,” while another posted a meme of a decoder ring next to the einsum docs. Gatekeeping allegations flew, with newbies pleading for pizza-slice explanations of sharding (“it’s just dividing the data pie”), and purists insisting math literacy beats clicky diagrams. The pragmatic crew chimed in: let the compiler and frameworks handle it, you just write code—no need to cosplay as Einstein. Still, the stakes are big: getting sharding right means fewer bugs and faster training on monster models, whether you sketch shards or type tiny letters. For deep-learning tinkerers working with DTensor, this post became the latest battleground of clarity vs cleverness, with memes, math, and mild chaos.
Key Points
- The article advocates using einsum notation to reason efficiently about sharding in distributed tensor computations.
- It defines contraction and free dimensions within einsum and shows how to map common operations such as matrix multiply and nn.Linear (see the first sketch after this list).
- A simple backward rule is provided: swap the target input's indices with the output's indices to compute that input's gradient via einsum (second sketch below).
- Sharding rules illustrated include replication propagating to a replicated output, batch-dimension sharding propagating to the output, and free-dimension sharding requiring the broadcasted input to be replicated (third sketch below).
- The discussion of contraction-dimension sharding begins, but the content is truncated, indicating further rules were intended.
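The einsum mapping in the second point can be checked directly in PyTorch. A minimal sketch (not the article's code), using index letters b, i, o for batch, in_features, and out_features: "i" is the contraction dimension because it appears in both inputs but not the output, while "b" and "o" are free dimensions.

```python
import torch

b, i, o = 4, 8, 16
x = torch.randn(b, i)   # activations
w = torch.randn(o, i)   # nn.Linear stores its weight as (out_features, in_features)

# Plain matrix multiply A @ B as einsum: contract over the shared index "j".
a = torch.randn(3, 5)
m = torch.randn(5, 7)
assert torch.allclose(torch.einsum("ij,jk->ik", a, m), a @ m)

# nn.Linear forward (bias omitted) as einsum: contract over "i",
# leaving the free dimensions "b" and "o" in the output.
y = torch.einsum("bi,oi->bo", x, w)
assert torch.allclose(y, x @ w.t())
```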
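The backward rule from the third point, sketched under the same assumptions: with the forward pass written as einsum("bi,oi->bo", x, w), swapping the target input's index string with the output's gives the einsum for that input's gradient. The comparison against autograd below is illustrative, not taken from the article.

```python
import torch

b, i, o = 4, 8, 16
x = torch.randn(b, i, requires_grad=True)
w = torch.randn(o, i, requires_grad=True)

# Forward: y = einsum("bi,oi->bo", x, w), then backprop an arbitrary grad_y.
y = torch.einsum("bi,oi->bo", x, w)
grad_y = torch.randn_like(y)
y.backward(grad_y)

# grad_x: swap x's indices "bi" with the output's "bo"  ->  "bo,oi->bi"
grad_x = torch.einsum("bo,oi->bi", grad_y, w)
# grad_w: swap w's indices "oi" with the output's "bo"  ->  "bi,bo->oi"
grad_w = torch.einsum("bi,bo->oi", x, grad_y)

assert torch.allclose(grad_x, x.grad)
assert torch.allclose(grad_w, w.grad)
```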
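The sharding rules in the fourth point can be simulated in a single process by treating chunks of a tensor as shards, a stand-in for what DTensor does across devices. This sketch checks that sharding the batch dimension of x, or the free dimension of w, propagates to the output, provided the other operand is replicated (i.e., passed whole to every shard).

```python
import torch

b, i, o, shards = 4, 8, 16, 2
x = torch.randn(b, i)
w = torch.randn(o, i)
full = torch.einsum("bi,oi->bo", x, w)

# Rule: shard x along batch dim "b", replicate w -> output is sharded along "b".
parts = [torch.einsum("bi,oi->bo", xs, w) for xs in x.chunk(shards, dim=0)]
assert torch.allclose(torch.cat(parts, dim=0), full)

# Rule: shard w along free dim "o", replicate x -> output is sharded along "o".
parts = [torch.einsum("bi,oi->bo", x, ws) for ws in w.chunk(shards, dim=0)]
assert torch.allclose(torch.cat(parts, dim=1), full)
```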