Better Activation Functions for NNUE

Chess AI ‘Swish’ Sparks Nerd War: Is the Elo bump real?

TLDR: A chess engine swapped in a smoother “Swish” switch and, after penalizing noisy neurons, gained real strength on the board. The community is split: fans credit Swish, skeptics say the regularization trick did the heavy lifting, and memes are roasting the hand‑wavy explanations.

The neural network inside chess engine Viridithas just got a “Swish” glow‑up, and the comments section turned into a tech soap opera. The dev swapped the engine’s middle-layer “on/off switches” (called activation functions) for a smoother, trendy one named Swish, and after a rocky start—more lights turning on than the hardware loved—he “taxed” the noisy neurons to calm them down. Result: a cleaner evaluation scale and a real Elo boost, roughly +14 at blitz and +6 at longer games. Cue chaos.

The hype squad is yelling “Swish is the new meta!”, flexing graphs and the full write-up. Skeptics clap back: “It wasn’t Swish, it was the regularization—penalizing noisy activations did all the work.” Old-school engine fans demand weight clipping (keeping numbers in bounds), while machine learning folks roast the hand-wavy explanation for why the activations suddenly got denser. Performance nerds obsess over “sparsity,” aka keeping most lights off for speed, which dropped from roughly 70% to 50% before the fix.

Memes? Oh yes. “Swish swish, Elo fish,” “Hard-Swish sounds like an energy drink,” and “tax the rich neurons” are everywhere. Someone even launched a mini culture war: CReLU boomers vs Swish zoomers. And with a tease of “SwiGLU next?”, the crowd is already loading popcorn for the sequel.

Key Points

  • The author replaced SCReLU with Swish (Hard-Swish approximation, β=1/6) in L₁ and L₂ of Viridithas 19’s NNUE (the activation functions are sketched after this list).
  • Initial Hard-Swish training reduced block-sparsity in L₁ (from ~70% to ~50%), harming inference due to denser activations.
  • Adding an L1 norm penalty on L₀ outputs restored and slightly improved block-sparsity relative to the unregularized SCReLU baseline (see the training-step sketch after this list).
  • Swish-based networks showed smoother evaluation distributions and achieved Elo gains over SCReLU: +13.77 ± 5.04 (short TC) and +5.90 ± 3.11 (long TC).
  • State-of-the-art elsewhere favored CReLU (L₁) and SCReLU (L₂), and Viridithas’s lack of post-L₁ weight clipping may interact with activation choices.
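
To make the jargon concrete, here is a minimal NumPy sketch of the activation functions in play. This is not Viridithas’s actual code: the hard-swish shown is the common clamp-based form whose “hard sigmoid” has slope 1/6, which is presumably what the β=1/6 refers to, but the exact variant used in the real nets is an assumption here.

    import numpy as np

    def crelu(x):
        # Clipped ReLU: clamp activations to [0, 1]
        return np.clip(x, 0.0, 1.0)

    def screlu(x):
        # Squared clipped ReLU: clamp to [0, 1], then square
        return np.clip(x, 0.0, 1.0) ** 2

    def swish(x, beta=1.0):
        # Smooth gate: x * sigmoid(beta * x)
        return x / (1.0 + np.exp(-beta * x))

    def hard_swish(x):
        # Piecewise-linear stand-in for swish: x * clamp(x/6 + 1/2, 0, 1),
        # i.e. a "hard sigmoid" with slope 1/6 in place of the real sigmoid
        return x * np.clip(x / 6.0 + 0.5, 0.0, 1.0)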
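
And the “tax” on noisy neurons: an L1 penalty on the L₀ outputs added straight into the training loss, with the weight clipping that other engines use shown only for contrast. This is a hedged PyTorch sketch under assumed shapes and constants; the layer sizes, L1_LAMBDA, and CLIP_BOUND are illustrative, not the engine’s actual trainer or values.

    import torch
    import torch.nn.functional as F

    # Toy stand-in for the front of an NNUE-style net; sizes are illustrative.
    l0 = torch.nn.Linear(768, 256)   # "L0": feature transformer
    l1 = torch.nn.Linear(256, 1)     # "L1": next layer, collapsed to one output for brevity
    opt = torch.optim.Adam(list(l0.parameters()) + list(l1.parameters()), lr=1e-3)

    L1_LAMBDA = 1e-4     # strength of the activation "tax" (hypothetical value)
    CLIP_BOUND = 1.98    # weight-clipping bound (hypothetical value)

    def hard_swish(x):
        return x * torch.clamp(x / 6.0 + 0.5, 0.0, 1.0)

    def train_step(features, target_eval):
        opt.zero_grad()
        acts = hard_swish(l0(features))               # L0 outputs, post-activation
        pred = l1(acts)
        loss = F.mse_loss(pred, target_eval)          # main evaluation loss
        loss = loss + L1_LAMBDA * acts.abs().mean()   # L1 penalty: tax dense activations
        loss.backward()
        opt.step()
        with torch.no_grad():
            # Weight clipping (keeping weights in a fixed range); the write-up
            # notes Viridithas does not do this after L1, shown here for contrast.
            l1.weight.clamp_(-CLIP_BOUND, CLIP_BOUND)
        # Per-element sparsity of the L0 outputs (the engine actually exploits
        # block sparsity, i.e. whole groups of neurons being zero at once).
        sparsity = (acts == 0).float().mean().item()
        return loss.item(), sparsity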

Hottest takes

"Swish didn’t win Elo, the sparsity tax did" — quant_karen
"Hand-waving isn’t science—show the math or it’s just vibes" — proof_or_GTFO
"Swish is the new meta; CReLU boomers in shambles" — blitz_bro