March 22, 2026
SSDs scream, dreams beam
Flash-MoE: Running a 397B Parameter Model on a Mac with 48GB RAM
MacBook runs a giant AI—but Reddit is split between awe and eye-rolls
TLDR: A dev ran a massive 397B-parameter AI on a 48GB MacBook by streaming the model from the SSD, hitting around 5–7 tokens per second. The community is split between cheering the feat and doubting real-world usability, with debates over SSD limits and calls for solid benchmarks.
A developer claims they ran a colossal 397-billion-parameter AI on a 48GB MacBook Pro, streaming the model's expert weights from the laptop's SSD (120GB after 2-bit compression, down from 209GB at 4-bit) and spitting out text at about 5–7 tokens per second. Translation: the laptop only loads the bits it needs on the fly, thanks to aggressive compression and custom code written for Apple's Metal graphics tech.

Cue community fireworks. Over on /r/LocalLLaMA, the thread is buzzing. The hype squad is shouting “this is the way” and demanding benchmarks, while skeptics slam the brakes: “Mac users should not get too excited,” warning the speed (tokens per second, aka how fast text comes out) isn't great for real use. The SSD takes center stage in the drama: some say it's the bottleneck, puzzling over the reported read speeds, while others dream of a Linux path that uses memory instead of the drive; one even jokes about the triumphant return of AI on ROMs, like retro game cartridges.

Meanwhile, memes abound: people dubbing it a “MacBook data center,” joking the SSD is “screaming,” and quipping “no Python, only pain” as the project flexes pure C and hand-tuned shaders. It's a rare mix of engineering flex and comment-section chaos, and we're here for it.
Key Points
- Flash-MoE runs the 397B-parameter Qwen3.5-397B-A17B MoE model on a 48GB MacBook Pro by streaming expert weights from SSD.
- 2-bit expert quantization cuts expert storage to 120GB (from 209GB at 4-bit) and sustains 5.55 tokens/sec; peak warm-token speed is 7.05 tokens/sec.
- The architecture has 60 layers (45 GatedDeltaNet + 15 full attention) and 512 experts per layer, with K=4 active per token plus a shared expert.
- Performance relies on custom Metal kernels, Accelerate BLAS for linear attention, deferred GPU execution, and F_NOCACHE to bypass the page cache.
- The per-layer pipeline averages 3.14ms (2-bit) across GPU, CPU, and I/O stages, with on-demand parallel pread() of the K=4 active experts (~3.9MB each).
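To make the "512 experts, K=4 active per token" point concrete, here is a minimal top-K router sketch in C. The function name, the selection scan, and softmaxing only the selected logits are illustrative assumptions, not Flash-MoE's actual routing code.

```c
#include <math.h>

#define NUM_EXPERTS 512   /* experts per MoE layer */
#define TOP_K 4           /* active experts per token */

/* Given the router's 512 logits for one token, pick the TOP_K
 * highest-scoring experts (simple selection scan) and softmax
 * their logits into mixing weights. A real forward pass would
 * also add the shared expert's output unconditionally. */
void route_top_k(const float logits[NUM_EXPERTS],
                 int idx[TOP_K], float weight[TOP_K]) {
    int taken[NUM_EXPERTS] = {0};
    for (int k = 0; k < TOP_K; k++) {
        int best = -1;
        for (int e = 0; e < NUM_EXPERTS; e++)
            if (!taken[e] && (best < 0 || logits[e] > logits[best]))
                best = e;
        taken[best] = 1;
        idx[k] = best;
    }
    /* softmax over the selected logits only (subtract max for stability) */
    float maxl = logits[idx[0]], sum = 0.0f;
    for (int k = 0; k < TOP_K; k++) {
        weight[k] = expf(logits[idx[k]] - maxl);
        sum += weight[k];
    }
    for (int k = 0; k < TOP_K; k++)
        weight[k] /= sum;
}
```

Only the four chosen experts' weights ever need to touch RAM, which is what makes the streaming approach viable at all: ~17B of the 397B parameters are active per token.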
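The 2-bit quantization that shrinks expert storage from 209GB to 120GB can be pictured as packing four 2-bit codes per byte with a per-block scale. This is a generic sketch of the idea; Flash-MoE's actual code layout, level values, and block size are not documented here and are assumptions.

```c
#include <stdint.h>

#define QBLOCK 32   /* values sharing one scale (assumed block size) */

/* Four symmetric levels for the 2-bit codes (assumed codebook). */
static const float kLevels[4] = { -1.5f, -0.5f, 0.5f, 1.5f };

/* Unpack n 2-bit codes (four per byte, LSB-first) and rescale them
 * with the per-block scale to recover approximate float weights. */
void dequant_2bit(const uint8_t *packed, const float *scales,
                  float *out, int n) {
    for (int i = 0; i < n; i++) {
        uint8_t byte = packed[i >> 2];
        int code = (byte >> ((i & 3) * 2)) & 0x3;
        out[i] = kLevels[code] * scales[i / QBLOCK];
    }
}
```

At 2 bits plus a small scale overhead, each weight costs roughly a quarter of a 8-bit byte, which is how ~3.9MB experts and a 120GB total become possible.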
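The I/O side, "on-demand parallel pread() of the K=4 experts with F_NOCACHE," can be sketched as below. F_NOCACHE is a real macOS fcntl() flag that tells the kernel not to cache the file's pages (pointless for weights read once per token); it is guarded with #ifdef so the sketch compiles elsewhere. The file layout, struct names, and the tiny demo expert size are assumptions for illustration.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TOP_K 4
#define EXPERT_BYTES 4096   /* real experts are ~3.9MB; tiny here for demo */

typedef struct {
    int fd;
    off_t offset;
    unsigned char *dst;
} fetch_job;

/* Thread body: one positioned read per expert; pread() needs no
 * shared file offset, so the four reads can run concurrently. */
static void *fetch_expert(void *arg) {
    fetch_job *j = (fetch_job *)arg;
    ssize_t n = pread(j->fd, j->dst, EXPERT_BYTES, j->offset);
    return (void *)(intptr_t)n;
}

/* Fetch the K=4 active experts for one layer from a single weights
 * file laid out as expert_id * EXPERT_BYTES (assumed layout).
 * Returns how many experts were read in full. */
int load_experts(const char *path, const int expert_id[TOP_K],
                 unsigned char buf[TOP_K][EXPERT_BYTES]) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
#ifdef F_NOCACHE
    fcntl(fd, F_NOCACHE, 1);   /* macOS: don't fill RAM with streamed weights */
#endif
    pthread_t tid[TOP_K];
    fetch_job job[TOP_K];
    for (int k = 0; k < TOP_K; k++) {
        job[k] = (fetch_job){ fd, (off_t)expert_id[k] * EXPERT_BYTES, buf[k] };
        pthread_create(&tid[k], NULL, fetch_expert, &job[k]);
    }
    int ok = 0;
    for (int k = 0; k < TOP_K; k++) {
        void *ret;
        pthread_join(tid[k], &ret);
        if ((ssize_t)(intptr_t)ret == EXPERT_BYTES) ok++;
    }
    close(fd);
    return ok;
}
```

Bypassing the page cache sounds counterintuitive, but with 120GB of experts and 48GB of RAM the cache would thrash constantly; uncached reads keep memory free for activations and the resident non-expert weights.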
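As a sanity check, the reported figures hang together. A back-of-the-envelope sketch, assuming decimal gigabytes and a strictly serial layer pipeline:

```c
/* 120GB of 2-bit expert storage spread over 60 layers x 512 experts. */
double expert_size_mb(void) {
    return 120e9 / (60.0 * 512.0) / 1e6;   /* ~3.9 MB per expert */
}

/* Serial decode: 60 layers at the reported 3.14 ms per-layer average. */
double implied_tokens_per_sec(void) {
    return 1000.0 / (60.0 * 3.14);         /* ~5.3 tokens/sec */
}
```

The ~3.9MB per expert matches the reported pread() size, and the implied ~5.3 tokens/sec sits just under the reported 5.55, which hints that some I/O overlaps compute rather than everything running strictly back to back.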