March 22, 2026
SSDs scream, dreams beam
Flash-MoE: Running a 397B Parameter Model on a Mac with 48GB RAM
MacBook runs a giant AI—but Reddit is split between awe and eye-rolls
TLDR: A dev ran a massive 397B-parameter AI on a 48GB MacBook by streaming the model from the SSD, hitting around 5–7 tokens per second. The community is split between cheering the feat and doubting real-world usability, with debates over SSD limits and calls for solid benchmarks.
A developer claims they ran a colossal 397-billion-parameter AI on a 48GB MacBook Pro, streaming the model's expert weights from the laptop's SSD (120GB after 2-bit compression, down from 209GB at 4-bit) and spitting out text at about 5–7 tokens per second. Translation: the laptop only loads the bits it needs on the fly, thanks to aggressive compression and custom code written for Apple's Metal graphics tech.

Cue community fireworks. Over on /r/LocalLLaMA, the thread is buzzing. The hype squad is shouting “this is the way” and demanding benchmarks, while skeptics slam the brakes: “Mac users should not get too excited,” warning the speed (tokens per second, aka how fast text comes out) isn't great for real use. The SSD takes center stage in the drama: some say it's the bottleneck, puzzling over the reported read speeds, while others dream of a Linux path that uses memory instead of the drive; one even jokes about the triumphant return of AI on ROMs, like retro game cartridges.

Meanwhile, memes abound: people dubbing it a “MacBook data center,” joking the SSD is “screaming,” and quipping “no Python, only pain” as the project flexes pure C and hand-tuned shaders. It's a rare mix of engineering flex and comment-section chaos, and we're here for it.
Key Points
- Flash-MoE runs the 397B-parameter Qwen3.5-397B-A17B MoE model on a 48GB MacBook Pro by streaming expert weights from SSD.
- 2-bit expert quantization cuts expert storage to 120GB (from 209GB at 4-bit) and sustains 5.55 tokens/sec; peak warm-token speed is 7.05 tokens/sec.
- The architecture has 60 layers (45 GatedDeltaNet + 15 full attention) and 512 experts per layer, with K=4 active per token plus a shared expert.
- Performance relies on custom Metal kernels, Accelerate BLAS for linear attention, deferred GPU execution, and F_NOCACHE to bypass the page cache.
- The per-layer pipeline averages 3.14ms (2-bit) across GPU, CPU, and I/O stages, with on-demand parallel pread() of the K=4 active experts (~3.9MB each).
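To make the "512 experts, K=4 active per token" point concrete, here is a minimal top-K router sketch in C. The function name, the selection scan, and softmaxing only the selected logits are illustrative assumptions, not Flash-MoE's actual routing code.

```c
#include <math.h>

#define NUM_EXPERTS 512   /* experts per MoE layer */
#define TOP_K 4           /* active experts per token */

/* Given the router's 512 logits for one token, pick the TOP_K
 * highest-scoring experts (simple selection scan) and softmax
 * their logits into mixing weights. A real forward pass would
 * also add the shared expert's output unconditionally. */
void route_top_k(const float logits[NUM_EXPERTS],
                 int idx[TOP_K], float weight[TOP_K]) {
    int taken[NUM_EXPERTS] = {0};
    for (int k = 0; k < TOP_K; k++) {
        int best = -1;
        for (int e = 0; e < NUM_EXPERTS; e++)
            if (!taken[e] && (best < 0 || logits[e] > logits[best]))
                best = e;
        taken[best] = 1;
        idx[k] = best;
    }
    /* softmax over the selected logits only (subtract max for stability) */
    float maxl = logits[idx[0]], sum = 0.0f;
    for (int k = 0; k < TOP_K; k++) {
        weight[k] = expf(logits[idx[k]] - maxl);
        sum += weight[k];
    }
    for (int k = 0; k < TOP_K; k++)
        weight[k] /= sum;
}
```

Only the four chosen experts' weights ever need to touch RAM, which is what makes the streaming approach viable at all: ~17B of the 397B parameters are active per token.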
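The 2-bit quantization that shrinks expert storage from 209GB to 120GB can be pictured as packing four 2-bit codes per byte with a per-block scale. This is a generic sketch of the idea; Flash-MoE's actual code layout, level values, and block size are not documented here and are assumptions.

```c
#include <stdint.h>

#define QBLOCK 32   /* values sharing one scale (assumed block size) */

/* Four symmetric levels for the 2-bit codes (assumed codebook). */
static const float kLevels[4] = { -1.5f, -0.5f, 0.5f, 1.5f };

/* Unpack n 2-bit codes (four per byte, LSB-first) and rescale them
 * with the per-block scale to recover approximate float weights. */
void dequant_2bit(const uint8_t *packed, const float *scales,
                  float *out, int n) {
    for (int i = 0; i < n; i++) {
        uint8_t byte = packed[i >> 2];
        int code = (byte >> ((i & 3) * 2)) & 0x3;
        out[i] = kLevels[code] * scales[i / QBLOCK];
    }
}
```

At 2 bits plus a small scale overhead, each weight costs roughly a quarter of a 8-bit byte, which is how ~3.9MB experts and a 120GB total become possible.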
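The I/O side, "on-demand parallel pread() of the K=4 experts with F_NOCACHE," can be sketched as below. F_NOCACHE is a real macOS fcntl() flag that tells the kernel not to cache the file's pages (pointless for weights read once per token); it is guarded with #ifdef so the sketch compiles elsewhere. The file layout, struct names, and the tiny demo expert size are assumptions for illustration.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TOP_K 4
#define EXPERT_BYTES 4096   /* real experts are ~3.9MB; tiny here for demo */

typedef struct {
    int fd;
    off_t offset;
    unsigned char *dst;
} fetch_job;

/* Thread body: one positioned read per expert; pread() needs no
 * shared file offset, so the four reads can run concurrently. */
static void *fetch_expert(void *arg) {
    fetch_job *j = (fetch_job *)arg;
    ssize_t n = pread(j->fd, j->dst, EXPERT_BYTES, j->offset);
    return (void *)(intptr_t)n;
}

/* Fetch the K=4 active experts for one layer from a single weights
 * file laid out as expert_id * EXPERT_BYTES (assumed layout).
 * Returns how many experts were read in full. */
int load_experts(const char *path, const int expert_id[TOP_K],
                 unsigned char buf[TOP_K][EXPERT_BYTES]) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
#ifdef F_NOCACHE
    fcntl(fd, F_NOCACHE, 1);   /* macOS: don't fill RAM with streamed weights */
#endif
    pthread_t tid[TOP_K];
    fetch_job job[TOP_K];
    for (int k = 0; k < TOP_K; k++) {
        job[k] = (fetch_job){ fd, (off_t)expert_id[k] * EXPERT_BYTES, buf[k] };
        pthread_create(&tid[k], NULL, fetch_expert, &job[k]);
    }
    int ok = 0;
    for (int k = 0; k < TOP_K; k++) {
        void *ret;
        pthread_join(tid[k], &ret);
        if ((ssize_t)(intptr_t)ret == EXPERT_BYTES) ok++;
    }
    close(fd);
    return ok;
}
```

Bypassing the page cache sounds counterintuitive, but with 120GB of experts and 48GB of RAM the cache would thrash constantly; uncached reads keep memory free for activations and the resident non-expert weights.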
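As a sanity check, the reported figures hang together. A back-of-the-envelope sketch, assuming decimal gigabytes and a strictly serial layer pipeline:

```c
/* 120GB of 2-bit expert storage spread over 60 layers x 512 experts. */
double expert_size_mb(void) {
    return 120e9 / (60.0 * 512.0) / 1e6;   /* ~3.9 MB per expert */
}

/* Serial decode: 60 layers at the reported 3.14 ms per-layer average. */
double implied_tokens_per_sec(void) {
    return 1000.0 / (60.0 * 3.14);         /* ~5.3 tokens/sec */
}
```

The ~3.9MB per expert matches the reported pread() size, and the implied ~5.3 tokens/sec sits just under the reported 5.55, which hints that some I/O overlaps compute rather than everything running strictly back to back.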