Ollama is now powered by MLX on Apple Silicon in preview

Apple fans celebrate – and roast – Ollama’s new “faster but still not for peasants” Mac upgrade

TLDR: Ollama’s new update makes AI assistants run much faster directly on high‑end Apple Macs, promising private and powerful local chatbots and coding helpers. The crowd loves the direction but is split between excited power users, frustrated 16 GB Mac owners, and skeptics who say it’s just “faster but dumber” AI in disguise.

Ollama just plugged itself straight into Apple’s own brain tech on Macs, promising super‑fast on‑device AI helpers, but the real show is in the comments section. One camp is hyped, calling this the future: private chatbots and coding assistants running right on your laptop instead of some mystery server farm. User babblingfish basically declared, “local AI is the endgame,” arguing it’s safer, greener, and good enough for most people who don’t need sci‑fi level intelligence.

Then come the have‑nots with 16 GB of memory, staring through the glass like it’s the VIP section. dial9-1 sums up their pain, still “waiting for the day” they can run these tools on a normal Mac, while Ollama casually says you really want more than 32 GB. Cue the meme: “Cool feature, shame about my bank account.” Power users like LuxBennu flex in the corner, bragging they already run giant models on souped‑up Macs and are just here to compare Ollama’s new setup to their existing hacks.

The spiciest drama? Whether this “faster” AI is just marketing for smaller, dumber models. AugSun sarcastically translates the announcement as, “We can run your dumbed down models faster,” poking at claims that shrinking models “barely” hurts quality. The vibe: exciting progress, but with a side of class warfare and healthy skepticism over how much intelligence is being traded for speed.

Key Points

  • Ollama released a preview that integrates Apple’s MLX framework to accelerate local LLM inference on Apple Silicon Macs.
  • Testing with Alibaba’s Qwen3.5-35B-A3B (NVFP4) showed significant speedups versus the prior Q4_K_M setup on Ollama 0.18; Ollama 0.19 targets ~1,851 tokens/s prefill and ~134 tokens/s decode with int4.
  • On Apple’s M5, M5 Pro, and M5 Max chips, GPU Neural Accelerators improve time to first token and tokens-per-second.
  • Ollama now supports NVIDIA’s NVFP4 format, which reduces memory and storage needs while maintaining accuracy and keeps results consistent with NVIDIA’s Model Optimizer and production deployments.
  • Caching has been upgraded with reuse across conversations, intelligent checkpoints, and smarter eviction. The preview requires a Mac with more than 32 GB of unified memory and focuses on the coding-tuned Qwen3.5-35B-A3B model (a minimal usage sketch follows below).
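
For readers who want to poke at the preview themselves, here is a minimal sketch of querying a locally running Ollama server from Python over its standard REST endpoint (/api/generate). It assumes the MLX preview build is installed and the server is already running, and it uses the model tag "qwen3.5:35b-a3b" purely for illustration; the actual tag shipped with the preview may differ.

```python
# Minimal sketch: one non-streaming request to a local Ollama server.
# The /api/generate endpoint and payload shape are Ollama's documented
# REST interface; the model tag below is an assumption for illustration.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(prompt: str, model: str = "qwen3.5:35b-a3b") -> str:
    """Send a single generation request and return the response text."""
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a token stream
    }).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]

if __name__ == "__main__":
    print(generate("Write a Swift function that reverses a string."))
```

As a rough sanity check on the memory requirement: 35 billion parameters at 4 bits per weight is about 17.5 GB for the weights alone, before the KV cache, the operating system, and whatever else is open, which is why the preview asks for more than 32 GB of unified memory rather than fitting on a 16 GB machine.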

Hottest takes

"LLMs on device is the future. It’s more secure" — babblingfish
"still waiting for the day I can comfortably run Claude Code... with only 16gb of ram" — dial9-1
"We can run your dumbed down models faster" — AugSun
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.