Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

Open‑source AI engine claims 3.9‑second starts — devs cheer while a Reddit thread vanishes

TLDR: ZSE, a new open‑source AI engine, claims ultra‑fast startup and big memory savings, thrilling builders who want many models on minimal hardware. Commenters cheer but press for benchmark details, and a removed Reddit thread adds intrigue, turning a speed claim into a full‑blown community spectacle.

ZSE just crashed the party promising sub‑5‑second “cold starts” — the time from launching a model to its first generated token — and the dev crowd is loud about it. The open‑source engine says it sips memory, squeezes huge models into small GPUs, and even offers modes like speed or ultra to match your hardware. Hype kicked off when one builder gushed that they’re trying to run 10 models on two GPUs and asked whether the flashy numbers assume an empty graphics card. Translation: the community wants receipts, not just sizzle. The repo’s here if you’re ready to poke it: github.com/Zyora-Dev/zse.

Then the plot twist: a shared Reddit thread was… “removed_by_moderator.” Cue jokes that ZSE hit 3.9 seconds to drama. Skeptics questioned the fine print — results verified on a pricey Nvidia A100 with fast SSDs — and wondered what happens on a humble laptop. Fans clapped back that a one‑time model conversion and an OpenAI‑style API make this plug‑and‑play, while the “ultra” mode had folks memeing about running it on a potato. The vibe: excitement from tinkerers dreaming of multi‑model setups, side‑eyed benchmarking questions from realists, and a dash of mod‑mystery spice. Either way, ZSE turned a boring boot‑time stat into the week’s hottest comment‑section show.
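The “OpenAI‑style API” claim means any standard chat‑completions client should work against a local ZSE server. A minimal sketch of what that request would look like — note the base URL, endpoint path, and model id here are assumptions for illustration, not confirmed ZSE defaults (check the repo’s README for the actual serve command):

```python
import json

# Assumed local serve address -- an OpenAI-compatible server
# conventionally exposes /v1/chat/completions under its base URL.
BASE_URL = "http://localhost:8000/v1"

# Hypothetical model id; ZSE's actual naming may differ.
payload = {
    "model": "qwen-7b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}

# With the server running, you would POST this payload, e.g. with requests:
#   requests.post(f"{BASE_URL}/chat/completions", json=payload).json()
body = json.dumps(payload)
```

Because the wire format matches OpenAI’s, existing SDKs that accept a custom base URL should also work unchanged.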

Key Points

  • ZSE is an open-source LLM inference engine optimized for minimal memory use with high performance, featuring custom attention kernels, mixed-precision quantization, a quantized KV cache, and layer streaming.
  • Cold start benchmarks on an A100-80GB show 3.9s (Qwen 7B) and 21.4s (Qwen 32B) to first token using the native .zse format, delivering up to 11.6× speedups over bitsandbytes.
  • Memory usage is significantly reduced: Qwen 7B needs 5.2 GB (INT4/NF4) vs 14.2 GB (FP16); Qwen 32B uses 19.3 GB (NF4) or ~35 GB (.zse) vs ~64 GB (FP16).
  • ZSE supports Hugging Face Transformers, safetensors, GGUF (via llama.cpp), and offers an OpenAI-compatible API, with developer and enterprise deployment modes.
  • Efficiency modes (speed, balanced, memory, ultra) and the Intelligence Orchestrator tailor performance to available memory; layer streaming enables running 70B models on 24 GB GPUs.
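The memory figures above follow directly from bits-per-parameter arithmetic: FP16 stores 2 bytes per weight, INT4/NF4 stores half a byte, with quantization constants and any unquantized layers (e.g. embeddings) adding overhead on top. A back-of-envelope sketch — the helper below is illustrative, not part of ZSE:

```python
# Rough GPU-memory estimate for model weights alone.
# Hypothetical helper for illustration; the article's reported numbers
# (7B: 14.2 GB FP16 vs 5.2 GB INT4/NF4) include real-world overhead.

def weight_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate GB needed to hold the weights at a given precision."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

fp16_7b = weight_gb(7, 16)  # 14.0 GB raw vs 14.2 GB reported
nf4_7b = weight_gb(7, 4)    # 3.5 GB raw vs 5.2 GB reported, the gap
                            # being quantization constants and overhead
print(f"7B FP16 ~{fp16_7b:.1f} GB, 7B NF4 ~{nf4_7b:.1f} GB")
```

The same arithmetic explains the layer-streaming claim: a 70B model at 4 bits is ~35 GB of weights, so fitting it on a 24 GB GPU requires paging layers in and out rather than holding them all resident.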

Hottest takes

"This is so freaking awesome" — medi_naseri
"run 10 models on two GPUs" — medi_naseri
"removed_by_moderator" — reconnecting
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.