Show HN: OSS implementation of Test-Time Diffusion that runs on a 24GB GPU

Works on a 24GB GPU—cue flexing, gatekeeping, and memes

TL;DR: An open-source tool claims it can refine long research answers by iteratively searching and rewriting, but it needs a 24GB GPU. Comments split between hype for the method and backlash over hardware access, with jokes about the Gemini-written README and calls for a Colab-friendly, CPU-friendly version.

An open-source “research agent” just dropped, promising to clean up messy drafts by repeatedly searching the web and revising: an AI-powered editor that keeps denoising your writing until it sounds smart. The devs say it runs on a single 24GB GPU, uses speedy serving tools, and, yes, the README was “generated by Gemini,” which the comments instantly turned into a meme.

The reaction? A full-blown hardware class war. 4090 owners strutted in like VIPs, while laptop folks shouted “gatekeeping!” and begged for a Colab or CPU mode. One camp claims 24GB is “consumer-tier now”; the other calls that take “Silicon Valley cosplay.”

Then came the hot takes: is “Test-Time Diffusion” just fancy branding for “write, Google, rewrite”? Fans say it’s a legit strategy for long, tricky answers; skeptics call it marketing with extra steps. People loved the one-command Docker spin-up and vLLM speed, but asked for head-to-heads against GPT-4 and Claude, and for real-world tasks beyond contests. RAG (retrieval-augmented generation) gets explained in the thread as “AI that checks its homework with the internet.” Meanwhile, memes exploded: “Denoise my thesis,” “Denoise my life,” and “My browser tabs are already test-time diffusion.”

Key Points

  • TTD-RAG is an open-source implementation of Test-Time Diffusion for research report generation, submitted to the MMU-RAG competition.
  • The system performs iterative retrieval, synthesis, and denoising of a draft using the FineWeb Search API and a reranking model.
  • It serves Qwen/Qwen3-4B-Instruct-2507 and Qwen3-Reranker-0.6B via vLLM, with a FastAPI backend for high-throughput, low-latency operation.
  • The pipeline includes planning, drafting, iterative search/denoising, and final report synthesis, orchestrated by src/pipeline.py.
  • Deployment uses Docker and NVIDIA GPUs (24GB+ VRAM), supports competition-compliant endpoints, and is validated with local_test.py.

Hottest takes

"24GB is ‘consumer’ now? Tell me you’ve never shared a laptop without telling me" — thinLaptop
"It’s just ‘write, Google, rewrite’ with a Latin name" — marketing_rename
"Finally a reason to own a 4090 that isn’t ray-tracing my shame" — 4090dad
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.