March 30, 2026

Clean search, spicy comment war

An NSFW Filter for Marginalia Search

Marginalia adds a 'keep it clean' switch—fans argue “Is this AI?”

TLDR: Marginalia is adding an optional NSFW filter built on a simple neural net, trained on data labeled by open‑source LLMs. Commenters split: some say this breaks the site’s “no AI” vibe, others push heavier models or self‑labeling, but speed and CPU limits keep it lean. The stakes: keeping indie search fast.

Marginalia Search is rolling out an optional “keep it clean” filter, and the comments instantly turned into a spicy taste test. The indie engine’s creator says they tried everything—old-school word lists, a fast Facebook-era classifier (fastText), even letting open‑source bots label tens of thousands of pages—before settling on a tiny, from‑scratch neural network that runs fast on everyday CPUs. Why? Speed over splashy tech.
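The bot-labeling step can be sketched in a few lines: hand a page's text to a locally served open-source model with a constrained prompt, then parse a one-word verdict. Everything here is illustrative; `ask_model`, the prompt wording, and the stand-in reply are assumptions, not Marginalia's actual pipeline.

```python
import re

# Hypothetical prompt; the real wording used by the project is not public here.
PROMPT = ("Classify the following web page text as NSFW or SAFE. "
          "Answer with exactly one word.\n\n{text}")

def parse_label(reply):
    """Pull the first NSFW/SAFE token out of a free-form model reply."""
    m = re.search(r"\b(NSFW|SAFE)\b", reply.upper())
    return m.group(1) if m else None

def label_page(text, ask_model):
    """ask_model: any callable prompt -> reply. In a real pipeline this
    would call a locally served LLM; injecting it keeps the sketch offline."""
    return parse_label(ask_model(PROMPT.format(text=text)))

# Offline stand-in for the model call (a canned, hypothetical reply).
fake_model = lambda prompt: "This page is SAFE."
label = label_page("kitten pictures and knitting patterns", fake_model)
```

Parsing defensively matters because even a "one word only" instruction often comes back wrapped in chatter; regex extraction keeps the labels usable either way.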

That “no AI” slogan on Marginalia Search lit the fuse. One newcomer quipped that with this change, the promise is broken, sparking a semantics brawl: is a simple neural net “AI,” or just clever math? Others piled on with armchair solutions. One camp waved in modern tricks (“use embeddings and decision trees,” “fine‑tune a BERT”) while the maintainer reminded folks those won’t fly when you need results in milliseconds.

Down another thread, a pragmatic voice asked why sites don’t just label themselves as adult. The chorus reply: cute idea, almost nobody does. Meanwhile, the builder called the whole effort a “meandering project,” which the crowd translated as: it’s hard to teach a robot what’s risqué without slowing search to a crawl. Oh, and the first fast model flopped because the training samples were too “saucy‑adjacent,” proving garbage in, garbage out. The vibe? Admiring the hustle, debating the label, and joking that robots are now taste‑testing the internet so we don’t have to.

Key Points

  • Marginalia Search is developing an optional NSFW filter to improve over prior domain-based (UT1 list) filtering.
  • CPU-only serving and low-latency requirements ruled out transformer-based models.
  • An early fastText-based classifier underperformed, partly due to biased training data collected via NSFW-leaning queries.
  • About 10,000 samples (roughly 60/40 NSFW/SAFE) were labeled using open-source LLMs (Qwen 3.5 via ollama) to build training data without manual effort.
  • The project is converging on a custom, single hidden layer neural network implemented from scratch for fast CPU inference.
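A single-hidden-layer classifier like the one described is small enough to sketch end to end in NumPy. This is a generic illustration of the technique, not Marginalia's code: the hashed bag-of-words features, layer sizes, learning rate, and toy snippets below are all assumptions.

```python
import numpy as np

def featurize(text, n_features=256):
    """Hashed bag-of-words: a cheap, fixed-size input vector.
    (Python's salted str hash varies across runs; fine for a one-process demo.)"""
    x = np.zeros(n_features)
    for tok in text.lower().split():
        x[hash(tok) % n_features] += 1.0
    return x

class TinyNet:
    """One hidden layer (ReLU), sigmoid output giving P(page is NSFW)."""
    def __init__(self, n_features, n_hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, n_hidden)
        self.b2 = 0.0

    def predict(self, X):
        h = np.maximum(0.0, X @ self.W1 + self.b1)          # ReLU hidden layer
        return h, 1.0 / (1.0 + np.exp(-(h @ self.W2 + self.b2)))

    def train_step(self, X, y, lr=0.5):
        """One full-batch gradient step on binary cross-entropy; returns loss."""
        h, p = self.predict(X)
        dz = (p - y) / len(y)                               # d(BCE)/d(logit)
        dW2, db2 = h.T @ dz, dz.sum()
        dh = np.outer(dz, self.W2) * (h > 0)                # back through ReLU
        self.W1 -= lr * (X.T @ dh)
        self.b1 -= lr * dh.sum(axis=0)
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
        eps = 1e-9
        return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Toy demo with made-up, hand-labelled snippets (1 = NSFW, 0 = SAFE).
texts = ["xxx adult videos explicit", "family cooking recipes blog",
         "explicit adult chat xxx",   "open source search engine news"]
y = np.array([1.0, 0.0, 1.0, 0.0])
X = np.stack([featurize(t) for t in texts])
net = TinyNet(n_features=256)
losses = [net.train_step(X, y) for _ in range(300)]
_, probs = net.predict(X)
```

The whole forward pass is two small matrix multiplies plus a sigmoid, which is why this class of model stays fast on ordinary CPUs where a transformer would blow the latency budget.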

Hottest takes

"decision trees on the embedding vector (e.g. catboost) tend to work pretty well" — ChadNauseam
"Self-labeling seems valuable in some ways" — 8organicbits
"With this change, that is no longer true though, is it?" — VorpalWay
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.