February 25, 2026
When letters start catfishing
Confusables.txt and NFKC disagree on 31 characters
Unicode drama: 31 look-alike letters, angry devs, and the “toxic text” crowd
TLDR: A study found 31 Unicode characters where the “look‑alike” list conflicts with standard normalization, flipping ſ and styled I’s in surprising ways. The comments explode into safety vs inclusivity: block confusables or risk impersonation, with some calling that Latin-only gatekeeping and others treating all user text as hazardous.
Unicode’s newest kerfuffle: a deep dive found 31 characters where the official “look‑alike” list (TR39) fights with standard normalization (NFKC). The spicy example: the Long S (ſ) looks like “f” but NFKC says it really means “s.” And all those styled capital I’s that confusables.txt maps to lowercase L? NFKC flattens them to plain I, which becomes “i” when lowercased. If you normalize first (as GitHub, ENS, and Unicode IDNA recommend), some confusable rules become “dead code,” or worse, wrong—turning teſt into teft. That’s the plot twist behind the 31-character list.
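The NFKC half of that story is easy to check with Python's standard `unicodedata` module. This is only a minimal sketch: the helper name `nfkc_then_lower` and the three sample characters are picked for illustration and are not taken from the study's list.

```python
import unicodedata


def nfkc_then_lower(text: str) -> str:
    """Normalize with NFKC, then lowercase (hypothetical helper name)."""
    return unicodedata.normalize("NFKC", text).lower()


# Illustrative samples only; the study's full 31-character list is not reproduced here.
samples = {
    "long s": "\u017f",
    "mathematical bold capital I": "\U0001d408",
    "fullwidth capital I": "\uff29",
}

for name, ch in samples.items():
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {name}: NFKC -> {folded!r}, lowercased -> {folded.lower()!r}")
```

The long s comes out as "s", and both styled capitals come out as "I" and then "i", never "f" or "l".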
But the comments stole the show. One camp blasts the “reject confusables” advice as user‑hostile, calling it sneaky Latin‑only gatekeeping. The opposite camp goes full bunker: “Treat user text as toxic waste”—lock it in a fixed box, clip the overflow, and stop the catfishing. Pragmatists ask, “Whose confusion are we guarding against?” while process nerds question whether maintaining two lists is worth the complexity. Jokes flew: Il1 paranoia memes, “Administrator” catfish handles, and 800‑pixel usernames built from decorative marks. Some dream of a broader, search‑friendly “fuzzy” equivalence; others shrug and say normalize, lowercase, move on. It’s Unicode’s reality TV: looks vs meaning, safety vs inclusivity, and everyone thinks their font is the hero.
Key Points
- The article identifies 31 characters where confusables.txt (UTS #39) and NFKC normalization map to different Latin letters.
- NFKC normalization collapses compatibility variants (fullwidth forms, ligatures, styled characters) to their plain equivalents, and is recommended as the first step, before any confusable mapping is applied.
- The long s (ſ, U+017F) maps to “f” in confusables.txt but to “s” under NFKC, illustrating the conflict between visual and semantic mapping.
- Many capital I variants are mapped to lowercase “l” by confusables.txt but normalize to “I” under NFKC, yielding “i” after lowercasing.
- Applying NFKC before the confusable mappings avoids these errors; applying the confusable mappings first can produce incorrect results (e.g., “teſt” → “teft”).
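The ordering point in the last bullet can be sketched in Python. Everything here is an assumption made for illustration: `CONFUSABLES_EXCERPT` is a three-entry table typed by hand from the mappings described above (plus Cyrillic а as a contrast case), not the real confusables.txt, and the function names are hypothetical.

```python
import unicodedata

# Hand-typed excerpt for illustration only; NOT the real confusables.txt.
CONFUSABLES_EXCERPT = {
    "\u017f": "f",      # LATIN SMALL LETTER LONG S, looks like "f"
    "\U0001d408": "l",  # MATHEMATICAL BOLD CAPITAL I, looks like "l"
    "\u0430": "a",      # CYRILLIC SMALL LETTER A, untouched by NFKC
}


def apply_confusables(text: str) -> str:
    """Replace each character that has an entry in the excerpt table."""
    return "".join(CONFUSABLES_EXCERPT.get(ch, ch) for ch in text)


def nfkc_first(text: str) -> str:
    """Normalize first, then map confusables, then lowercase."""
    return apply_confusables(unicodedata.normalize("NFKC", text)).lower()


def confusables_first(text: str) -> str:
    """Map confusables first, then normalize, then lowercase."""
    return unicodedata.normalize("NFKC", apply_confusables(text)).lower()


def conflicting_entries(table: dict) -> dict:
    """Entries whose source character NFKC rewrites to something other than the table's target."""
    conflicts = {}
    for ch, target in table.items():
        normalized = unicodedata.normalize("NFKC", ch)
        if normalized != ch and normalized != target:
            conflicts[ch] = (target, normalized)
    return conflicts


print(nfkc_first("te\u017ft"))         # test
print(confusables_first("te\u017ft"))  # teft
print(len(conflicting_entries(CONFUSABLES_EXCERPT)))  # 2
```

With NFKC applied first, the ſ and 𝐈 entries never fire because those characters are already gone by the time the table is consulted, while the Cyrillic а entry still does its job. A check along the lines of `conflicting_entries`, run over the full table, is one way such a list of disagreements could be produced, though the source does not say how the 31 were found.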