Meta's Omnilingual MT for 1,600 Languages

Meta says it speaks 1,600 languages — commenters clap back, crack jokes, and demand receipts

TLDR: Meta unveiled a translation system covering 1,600 languages and says its smaller models rival much larger ones. Commenters are split: some report bad results in real-world, low-resource languages; others question the “omni” label and raise ethics concerns; and jokers roast bad dev-doc translations and missing paragraph breaks.

Meta just dropped an “Omnilingual” bombshell: a translation system for 1,600 languages, claiming its smaller models can rival a giant 70B-parameter model. There are fancy new datasets with bouquet-scented names (yes, BOUQuET), toxicity checks, and two model flavors: a decoder-only one built on LLaMA and an encoder–decoder one built on NLLB. Think one brain that does everything in a single stack, and one that reads first, then speaks. The promise: more languages actually written coherently, not just understood.

But the comments? Spicy. A Khmer speaker in Cambodia says Meta’s translations are still weak for lesser-known languages and swears that general-purpose chatbots built on large language models give more natural, culturally aware results. Another user throws the “Omni” shade: neat flex, but it’s still far from the world’s 7,000+ languages. “First 1,000 are the hardest,” they admit, but the “omni” label feels extra.

Then the punchlines. One developer points at the broader AI translation chaos, joking about Microsoft docs that translate programming keywords like “try/catch” into everyday German (“versuchen/fangen”). Another quips that Meta can translate 1,600 tongues but apparently can’t master basic paragraph breaks. And the heaviest hit: a warning not to celebrate Meta’s language reach while some allege the company’s platform has harmed vulnerable communities, citing Amnesty International.

Bottom line: big claim, bigger expectations. The crowd wants proof—in the wild, in low-resource languages, and with fewer faceplants—and they’re not shy about it.

Key Points

  • OMT is presented as the first machine translation system supporting more than 1,600 languages.
  • A comprehensive data strategy merges public corpora with new datasets, including MeDLEY bitext, synthetic backtranslation, and mined data.
  • Evaluation uses BLASER 3, OmniTOX, and two large human-created datasets (BOUQuET and Met-BOUQuET), with resources and a leaderboard freely available.
  • Two specialized models are explored: OMT-LLaMA (decoder-only with retrieval-augmented translation) and OMT-NLLB (encoder–decoder via OmniSONAR using non-parallel data).
  • Models with 1B–8B parameters match or exceed a 70B LLM baseline and notably improve coherent generation for undersupported languages, with further gains from finetuning and retrieval-augmented generation (RAG; see the sketch below).
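
To make the RAG idea concrete, here is a minimal, purely illustrative sketch of retrieval-augmented translation prompting: fetch the most similar sentence pairs from a bitext store and prepend them as few-shot examples before the sentence to translate. The bitext store, the similarity function, and the prompt format are all assumptions for illustration, not Meta’s actual OMT-LLaMA pipeline, which would use dense sentence embeddings and a real decoder-only LLM.

```python
# Illustrative sketch of RAG-style translation prompting.
# All names and data here are hypothetical, not Meta's OMT pipeline.
from difflib import SequenceMatcher

# Tiny in-memory "bitext store": (source, target) pairs for English -> German.
BITEXT = [
    ("Good morning.", "Guten Morgen."),
    ("Where is the train station?", "Wo ist der Bahnhof?"),
    ("I would like a coffee, please.", "Ich hätte gerne einen Kaffee, bitte."),
    ("The weather is nice today.", "Das Wetter ist heute schön."),
]

def retrieve(query: str, k: int = 2):
    """Return the k bitext pairs whose source side is most similar to the query.

    A real system would use dense sentence embeddings (e.g. a SONAR-style
    encoder) rather than character-level similarity.
    """
    scored = sorted(
        BITEXT,
        key=lambda pair: SequenceMatcher(None, query, pair[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, src: str = "English", tgt: str = "German") -> str:
    """Prepend retrieved pairs as few-shot examples for a decoder-only LLM."""
    lines = [f"Translate from {src} to {tgt}."]
    for s, t in retrieve(query):
        lines.append(f"{src}: {s}\n{tgt}: {t}")
    lines.append(f"{src}: {query}\n{tgt}:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    # The assembled prompt would be sent to the translation model;
    # printing it here just shows the structure.
    print(build_prompt("Where is the nearest station?"))
```

The design point, under these assumptions: for undersupported languages, a handful of retrieved in-domain pairs can steer a decoder-only model toward coherent output that pure zero-shot prompting often misses.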

Hottest takes

"Be careful whose sins you’re laundering" — bikeshaving
"Meta’s translations are very poor… for relatively obscure languages" — stingraycharles
"They can translate 1600 languages, but they cannot do basic text formatting" — garyclarke27
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.