November 12, 2025

When your money bot gets the jitters

LLM Output Drift in Financial Workflows: Validation and Mitigation (arXiv)

Study says small AI stays steady; big models wobble—and the crowd is divided

TLDR: A study finds small AI models give repeatable answers for finance tasks while a mega model often drifts. Commenters clash: some cheer audit-friendly consistency, others argue repeatability isn’t accuracy, and a loud crowd says don’t use chatbots for finance at all—build stable software instead.

The new arXiv paper sparks a banker-level brawl: researchers say smaller AI models deliver consistent, repeatable answers for financial tasks when set to "greedy" mode (temperature 0), while a flashy 120B-parameter giant stumbles, staying consistent on only 12.5% of runs. They tested five models across 480 runs and found that SQL generation stays solid even when you turn up the creativity (temperature 0.2), while RAG (aka search-and-summarize) gets wobbly fast. They even built a compliance-ready harness with SEC-filing-aware retrieval ordering, fixed seeds, and checks that keep math within ±5%. Translation: bigger isn't always better when auditors are watching.
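Those consistency percentages boil down to a simple idea: run the same prompt several times and score how many runs match the most common output. Here's a minimal sketch of that metric; the paper's exact scoring rule and the canned example outputs are assumptions, not the authors' code.

```python
from collections import Counter


def consistency_rate(outputs):
    """Fraction of runs matching the modal (most common) output.

    One plausible way to score run-to-run consistency; the paper's
    exact metric may differ.
    """
    if not outputs:
        return 0.0
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)


# A deterministic model (greedy decoding, fixed seed) repeats itself:
stable_runs = ["revenue rose 4.2%"] * 8

# A drifting model paraphrases between runs, so no answer repeats:
drifting_runs = ["revenue rose 4.2%"] + [
    f"revenue up roughly 4% (variant {i})" for i in range(7)
]

print(consistency_rate(stable_runs))    # 1.0 — every run identical
print(consistency_rate(drifting_runs))  # 0.125 — only 1 of 8 runs agree
```

Note how 1 matching run out of 8 lands exactly on the 12.5% figure quoted for the 120B model: with few repeats, a single agreeing pair is all the "consistency" you get.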

Cue the comment-section meltdown. raffisk flexes the numbers and governance angle: small wins the audit, but warns this measures reproducibility, not truth. measurablefunc drops the cynic's mic: "they're Markov chains." You wanted consistency from a dice roll? Meanwhile, 34679 goes full pragmatist: don't run money on chatbots; use them to write real software because "software doesn't drift." Memes fly: "your bank bot woke up on the wrong side of the cache," "diva model vs hall-monitor model." The biggest drama? Whether finance should chase perfectly repeatable AI or actually correct outcomes: the study shows you can lock behavior down, but the crowd's still asking whether it will be right when it matters.

Key Points

  • Study finds smaller LLMs (Granite-3-8B, Qwen2.5-7B) achieve 100% determinism at T=0.0, while GPT-OSS-120B achieves 12.5% consistency, with a significant gap (p<0.0001).
  • Introduces a deterministic, finance-calibrated test harness using greedy decoding (T=0.0), fixed seeds, and SEC 10-K structure-aware retrieval ordering.
  • Implements invariant checking for RAG, JSON, and SQL with ±5% materiality thresholds and SEC citation validation, plus a three-tier model classification system.
  • Across 480 runs and five models, SQL tasks remain stable even at T=0.2; RAG tasks show 25–75% drift, indicating task-dependent sensitivity.
  • Cross-provider validation shows determinism can transfer between local and cloud deployments, and the framework maps to FSB, BIS, and CFTC requirements for compliance.
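The ±5% materiality threshold in the key points amounts to a relative-error invariant: a model-extracted number passes only if it sits within 5% of a trusted reference value. A minimal sketch follows; the function name, signature, and zero-reference handling are assumptions for illustration, not the paper's harness.

```python
def within_materiality(model_value, reference_value, threshold=0.05):
    """Return True if model_value is within ±threshold (relative)
    of reference_value.

    Hypothetical invariant check mirroring the paper's ±5% materiality
    threshold; the zero-reference rule here is an assumption.
    """
    if reference_value == 0:
        # No relative error is defined against zero; require an exact match.
        return model_value == 0
    return abs(model_value - reference_value) / abs(reference_value) <= threshold


# 2% off the reference: passes the materiality check.
print(within_materiality(102.0, 100.0))  # True

# 6% off the reference: flagged as a drifted/incorrect figure.
print(within_materiality(94.0, 100.0))   # False
```

A check like this is what separates "the bot answered the same way twice" from "the bot's number is close enough to book": it catches drifted figures even when the output is perfectly repeatable.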

Hottest takes

“Smaller models (Qwen2.5-7B, Granite-3-8B) hit 100% determinism at T=0.0” — raffisk
“You can not expect consistent results & outputs” — measurablefunc
“Don't use LLMs for financial workflows” — 34679
Made with <3 by @siedrix and @shesho from CDMX. Powered by Forge&Hive.