January 4, 2026
Your line is playing favorites
Why does a least squares fit appear to have a bias when applied to simple data?
Tilted trendline? Commenters say it’s not bias—you’re using the wrong tool
TLDR: The “tilted” line isn’t bias: ordinary regression only minimizes vertical errors, while PCA considers noise in both directions. Comments split between “use Deming/Total Least Squares” and “remember OLS assumptions,” underscoring a bigger lesson—your line changes with your assumptions, so pick the right tool for your data.
A coder plotted a simple best-fit line and freaked out when it looked “tilted,” then used a PCA arrow (the direction of maximum spread) that felt way more “right.” The comments pounced. The loudest take: this isn’t bias, it’s assumptions. As dllu put it, linear regression only models noise in y, not x, while PCA treats noise in both directions. sega_sai added that they minimize different things—vertical drops versus shortest-to-the-line distances—so of course they point different ways. Then came the plot twist: a stats teacher, charlieyu1, confessed they discovered this mid-lecture and “felt embarrassed,” prompting a flurry of “Stats 101, but make it spicy.”
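To make the "different objectives" point concrete, here's a minimal sketch (not from the thread; data and names are illustrative) that scores both candidate lines under both error measures. Each line wins its own game and loses the other's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated cloud, centered so both candidate lines
# pass through the origin.
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
pts = np.column_stack([x - x.mean(), y - y.mean()])

def vertical_sse(slope):
    """Sum of squared vertical (y-direction) residuals for y = slope * x."""
    return np.sum((pts[:, 1] - slope * pts[:, 0]) ** 2)

def perpendicular_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    normal = np.array([-slope, 1.0]) / np.hypot(slope, 1.0)
    return np.sum((pts @ normal) ** 2)

ols_slope = np.polyfit(pts[:, 0], pts[:, 1], 1)[0]  # OLS of y on x
evals, evecs = np.linalg.eigh(np.cov(pts.T))
pca = evecs[:, np.argmax(evals)]                    # principal eigenvector
pca_slope = pca[1] / pca[0]

# Each line minimizes its own objective, not the other's.
print(vertical_sse(ols_slope) < vertical_sse(pca_slope))            # True
print(perpendicular_sse(pca_slope) < perpendicular_sse(ols_slope))  # True
```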
Fixes? The thread crowned Deming regression—an “errors-in-both-axes” fit—suggested by gpcz, while tomp reminded everyone that what your eyes want is Total Least Squares (closest-distance fit). The mood was half teachable moment, half roast: “Two regressions enter, one line leaves,” joked one user, as others debated whether normalizing the data helps, or just hides the real issue. Links flew to PCA, Deming regression, and Total least squares. Verdict: not a bug—just different goals. If x is clean and you’re predicting y, OLS is fine; if both are messy, upgrade your tool and save the drama for the comments.
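For the "upgrade your tool" camp: on centered data, Total Least Squares (which coincides with Deming regression when both axes have equal error variance) falls out of a single SVD, since the best-fit direction is the top right singular vector. A hedged sketch with illustrative names, not code from the thread:

```python
import numpy as np

def tls_line(x, y):
    """Total least squares fit: minimizes perpendicular distances.

    Equivalent to Deming regression with equal error variance on
    both axes. Returns (slope, intercept); assumes the best-fit
    line is not vertical.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x.mean(), y.mean()
    A = np.column_stack([x - xm, y - ym])
    # The first right singular vector spans the direction of maximum
    # variance; the TLS line runs through the centroid along it.
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    dx, dy = vt[0]
    slope = dy / dx
    return slope, ym - slope * xm
```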
Key Points
- Synthetic correlated data are generated in Python using NumPy with a dependency (mixing) matrix, per-axis scaling, and a mean offset (see the first sketch after this list).
- A linear least-squares regression of y on x (via NumPy’s polyfit) is plotted and appears tilted relative to the data cluster (second sketch below).
- The covariance matrix of the data is computed and diagonalized to find the eigenvector of maximum variance (third sketch below).
- The principal eigenvector aligns with the intuitive direction of the data’s spread, differing from the regression line.
- The author highlights that least squares minimizes only vertical errors while the eigenvector approach captures the direction of total variance, raising the question of why the fit is asymmetric and which method suits the data.
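The post's exact code isn't quoted, so the following sketches are plausible reconstructions of the key points, with the mixing matrix, scales, and offset chosen purely for illustration. First, the synthetic correlated cloud:

```python
import numpy as np

rng = np.random.default_rng(42)

n = 1000
z = rng.normal(size=(n, 2))          # independent standard normals
mix = np.array([[1.0, 0.0],          # dependency matrix: second
                [0.8, 0.6]])         # coordinate leans on the first
scale = np.array([2.0, 1.0])         # per-axis scaling
offset = np.array([5.0, -3.0])       # mean offset

data = z @ mix.T * scale + offset    # correlated cloud, shape (n, 2)
x, y = data[:, 0], data[:, 1]
```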
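Second, the OLS fit via polyfit, continuing with the x and y arrays from the sketch above. One practical aside: without equal axis scales, your eye can't judge "tilt" fairly.

```python
import numpy as np
import matplotlib.pyplot as plt

# OLS fit of y on x; np.polyfit returns the highest-degree
# coefficient first, so this unpacks as (slope, intercept).
slope, intercept = np.polyfit(x, y, 1)

plt.scatter(x, y, s=5, alpha=0.3)
plt.plot(x, slope * x + intercept, color="red",
         label=f"OLS: y = {slope:.2f}x + {intercept:.2f}")
plt.axis("equal")  # equal scales, so tilt judgments aren't an illusion
plt.legend()
plt.show()
```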
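Third, the eigenvector step, again continuing from the same x and y. The principal direction's slope generically differs from the OLS slope, which is attenuated toward zero whenever x carries noise:

```python
import numpy as np

cov = np.cov(x, y)                      # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
principal = eigvecs[:, -1]              # eigenvector of maximum variance

# Arrow from the centroid along the principal direction, scaled by one
# standard deviation of the data along that axis.
center = np.array([x.mean(), y.mean()])
tip = center + principal * np.sqrt(eigvals[-1])

ols_slope = np.polyfit(x, y, 1)[0]
pca_slope = principal[1] / principal[0]
print(ols_slope, pca_slope)  # OLS slope is typically the flatter of the two
```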