Why R² Alone Fails


R² and correlation are often treated as definitive measures of the relationship between two variables.

This post features an interactive sandbox that explores several edge cases, demonstrating how relying on these summary statistics without visualizing the data can be dangerously misleading.


🤔 What are R² and correlation?

→ R²

R², or the coefficient of determination, measures the proportion of variance in the dependent variable that is explained by the independent variable in a regression model.

For a least-squares line with an intercept, it ranges from 0 to 1, with higher values indicating that the line accounts for more of the variation in y.
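To make the definition concrete, here is a minimal sketch that fits a least-squares line and computes R² as 1 minus the ratio of residual variation to total variation. The function names (`fit_line`, `r_squared`) and the sample data are made up for illustration; they are not taken from the sandbox.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """R² = 1 - SS_res / SS_tot for the least-squares line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation left over after the line has done its best.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical, nearly linear sample data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f"R² = {r_squared(xs, ys):.4f}")
```

The closer the points sit to the fitted line, the smaller `ss_res` becomes and the closer R² gets to 1.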

→ correlation

The correlation coefficient (r) measures the strength and direction of a linear relationship between two variables, ranging from -1 to 1. R² is actually the square of the correlation coefficient in a simple linear regression!

Correlation describes the relationship directly, including its direction, whereas R² focuses on the explanatory power of a regression model.
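The link between the two numbers can be checked directly. The sketch below computes Pearson's r from its definition and R² from the residuals of a least-squares line, then compares r × r with R². The data is invented and deliberately decreasing, so it also shows that r keeps the sign of the relationship while R² does not.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """R² of the least-squares line, computed from residuals (not from r)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data where y decreases as x increases.
xs = [0, 1, 2, 3, 4, 5]
ys = [9.0, 7.5, 7.9, 5.1, 4.2, 2.0]

r = pearson_r(xs, ys)
r2 = r_squared(xs, ys)
print(f"r = {r:.3f}, r * r = {r * r:.3f}, R² = {r2:.3f}")
```

Here r is negative, yet r × r and R² agree up to floating-point error, which is why R² alone cannot tell you whether y rises or falls with x.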

🎮 Scatterplot, R², and Draggable Circles

Summary statistics are popular because they condense large datasets into a few easy-to-understand numbers. However, relying solely on them can lead to a false sense of clarity.

The graph below showcases datasets with high R² and correlation values, even when there's clearly no meaningful relationship between x and y.
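One way such a situation can arise is with two separate clusters. In the sketch below, x and y are completely uncorrelated inside each cluster, yet pooling the clusters produces a correlation above 0.9. The coordinates are invented for illustration and are not necessarily one of the datasets in the sandbox.

```python
import math


def pearson_r(points):
    """Pearson correlation for a list of (x, y) tuples."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


# Two hypothetical diamond-shaped clusters, centred on (2, 2) and (8, 8).
cluster_a = [(1, 2), (3, 2), (2, 1), (2, 3)]
cluster_b = [(7, 8), (9, 8), (8, 7), (8, 9)]
pooled = cluster_a + cluster_b

for name, pts in [("cluster A", cluster_a), ("cluster B", cluster_b), ("pooled", pooled)]:
    r = pearson_r(pts)
    print(f"{name}: r = {r:.3f}, R² = {r * r:.3f}")
```

The high pooled value only says that the two groups sit in different places. It says nothing about how y responds to x within either group, which a scatterplot makes obvious at a glance.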

Bonus: the circles are draggable! Experiment by moving them around and watch how the R² and correlation change in real time. It’s a great way to build intuition about these metrics.
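If you prefer to experiment outside the browser, the dragging can be imitated in a few lines. This is only a rough sketch: the `drag` helper and the six starting points are hypothetical and say nothing about how the sandbox itself is implemented.

```python
import math


def pearson_r(points):
    """Pearson correlation for a list of (x, y) tuples."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


def drag(points, index, new_xy):
    """Return a copy of points with the point at `index` moved to `new_xy`."""
    moved = list(points)
    moved[index] = new_xy
    return moved


# Hypothetical starting layout with zero correlation.
points = [(1, 3), (2, 2), (3, 3), (4, 2), (5, 3), (3, 3)]

# Drag the last circle further and further toward the top-right corner.
for target in [(3, 3), (6, 6), (10, 10)]:
    r = pearson_r(drag(points, 5, target))
    print(f"last circle at {target}: r = {r:.3f}, R² = {r * r:.3f}")
```

Moving a single circle far enough from the rest is enough to push both statistics from zero to a value that looks like a strong relationship.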

[Interactive scatterplot with a linear regression line; the axis runs from 0 to 10. Initial readout: R² → 0.672, Correlation → 0.819. Drag a circle to see the impact on R²!]


Contact

👋 Hey, I'm Yan and I'm currently working on this project!

Feedback is welcome ❤️. You can file an issue on GitHub, drop me a message on Twitter, or even send me an email pasting yan.holtz.data with gmail.com. You can also subscribe to the newsletter to know when I publish more content!