Wednesday, July 02, 2008

Anscombe's Quartet

Anscombe's Quartet is a group of four data sets that provide a useful caution against blindly applying statistical methods to data. Each data set consists of eleven x- and y-values such that the mean and variance of x and y, the correlation coefficient, regression line, and error of fit using the line are nearly identical across the four sets. But as you can see, they are clearly quite different data sets:

The x- and y-values are included at the end of this post in a Matlab-friendly format. The Quartet provides a stark lesson on how useful it can be to simply look at one's data before diving in with all sorts of statistical ninja. For instance, Set 2 can be modelled with linear regression to yield essentially the same mean squared error between the regression line and the data as the other sets, but it ain't a linear relationship.

This is old hat for most of us, but I like the Quartet for its simplicity and visual impact.

These data and graphs were first presented by F.J. Anscombe in 1973 in his paper Graphs in Statistical Analysis. It is quite fun reading over the paper, which ends:
"Unfortunately, most persons who have recourse to a computer for statistical analysis of data are not much interested either in computer programming or in statistical method, being primarily concerned with their own proper business. Hence the common use of library programs and various statistical packages. Most of these originated in the pre-visual era. The user is not showered with graphical displays. He can get them only with trouble, cunning, and a fighting spirit. It's time that was changed."
Thank goodness for Matlab.



The data (coded for Matlab)
x1=[10 8 13 9 11 14 6 4 12 7 5];
x2=x1;
x3=x1;
x4=[8 8 8 8 8 8 8 19 8 8 8];

y1=[8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68];
y2=[9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 4.74];
y3=[7.46 6.77 12.74 7.11 7.81 8.84 6.08 5.39 8.15 6.42 5.73];
y4=[6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 6.89];
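
If you want to check the claim about matching statistics yourself, here is a minimal sketch. It is only one way to do it, not anything from Anscombe's paper: the matrix names X, Y and stats are my own, the data are repeated (one set per row) so the snippet runs by itself, and it sticks to base functions (mean, var, corrcoef, polyfit, polyval).

```matlab
% Anscombe's Quartet, one data set per row (same values as x1..x4, y1..y4 above)
x1 = [10 8 13 9 11 14 6 4 12 7 5];
X = [x1; x1; x1; 8 8 8 8 8 8 8 19 8 8 8];
Y = [8.04 6.95  7.58 8.81 8.33 9.96 7.24  4.26 10.84 4.82 5.68;
     9.14 8.14  8.74 8.77 9.26 8.10 6.13  3.10  9.13 7.26 4.74;
     7.46 6.77 12.74 7.11 7.81 8.84 6.08  5.39  8.15 6.42 5.73;
     6.58 5.76  7.71 8.84 8.47 7.04 5.25 12.50  5.56 7.91 6.89];

% One row of summary statistics per data set.
% Columns: mean(x), var(x), mean(y), var(y), r, slope, intercept, MSE of the line fit
stats = zeros(4, 8);
for k = 1:4
    x = X(k, :);
    y = Y(k, :);
    R = corrcoef(x, y);           % 2x2 correlation matrix; R(1,2) is r
    p = polyfit(x, y, 1);         % least-squares line: p(1) slope, p(2) intercept
    resid = y - polyval(p, x);    % residuals about the fitted line
    stats(k, :) = [mean(x), var(x), mean(y), var(y), ...
                   R(1, 2), p(1), p(2), mean(resid.^2)];
end

disp(round(stats * 100) / 100);   % rounded to two decimals for easy comparison
```

The four rows should agree to about two decimal places, which is the whole point. To see how different the sets really are, plot each row, for example with subplot(2,2,k) and plot(X(k,:), Y(k,:), 'o') inside the same loop.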

4 comments:

Derek Slater said...

Cool.

Reminds me of this post & discussion (some while ago) on Cosmic Variance, which became quite a hoot.

Derek Slater said...

whoops. link:

http://cosmicvariance.com/2007/07/13/the-best-curve-fitting-ever/

Eric Thomson said...

Derek: that's a great example of statistics abuse!

takchess said...

it's nice to see you blogging away still.

On an unrelated note, I am finding this site to have a lot of interesting reading...

www.treehugger.com