Monday, November 20, 2006

Lying Statistics, Or Not?

“The national unemployment rate is at a low 4.7%,” so said Elaine L. Chao, Secretary of Labor, in a speech delivered at the National Academies. The occasion was the Convocation on Rising Above the Gathering Storm: Energizing and Employing Regions, States, and Cities, held on September 28, 2006.

That’s one statistic that should make the doubters of the American economy cringe. We have often heard the refrain, “Statistics don’t lie”, when they are used to justify a cause. But don’t they?

In his book, How to Lie with Statistics, Darrell Huff cautions that "The secret language of statistics, so appealing in a fact-minded culture, is employed to sensationalize, inflate, confuse, and oversimplify."

Welcome to the world of statistics, which helps us make sense of the huge amounts of data we collect. Averages, correlations, regressions, and various statistical coefficients now dominate every presentation, be it a sales pitch, a forensic analysis, or a report on the state of the environment.

In my own work, I use statistics in calibrating a computer model to reproduce real-world events, in this case a tidal flow regime. Computer models are necessarily approximations, owing to our incomplete understanding of the physical processes involved, which span a cascade of spatial and temporal scales, and to the absence of closed-form solutions for the governing equations.

The approximations introduced, and the averaging over both time and space, are aimed at making the problem tractable and amenable to numerical solution. The averaging implies that phenomena occurring over time spans shorter, or length scales smaller, than a certain cutoff will not be simulated. These omissions are accounted for, at least partially, by introducing empirical parameters, which skeptics call fudge factors.

The purpose of model calibration is to vary these fudge factors within, one hopes, a physically meaningful range so that the model results fit well with observations and measurements of selected variables (water elevations and flow velocities in the case of tidal modeling). Oftentimes the variations of these variables over a period of time (time series) at several locations within the domain of interest serve as the basis for comparison. The easiest comparison is graphical: eyeballing how well the modeled curve fits the measured curve.

A more elaborate, and supposedly more objective, means of evaluating the goodness of fit is to employ statistical measures such as the correlation coefficient or the root-mean-square error. That, in a nutshell, is how I employ statistics in one aspect of my work.
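To make that concrete, here is a minimal Python sketch of the two measures just named, applied to a hypothetical pair of modeled and measured water-elevation series. The synthetic tide, the noise level, and the amplitude and phase errors are invented for illustration and are not taken from any actual model run.

```python
import numpy as np

def rmse(modeled, observed):
    """Root-mean-square error between two equal-length series."""
    modeled, observed = np.asarray(modeled), np.asarray(observed)
    return np.sqrt(np.mean((modeled - observed) ** 2))

def correlation(modeled, observed):
    """Pearson correlation coefficient between two series."""
    return np.corrcoef(modeled, observed)[0, 1]

# Hypothetical hourly water elevations (m) over two days at one station:
# a synthetic "measurement" (a 12.42-hour M2-like tide plus noise) and a
# "model" result that is slightly off in amplitude and phase.
t = np.arange(0.0, 48.0, 1.0)  # hours
observed = 1.0 * np.sin(2 * np.pi * t / 12.42) + np.random.normal(0.0, 0.05, t.size)
modeled = 0.9 * np.sin(2 * np.pi * (t - 0.5) / 12.42)

print(f"RMSE        = {rmse(modeled, observed):.3f} m")
print(f"correlation = {correlation(modeled, observed):.3f}")
```

A low RMSE and a correlation close to one would indicate a good fit; how low and how close is "good enough" remains, of course, a judgment call.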

Once a satisfactory fit is achieved, the whole process is repeated for an independent set of observations, but with the values of the “calibrated” fudge factors held constant. Only when this verification phase is satisfactorily concluded is the model deemed validated and fit to be used in prognostic mode with some measure of confidence.
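A skeletal version of that calibrate-then-verify loop might look like the following sketch; run_model, the bottom-friction coefficient, and its candidate range are stand-ins invented for the example, not pieces of any real tidal code.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two equal-length series."""
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def run_model(friction, t):
    """Stand-in for the tidal model: water elevation (m) at hours t for a
    given bottom-friction coefficient (purely illustrative physics)."""
    return (1.0 - 2.0 * friction) * np.sin(2 * np.pi * (t - 10.0 * friction) / 12.42)

# Synthetic "measurements" for a calibration period and an independent
# verification period (hours), with a little observational noise.
t_cal, t_ver = np.arange(0.0, 48.0, 1.0), np.arange(48.0, 96.0, 1.0)
obs_cal = np.sin(2 * np.pi * t_cal / 12.42) + np.random.normal(0.0, 0.05, t_cal.size)
obs_ver = np.sin(2 * np.pi * t_ver / 12.42) + np.random.normal(0.0, 0.05, t_ver.size)

# Calibration: vary the fudge factor over a (hopefully) physical range and
# keep the value whose results best fit the calibration observations.
candidates = np.linspace(0.0, 0.1, 21)
best = min(candidates, key=lambda f: rmse(run_model(f, t_cal), obs_cal))

# Verification: hold the calibrated value fixed and score the model against
# the independent observations before trusting it in prognostic mode.
print(f"calibrated friction coefficient = {best:.3f}")
print(f"verification RMSE               = {rmse(run_model(best, t_ver), obs_ver):.3f} m")
```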

Regardless of whether statistics is employed to illustrate a national average such as the one cited by the Secretary of Labor or to validate a numerical model in my work, it is just a tool for representing a state of affairs concisely, and perhaps simplistically, so that human minds can make sense of it and thus make informed decisions.

In that sense, I do believe that statistics don't lie, unless the data used are cooked or massaged, or the samples are unrepresentative or biased. And that is precisely where one can lie with statistics.
