Monday, October 5, 2015

Box's advice for statistical scientists

I recently discovered a new word - mathematistry, coined in 1976 by the statistician / data scientist George Box.  If I had to guess what it meant, I would think it might be something quite cool like a cross between mathematics and wizardry.  In fact, the way George Box defined it, the word is closer to a cross between mathematics and sophistry.

Box's full definition of mathematistry is as follows:

'.. the development of theory for theory's sake, which, since it seldom touches down with practice, has a tendency to redefine the problem rather than solve it.  Typically, there has once been a statistical problem with scientific relevance but this has long been lost sight of.'

According to Box, mathematistry also has a flip side which he calls 'cookbookery'.  This is

'the tendency to force all problems into the molds of one or two routine techniques, insufficient thought being given to the real objectives of the investigation or to the relevance of the assumptions implied by the imposed methods.'

There is so much useful advice contained in those two concepts! Personally I find it quite easy to forget to 'touch down with practice'. How many times have I spent a long time coming up with a solution to something, only to find that the attempt to put it into practice reveals an obvious flaw that could have been identified quite early on? And I also find it much easier to focus on one way of solving a problem than to consider several different options and invest time in evaluating which approach is best.

I have recently been having a look at a Kaggle challenge called Springleaf. I am not particularly interested in trying to get to the top of the leaderboard, since the differences between the top 1000 entries seem pretty marginal to me, but I am interested in finding out what makes a big difference to predictive accuracy.

One of the things I have learnt is that being able to make use of all the variables in the data-set makes a big difference. Regression trees are very easy to apply to multivariate data-sets (e.g. 100s or 1000s of variables), and I have found them to be quite effective for prediction. There are also relatively straightforward ways of training them to avoid over-fitting.
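To make that concrete, here is a minimal sketch in Python with scikit-learn (not the code I actually used for Springleaf) of fitting a regression tree to a wide synthetic data-set, with cross-validation over the tree depth as one simple way of limiting over-fitting. The data and the parameter grid are made up purely for illustration.

```python
# A minimal sketch, not the Springleaf code itself: fit a regression tree to a
# wide synthetic table and let cross-validation choose the depth.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a wide Kaggle-style data-set: 1000 rows, 200 columns.
X, y = make_regression(n_samples=1000, n_features=200, noise=10.0, random_state=0)

# Deeper trees fit the training data better but over-fit, so cross-validate
# over max_depth rather than picking it by hand.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best depth:", search.best_params_["max_depth"])
```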

The other thing I have discovered is boosting. This still seems a bit magical to me. The basic idea is to build an ensemble of models sequentially: each new model is fitted to the errors that the current ensemble still makes, and its predictions are added to the running total. To me, it seems like there must be a single model that would give the same predictions as the ensemble. In any case, I have found that boosting makes a surprisingly big difference to predictive accuracy compared to approaches that try to fit a single model, at least for regression trees.
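Here is a rough sketch of the idea as I understand it, for squared-error loss: each shallow tree is fitted to the residuals of the current ensemble, and its shrunken predictions are added on. This is only a toy version of what libraries such as scikit-learn's GradientBoostingRegressor do properly, and the function names and settings below are just for illustration.

```python
# A toy boosting loop for squared-error loss: fit each tree to the residuals
# of the ensemble so far and add its (shrunken) predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def boosted_trees(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())   # start from the mean of the target
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction           # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)               # fit the next tree to the residuals
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict(base, trees, X, learning_rate=0.1):
    out = np.full(X.shape[0], base)
    for tree in trees:
        out += learning_rate * tree.predict(X)
    return out

# Made-up data, just to show the functions in use.
X, y = make_regression(n_samples=500, n_features=50, noise=10.0, random_state=0)
base, trees = boosted_trees(X, y)
fitted = predict(base, trees, X)
```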

In The Elements of Statistical Learning by Hastie, Tibshirani & Friedman there is a performance comparison of different classification methods. They include boosted trees, bagged neural networks and random forests, which (I think) are all different ways of creating an ensemble of models. The one method that didn't create an ensemble was Bayesian neural nets, which use MCMC to explore the parameter space. The approach that came out on top was Bayesian neural networks, but the ensembling methods were not too far behind. This suggests to me that averaging over models and averaging over different parameter values within a sufficiently flexible model have a similar effect. However, it is something I would like to understand in more depth.
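As a small illustration of the 'averaging over models' point (not a reproduction of the comparison in the book), here is a sketch that cross-validates a single regression tree against a random forest, which averages many trees, on made-up data. The data-set and settings are arbitrary.

```python
# Compare a single tree with an averaged ensemble of trees by cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=100, noise=10.0, random_state=1)

models = [
    ("single tree", DecisionTreeRegressor(random_state=1)),
    ("averaged forest", RandomForestRegressor(n_estimators=200, random_state=1)),
]
for name, model in models:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```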

I would highly recommend The Elements of Statistical Learning as a great example of statistical science that avoids both mathematistry and cookbookery. The drawback of books is that they go out of date quite quickly, but this one is relatively recent and definitely still relevant. The only thing worth being careful with is the references to software packages, as these seem to be evolving on a faster time-scale than the underlying concepts.