Monday, September 7, 2015

Late Summer Musings

Yesterday I had a go at explaining what I do to someone who doesn't know much about my research area.  I always find it interesting to discover what people think statisticians do.  This person had been given the job of analysing the results of a survey that had been run in their organisation, which is something that requires some statistical training to do well.  However, when it came to explaining what I am doing for my PhD, I found it quite difficult to make a connection between that type of work and my own.  It made me wonder: is the statistics community more like a sprawling family that has common ancestry but whose members have ended up doing quite different things?  Or are we a group of people from disparate backgrounds who have been drawn together by the need to solve a set of common problems?

I think both of these are potentially useful metaphors for understanding the statistics community.  On the one hand there has been branching between people who are practitioners of statistics and people who study mathematical statistics.  But at the same time I would argue that all statisticians are interested in being able to draw reliable conclusions from experimental and/or observational data.

Earlier in the summer I went on another APTS (Academy for PhD Training in Statistics) week in Warwick.  The courses were Applied Stochastic Processes (given by Stephen Connor) and Computer Intensive Statistics (given by Adam Johansen).  So what were these courses about, and how is the material in them helpful for drawing reliable conclusions from data?

Applied Stochastic Processes (ASP)

Stochastic processes are a branch of probability theory.  Some people say that you are either a statistician or a probabilist.  Although there are people who study probability theory but are not statisticians, I think that all statisticians must have a little bit of a probabilist inside them!  The university courses in maths and statistics that I have seen teach probability distributions before they teach statistical inference.  Understanding what a p-value is (one of the most basic tools used in statistical inference) requires you to know something about probability distributions.
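To illustrate that last point with a toy example (not from the course — just a standard one-sample z-test, where the test statistic is assumed to follow a standard normal distribution under the null hypothesis): computing a p-value is nothing more than evaluating a tail probability of that distribution.

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a standard-normal test statistic.

    Uses the standard normal CDF, Phi(z) = 0.5 * (1 + erf(z / sqrt(2))),
    and returns the probability of a value at least as extreme as z.
    """
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# A test statistic of z = 1.96 sits at the conventional 5% threshold.
print(round(two_sided_p_value(1.96), 3))  # roughly 0.05
```

So even this most basic inferential tool is really a statement about a probability distribution — which is why the distributions come first in the curriculum.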

More generally there are two main reasons (I can think of) why statisticians need probability theory.  One is that they study phenomena that are intrinsically random, and are therefore best described by probability distributions.  The other is that they use numerical methods that are intrinsically random, and so understanding the behaviour (e.g. convergence) of these numerical methods requires probability theory.
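The second reason is easy to make concrete.  Here is a minimal sketch (my own illustration, not from the course notes) of plain Monte Carlo: the method injects randomness deliberately, so its output is itself a random variable, and probability theory is what tells you how fast it converges.

```python
import random

def mc_estimate_pi(n, seed=0):
    """Plain Monte Carlo: estimate pi from the fraction of random
    points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n

# The estimate is random, but the law of large numbers guarantees
# convergence, at the usual O(1 / sqrt(n)) Monte Carlo rate.
for n in (1_000, 100_000):
    print(n, mc_estimate_pi(n))
```

Without the probabilistic theory behind the law of large numbers and the central limit theorem, you would have no way of knowing how much to trust the number this method spits out.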

ASP was primarily concerned with the second area, and in particular with the convergence of MCMC algorithms, although quite a few of the ideas along the way were relevant to the first area.

Reliable numerical methods are pretty essential for drawing conclusions from data, and with MCMC it can be pretty challenging to ensure that the method is reliable.  In an ideal world, probability theory would be able to tell you whether your numerical method was reliable.  However, in the same way that mathematical theory cannot currently account for all physical phenomena, probability theory cannot currently account for all probabilistic numerical methods.  Lots of the numerical methods that I use in my work are not particularly well understood from a theoretical point of view.  However the mathematical theory that exists for special cases can provide some useful guidance for more general cases.
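To give a feel for what is at stake, here is a minimal sketch of the simplest MCMC method of all, random-walk Metropolis (a generic textbook example, not one of the algorithms from my own work).  Theory tells us the chain converges to the target distribution; in practice one still has to judge how long to run it and how much burn-in to discard.

```python
import math
import random

def rw_metropolis(log_density, n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis: a minimal MCMC sampler.

    Proposes x' = x + step * N(0, 1) and accepts with probability
    min(1, pi(x') / pi(x)); the chain's stationary distribution is pi.
    """
    rng = random.Random(seed)
    x, lp = x0, log_density(x0)
    samples = []
    for _ in range(n_samples):
        x_new = x + step * rng.gauss(0.0, 1.0)
        lp_new = log_density(x_new)
        if math.log(rng.random()) < lp_new - lp:  # Metropolis accept step
            x, lp = x_new, lp_new
        samples.append(x)
    return samples

# Target: standard normal, log pi(x) = -x^2 / 2 (up to a constant).
chain = rw_metropolis(lambda x: -0.5 * x * x, n_samples=50_000)
burned = chain[5_000:]  # discard burn-in before summarising
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(round(mean, 2), round(var, 2))  # should be near 0 and 1
```

For a nice target like this, the theory (geometric ergodicity of the random walk chain) guarantees the summaries above are trustworthy; for the messier targets that turn up in real problems, that kind of guarantee is exactly what is often missing.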

I would like to be more specific about how the theory is useful beyond the special cases, but it is something I am still in the process of trying to understand better.

Computer Intensive Statistics (CIS)

A lot of recent advances in statistics have been catalysed by increased computer power.  Methods that would have been impractical 20-30 years ago are now routine.  This helps to continually extend the scope of research questions that can be answered, both in applied statistics and in computational / mathematical statistics.

The easiest way to ensure that a numerical method is reliable (and hence enables you to draw reliable conclusions from the data) is to use one that is not computer-intensive.  For a large part of the statistical community this approach works well, so many statisticians find that they never need to use computer-intensive methods.

Returning to our ideal world again for a moment, we would like statistical theory to be able to give us formulas that return the quantities we are interested in as a function of the data.  However, in practice such results only exist for simple models.  And if the phenomenon you are interested in is not well described by a simple model, it is unlikely you will be able to draw reliable conclusions without a computer-intensive method.

The main areas covered in the course were the bootstrap and MCMC.  One of the things I learnt about for the first time was the Swendsen-Wang algorithm, which is used in statistical physics.  It does seem a bit like a magic trick to me, the way you can make dependent variables conditionally independent through the addition of auxiliary variables.  Worth checking out if you like a bit of mathematical aerobics!
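The bootstrap half of the course is easier to sketch in a few lines.  The idea is disarmingly simple: resample your data with replacement, recompute your statistic on each resample, and use the spread of those recomputed values as an estimate of the statistic's sampling variability.  A minimal illustration (my own toy example, with made-up data):

```python
import random

def bootstrap_se(data, statistic, n_boot=2000, seed=0):
    """Bootstrap standard error: resample the data with replacement
    and measure the spread of the statistic across resamples."""
    rng = random.Random(seed)
    n = len(data)
    stats = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(resample))
    m = sum(stats) / n_boot
    return (sum((s - m) ** 2 for s in stats) / (n_boot - 1)) ** 0.5

# Standard error of the sample mean; for the mean, the bootstrap
# answer should agree closely with the textbook formula s / sqrt(n).
data = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 2.5, 3.9, 2.2, 3.0]
mean = lambda xs: sum(xs) / len(xs)
print(round(bootstrap_se(data, mean), 3))
```

The appeal is that the same recipe works unchanged for statistics with no tidy standard-error formula — a medians, a ratio, a fitted model parameter — which is precisely the computer-intensive trade: replace analytical tractability with computing time.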