Research in Reading

Monday, May 7, 2018

Navigating the C++ forest

I have recently been working my way through 'C++ Primer', 5th edition by Lippman, Lajoie and Moo. The front cover says the book has been a bestseller since 1986 and also that the most recent edition has been completely rewritten for the new C++11 standard. The authors have worked in leading companies and laboratories such as Bell Laboratories, Pixar, Microsoft, IBM and AT&T. And they have worked closely with the creator of C++, Bjarne Stroustrup. The contents of the book live up to the billing on the cover!

I learnt some C++ a long time ago with the help of a book called 'C++: A Beginner's Guide', 2nd edition by Schildt and a university course based on the book 'Guide to Scientific Computing in C++' by Pitt-Francis and Whiteley. These books are both fairly introductory. I would recommend them, not least because they are both much shorter than C++ Primer.

The advantages of C++ Primer, in my opinion, are that it gives a more comprehensive overview of the language and has a strong emphasis on Modern C++, e.g., features that were introduced in the C++11 standard. Somehow C++ Primer achieves this without being dry!

Another popular C++ book is Bjarne Stroustrup's 'The C++ Programming Language'. I borrowed this from my university library before I bought C++ Primer and didn't really like it. I think it covers advanced material in more depth than C++ Primer, but it reads much more like a reference book than a tutorial. It is longer than C++ Primer and I suspect it contains a lot of material that is not relevant to what I'm doing.

The main thing that I've learnt over the last few months is that C(++) is not really one programming language.

Although you can compile and run programs written in C, old-style C++, and Modern C++ using the same compiler, they look and feel very different from each other. When I only knew C and old-style C++ I found it very difficult to understand programs written in Modern C++.

The C language first appeared in 1972. If you write in C then you have to learn about pointers and use arrays if you want to do calculations with vectors and matrices. Dynamic memory allocation is particularly awkward to implement. There are not that many concepts to learn, and I think it is not a bad language to learn if you are prepared to learn some basic low-level programming and want to write code that runs quickly. Because C is so simple it tends to be easier to interface with other languages, such as Fortran and Matlab.

C++ introduces features that are not available in C such as function overloading, inheritance, and the idea of data abstraction. It is designed to facilitate Object-Oriented Programming (OOP). Code written in OOP separates interface from implementation. It is designed to give full control to library developers and stop users from doing anything stupid. I generally find code written in OOP difficult to read. I like to work in an environment where I can implement algorithms myself, but also have access to a library of algorithms that I can call easily. Old-style C++ does not really seem to be set up for this.

In contrast, it is relatively easy to learn Modern C++, at least to the level where you can use it for simple programming tasks. There doesn't seem to be a precise definition of Modern C++, but this page has a fairly good summary - https://docs.microsoft.com/en-gb/cpp/cpp/welcome-back-to-cpp-modern-cpp. I think it is fair to say that I knew almost nothing about Modern C++ from the initial learning I did (which was around 10 years ago).

Although I am not generally a fan of new standards for programming languages, I have noticed that a lot of answers on Stack Overflow for C++ questions use C++11. Often solutions for different standards are presented and the C++11 standard looks nicer / more elegant.

If you want to write programs that run fast without having to learn much / anything about low-level programming, Modern C++ is probably the way to go. The vector and string classes in the Standard Library implement dynamic memory allocation without the user having to know anything about how this is implemented. So it is possible to write clean looking programs that are actually doing quite sophisticated operations.

Really understanding Modern C++, to the point where you can write high-quality libraries, requires a lot of effort. To give you some idea, C++ Primer starts with a section called The Basics, which runs to about 300 pages. It is long because there is a lot to cover, not because the writing is verbose! Object-Oriented Programming (including concepts such as inheritance) is not discussed until p600. And the section on Advanced Topics starts on p715.

In summary, the world of C++ is difficult to navigate, at least in my experience. C++ Primer is the best guide that I have found. Chapter 1 of C++ Primer on Getting Started (only 30 pages) is particularly impressive for the range of ideas that it introduces. I would recommend C++ Primer to people who are literally just getting started with programming as well as to people who want to write high-quality libraries.

Thursday, March 29, 2018

an ancient annal of computer science

Over the last year I have been interested in developing my programming / coding, to get to the point where I can be more confident of sharing my code with other people. And also to be able to contribute to general purpose numerical / statistical software.

As part of this effort I have dipped into The Art of Computer Programming (TAOCP) by Donald Knuth. The cover says "this multivolume work is widely recognized as the definitive description of classical computer science." American Scientist listed it as one of the 12 top physical-science monographs of the 20th century alongside monographs by the likes of Albert Einstein, Bertrand Russell, von Neumann and Wiener - http://web.mnstate.edu/schwartz/centurylist2.html.

I am sure there are many other books that cover similar material at a more introductory level, but I find something exciting about going back to the source and reading an author who was personally involved in fundamental discoveries and developments.

There are also probably more modern accounts of computer programming that better reflect more recent innovations. Knuth himself encourages readers of TAOCP to look at his more recent work on Literate Programming. But I also think it is worth dwelling on things that have proven to be useful to a wide range of people over an extended period of time.

I have Volume 1 in the Third Edition of TAOCP, published in 1997, which is already prehistoric in some senses - it is before Google was founded (1998) and way before Facebook was launched (2004). However parts of the book date a lot further back than that - Knuth's advice on how to write complex and lengthy programs was mostly written in 1964!

Here is a summary of that advice (p191-193 of TAOCP Volume 1),

Step 1 : develop a rough sketch of the main top-level program. Make a list of subroutines / functions that you will need to write. "It usually pays to extend the generality of each subroutine a little."
Step 2 : create a first working program starting from the lowest-level subroutines and working up to the main program.
Step 3 : Re-examine your code starting from the main program and working down studying for each subroutine all the calls made on it. Refactor your program and subroutines.

Knuth suggests that at the end of Step 3 "it is often a good idea to scrap everything and start again". He goes on to say "some of the best computer programs ever written owe much of the success to the fact that all the work was unintentionally lost, at about this stage, and the authors had to begin again." - quite a thought-provoking statement!

Step 4 : check that when you execute your program, everything is taking place as expected, i.e., debugging. "Many of today's best programmers will devote nearly half their programs to facilitating the debugging process in the other half; the first half, which usually consists of fairly straightforward routines that display relevant information in a readable format, will eventually be thrown away, but the net result is a surprising gain in productivity."

I don't know whether today's best programmers still do this. I know some pretty good programmers and have been surprised how much effort they devoted to the kind of activity that Knuth is describing. Personally I now rely quite a lot on the debugger in Visual Studio, and (indirectly) on compilers to give me most of the debugging information I need for not much effort.

Friday, March 2, 2018

Pensions for professors

It is not often that universities make front page news but the recent strike by university lecturers seems to have got quite a lot of media coverage.

On the surface it looks like quite a straightforward dispute about money. University vice-chancellors (represented by a body called Universities UK) are proposing to reduce the pensions that university staff will receive in the future. The reason they are doing this is that existing contributions to the pension fund for universities (the USS) are not expected to cover the cost of future pensions.

One political commentator, who I have a lot of respect for, Daniel Finkelstein, has said that lecturers are striking against themselves. He argues that increased contributions from universities to the USS would have a damaging effect on university lecturers. As a result of increasing contributions, universities would have to either pay lecturers a lower salary and/or employ fewer of them.

He also argues that it would be unfair for the government to increase funding to universities in order to pay generous pensions at a time when the NHS is strapped for cash, prisons seem to be nearing a state of anarchy and universities are already generously funded by students through expensive tuition fees. A large chunk of these tuition fees may end up being paid by the government if students are unable to pay back their loans.

While I find this line of reasoning quite persuasive, it seems to be predicated on the assumption that there will be an indefinite squeeze on the nation's finances. As a country we have had around 7 years of government austerity. Recent news suggests that this austerity has been successful in reducing the government deficit from around £100bn a year to zero - https://www.ft.com/content/3f7db634-1cac-11e8-aaca-4574d7dabfb6.

So will the squeeze be indefinite or are we approaching the end of it? Nobody really knows. As of 12 months ago, the OBR, which produces official forecasts of the government deficit, was still forecasting a large deficit for 2018-19. But tax receipts have been a lot stronger than expected. Speaking from personal experience, these things are difficult to forecast!

My view is that economic growth and tax receipts will be stronger than they have been for much of the last 10 years. As a result, the USS will probably not run out of money and if it does, the government should inject some extra cash to keep it afloat. There are many competing spending priorities for the government, but I think that attracting and retaining bright people across the public sector is essential. While there are many who are drawn to the public sector purely with a desire to contribute to society, generous public sector pensions do play a big role in encouraging people to stay. I think these pensions should continue so that public services can flourish as they ought to.

Friday, December 8, 2017

tools for writing code, life, the universe and everything - can anything beat emacs?

I have recently finished a 6 month placement with NAG (the Numerical Algorithms Group) based in Oxford. One of the things I picked up there was how to use emacs for writing code and editing other text.

Previously I have always written code in programs that are designed for specific languages, such as RStudio or Matlab.

Emacs is designed to be a more generic tool that, in principle, can be tailored to any kind of text editing, including coding. As a popular open source project emacs has many contributed packages. I used it mainly for writing code in Fortran, but it has modes for pretty much every widely used programming language. I also used it for writing LaTeX and for writing / editing To Do lists using Org mode.

Beyond its usefulness as a text editor emacs has many other functions. For example it has a shell, which behaves similarly to a command-line terminal but with the useful property that you can treat printed output as you would any other text. I find myself quite frequently wanting to copy and paste from terminal output, or to search for things, such as error messages. This is quick and easy in emacs.

So will I ever use anything other than emacs again,... for anything? I think truly hardcore emacs fans do use it for literally everything - email, web browsing, even games (the emacs 'amusements'). But I am not part of that (increasingly exclusive) club. I find emacs a pain for things that you do infrequently - a shortcut isn't really a shortcut if you have to use Google to remind you what it is!

I think the two main selling points of emacs are (i) anything that you do repeatedly using a mouse, you will be able to do at least as quickly in emacs, (ii) it does great syntax highlighting of pretty much any kind of text.

Thursday, March 30, 2017

Statistics in medicine

Last week I went to the AZ MRC Science Symposium organised jointly by AstraZeneca and the MRC Biostatistics Unit. Among a line-up of great speakers was Stephen Senn, who has an impressively encyclopaedic knowledge of statistics and its history, particularly relating to statistics in medicine. Unfortunately his talk was only half an hour and in the middle of the afternoon when I was flagging a bit, so I came away thinking 'that would have been really interesting if I had understood it.' In terms of what I remember, he made some very forceful remarks directed against personalised medicine, i.e., giving different treatments to different people based on their demography or genetics. This was particularly memorable because several other speakers seemed to have great hopes for the potential of personalised medicine to transform healthcare.

His opposition to personalised medicine was based on the following obstacles, which I presume he thinks are insurmountable.

Large sample sizes are needed to test for effects by sub-population. This makes it much more expensive to run a clinical trial than the more traditional case where you only test for effects at the population level.
The analysis becomes more complicated when you include variables that cannot be randomized. Most demographic or genetic variables fall into this category. He talked about Nelder's theory of general balance which can apparently account for this in a principled way. Despite being developed in the 1970s it has been ignored by a lot of people due to its complexity.
Personalised treatment is difficult to market. I guess this point is about making things as simple as possible for clinicians. It is easier to say use treatment X for disease Y, instead of use treatment X_i for disease variant Y_j in sub-population Z_k.

Proponents of personalised medicine would argue that all these problems can be solved through the effective use of computers. For example,

Collecting data from GPs and hospitals may make it possible to analyse large samples of patients without needing to recruit any additional subjects for clinical trials.
There is already a lot of software that automates part or all of complicated statistical analysis. There is scope for further automation, enabling the more widespread use of complex statistical methodology.
It should be possible for clinicians to have information on personalised effects at their fingertips. It may even be possible to automate medical prescriptions.

It's difficult to know how big these challenges are. Some of the speakers at the AZ MRC symposium said things along the lines of 'ask me again in 2030 whether what I'm doing now is a good idea.' This doesn't exactly inspire confidence, but at least is an open and honest assessment.

As well as commenting on the future, Stephen Senn has also written a lot about the past. I particularly like his description of the origins of Statistics in chapter 2 of his book 'Statistical Issues in Drug Development',

Statistics is the science of collecting, analysing and interpreting data. Statistical theory has its origin in three branches of human activity: first the study of mathematics as applied to games of chance; second, the collection of data as part of the art of governing a country, managing a business or, indeed, carrying out any other human enterprise; and third, the study of errors in measurement, particularly in astronomy. At first, the connection between these very different fields was not evident but gradually it came to be appreciated that data, like dice, are also governed to a certain extent by chance (consider, for example, mortality statistics), that decisions have to be made in the face of uncertainty in the realms of politics and business no less than at the gaming tables, and that errors in measurement have a random component. The infant statistics learned to speak from its three parents (no wonder it is such an interesting child) so that, for example, the word statistics itself is connected to the word state (as in country) whereas the words trial and odds come from gambling and error (I mean the word!), has been adopted from astronomy.

Monday, March 6, 2017

Pushing the boundaries of what we know

I have recently been dipping into a book called 'What we cannot know' by Marcus du Sautoy. Each chapter looks at a different area of physics. The fall of a dice is used as a running example to explain things like probability, Newton's Laws, and chaos theory. There are also chapters on quantum theory and cosmology. It's quite a wide-ranging book, and I found myself wondering how the author had found time to research all these complex topics, which are quite different from each other. That is related to one of the messages of the book - that one person cannot know everything that humans have discovered. It seems like Marcus du Sautoy has had a go at learning everything, and found that even he has limits!

I think the main message of the book is that many (possibly all) scientific fields have some kind of knowledge barrier beyond which it is impossible to pass. There are fundamental assumptions which, when you assume they are true, explain empirical phenomena. The ideal in science (at least for physicists) is to be able to explain a wide range of (or perhaps even all) empirical phenomena from a small set of underlying assumptions. But science cannot explain why its most fundamental assumptions are true. They just are.

This raises an obvious question: where is the knowledge barrier? And how close are we to reaching it? Unfortunately this is another example of something we probably cannot know.

In my own field of Bayesian computation, I think there are limits to knowledge of a different kind. In Bayesian computation it is very easy to write down what we want to compute - the posterior distribution. It is not even that difficult to suggest ways of computing the posterior with arbitrary accuracy. The problem is that, for a wide range of interesting statistical models, all the methods that have so far been proposed for accurately computing the posterior are computationally intractable.

Here are some questions that could (at least in principle) be answered using Bayesian analysis. What will earth's climate be like in 100 years' time? Or, given someone's current pattern of brain activity (e.g. EEG or fMRI signal) how likely are they to develop dementia in 10-20 years' time?

These are both questions for which it is unreasonable to expect a precise answer. There is considerable uncertainty. I would go further and argue that we do not even know how uncertain we are. In the case of climate we have a fairly good idea of what the underlying physics is. The problem is in numerically solving physical models at a resolution that is high enough to be a good approximation to the continuous field equations. In the case of neuroscience, I am not sure we even know enough about the physics. For example, what is the wiring diagram (or connectome) for the human brain? We know the wiring diagram for the nematode worm brain - a relatively recent discovery that required a lot of work. The human brain is a lot harder! And even if we do get to the point of understanding the physics well enough, we will come up against the same problem with numerical computation that we have for the climate models.

There is a different route that can be followed to answering these questions, which is to simplify the model so that computation is tractable. Some people think that global temperature trends are fitted quite well by a straight line (see Nate Silver's book 'The Signal and the Noise'). When it comes to brain disease, if you record brain activity in a large sample of people and then wait 10-20 years to see whether they get the disease, it may be possible to construct a simple statistical model that predicts people's likelihood of getting the disease given their pattern of brain activity. I went to a talk by Will Penny last week, and he has made some progress in this area using an approach called Dynamic Causal Modelling.

I see this as a valuable approach, but somewhat limited. For its success it relies on ignoring things that we know. Surely by including more of what we know it should be possible to make better predictions? I am sometimes surprised by how often the answer to this question is 'not really' or 'not by much'.

What is computable with Bayesian analysis is still an open question. This is both frustrating and motivating. Frustrating because a lot of things that people try don't work, and we have no guarantee that there are solutions to the problems we are working on. Motivating because science as a whole has a good track record of making the seemingly unknowable known.

Tuesday, December 6, 2016

Writing tools, silence & parenthood

Collaborative writing tools

I have been working on a paper recently with two co-authors. It has been a bit of a challenge finding the right pieces of software that will allow us to track edits while remaining in LaTeX. When I worked in the civil service, Word was the de facto software for producing written documents. It was a lot better than I thought it would be, and I still think the Track Changes functionality beats everything else I have tried hands down when it comes to collaborative editing. I also learnt that, using Word, you can produce documents with typesetting that looks professional, if you know what you are doing, and if someone has invested the time in creating a good template for your needs. However in the last couple of years I have returned to LaTeX, because it is what mathematicians use, and because I find it better for equations, and for references.

In the last few weeks I have been trying out Overleaf. This is one of a handful of platforms for LaTeX documents with collaboration tools. As with a lot of good user-friendly pieces of software you have to pay to get the most useful features. With Overleaf, the free service provides a workable solution. Overleaf allows you to access your LaTeX documents through a web browser, and multiple people can edit the same online version. In the free version there are some basic bells and whistles, like being able to archive your work. I found this a bit confusing at first because I thought it was setting up multiple active versions with some kind of forking process. However this is not the case.

By combining Overleaf with git I have been able to fork the development process: I can edit one branch on my local computer (using my preferred LaTeX editor and compiler), while another person edits a different branch in the online version, or potentially on another computer. Using git also makes it easy to create a change log, and visualise differences between different versions, although this doesn't work quite as well for paragraphs of text as it does for code. Unless you put lots of line breaks into your paragraphs, you can only see which paragraphs have changed, and not which individual sentences have changed.

In the news...

2016 is drawing to a close and it has been a pretty shocking year for a lot of people in terms of national and global news. In the last few weeks, I have found an increasing tendency for people to be silent - to not want to talk about certain issues any more (you know what I mean - the T word and the B word). I guess this is partly because some topics have been talked to death, and nothing new is emerging, while a lot of uncertainty remains. However I also find it a bit worrying, that people may no longer be capable of meaningful engagement with people of different opinions and backgrounds. One thing I have become more convinced of over the last year is that blogs and tweets etc. are not a particularly helpful way of sharing political views (a form of silent outrage!?) So maybe the less I say here the better, even though I do remain passionately interested in current affairs and am fairly opinionated.

And in other news...

I have a baby boy! Born 4 weeks ago - both he and my wife are doing well. In the first 2 weeks I took a break from my PhD, and it was a bit like being on holiday, in that we had a lot of time, and a lot of meals cooked for us (by my wonderful mum). It hasn't all been plain sailing, but I am now under oath not to share the dark side of parenthood - especially not with non-parents, in case it puts them off! The last 2 weeks I have been getting back into my PhD. It is quite hard finding a schedule that works. We have a routine where he is supposed to be more active and awake between 5pm and 7pm, so that he sleeps well between 7pm and 7am. I have been trying to do a bit of work after he is settled in the evening and found it fairly challenging to be motivated and focused at that time. I have been wondering whether it would work better to try and get up before him in the mornings. I guess it will probably be challenging either way.