Wednesday, December 9, 2015

How to structure code for individual research projects

I have been chatting a bit about coding recently with my supervisor (Richard Everitt).  We have a new student joining the group soon, so I think Richard has been considering which skills they may need to learn, as well as what the rest of us in the group might benefit from.  I have found it surprising how much my approach to coding continues to evolve, particularly the way that I organise my code.

I started coding in R during my master's, where we had a lot of relatively short assignments that involved some computer programming.  For this kind of task it was mostly possible to write a single script containing all the code needed for a given assignment.  That is quite easy to organise.

When I got into longer research projects that involved a significant amount of coding, I found that this approach was no longer very effective.  It becomes difficult to keep track of which files are recent and being actively used and which are outdated and redundant.  It also becomes difficult to find older versions of your code (e.g. when your most recent code has stopped working for some unknown reason and you want to find what you had last week when it was working).

My first solution to this problem was simply to number my scripts, starting at 1 and working upwards (e.g. r1.r, r2.r, r3.r etc.).  If you put all the scripts in the same folder and give the folder a project name then it is still relatively easy to find your code, and you have a very easy way of switching between versions, provided that you increment your file-names every time you make a significant change.
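As a rough illustration of the numbering scheme above, here is a minimal sketch of a helper that works out the next file-name to use.  It is written in Python rather than R purely for illustration, and the function name next_script_name and its default prefix and extension are hypothetical choices, not something from my own workflow.

```python
import re
from pathlib import Path


def next_script_name(folder, prefix="r", ext=".r"):
    """Return the next unused numbered script name in `folder`, e.g. 'r4.r'.

    The prefix and extension defaults are assumptions that mirror the
    r1.r, r2.r, r3.r naming used as an example in the post.
    """
    pattern = re.compile(
        rf"^{re.escape(prefix)}(\d+){re.escape(ext)}$", re.IGNORECASE
    )
    numbers = []
    for path in Path(folder).iterdir():
        match = pattern.match(path.name)
        if path.is_file() and match:
            # Compare as integers so that r10.r sorts after r9.r.
            numbers.append(int(match.group(1)))
    # An empty folder starts the sequence at 1.
    return f"{prefix}{max(numbers, default=0) + 1}{ext}"
```

Calling next_script_name("my_project") on a folder that already holds r1.r and r2.r would suggest r3.r; files that do not match the pattern are ignored.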

However, this approach is still quite restrictive in the sense that it only really works well when you can put all the code for a given project in one script / file.

After a few iterative improvements, I am currently quite happy with the following approach to organising my code:

  • Create a new folder for a new project.
  • Create a subfolder for code development that contains numbered scripts.
  • Develop code in numbered scripts and then, when you have an end-product, save it in the parent folder and give it a name that describes what it does.
  • Put any functions that are used in multiple scripts in a functions file so that you can re-use the same version of that function in multiple scripts.  In general, the less redundant / duplicated code in your active scripts the better.
  • Create an output subfolder to save the results of running long scripts.  After you have run a long script and generated an output file, rename it by prefixing it with a date so that it doesn't get accidentally overwritten in future.
  • Use a version control system like git with Bitbucket.  Version control allows you to easily switch between different versions of your code / project, and to keep a well-structured record of the changes you have made.

It has taken me a long time to become convinced that version control systems like git are worth the effort for individual research projects, but I am a convert now.  My main objection was that many of the things I wanted to do were easier to do using Dropbox (e.g. sharing code with other people, and being able to access my work on multiple computers).  You can even find old versions of your files through Dropbox.  However, in the end I found that going from Dropbox to git / Bitbucket was like going from Word to LaTeX.  Once I got the hang of it, suddenly it seemed a lot quicker and a lot less hassle in the long term.  RStudio has a nice graphical interface to git, which I now use every day for committing and pushing my code.

Version control is also useful if you ever want to work on software development in a group, or to make your code publicly available.
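
For anyone who wants to script the folder layout from the list above, the following is a minimal sketch under stated assumptions.  It is in Python for illustration only, and the names create_project, date_prefix, "dev", "output" and "functions.r" are hypothetical; the post does not prescribe any particular folder or file names, nor a particular date format (ISO year-month-day is assumed here because it sorts chronologically).

```python
from datetime import date
from pathlib import Path


def create_project(root):
    """Create a project folder with development and output subfolders.

    The subfolder names 'dev' and 'output' and the shared 'functions.r'
    file are illustrative assumptions, not fixed conventions.
    """
    root = Path(root)
    (root / "dev").mkdir(parents=True, exist_ok=True)   # numbered scripts live here
    (root / "output").mkdir(exist_ok=True)              # results of long runs
    (root / "functions.r").touch()                      # functions shared across scripts
    return root


def date_prefix(path, day=None):
    """Rename an output file by prefixing its name with an ISO date."""
    path = Path(path)
    day = day or date.today()
    target = path.with_name(f"{day.isoformat()}_{path.name}")
    # Refuse to clobber an earlier result that already has this name.
    if target.exists():
        raise FileExistsError(target)
    path.rename(target)
    return target
```

For example, date_prefix("my_project/output/results.csv") run on 9 December 2015 would rename the file to 2015-12-09_results.csv, so a later run that writes results.csv again leaves the dated copy untouched.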