• NHS-R Member Profile – Robin Hinks

    R is a powerful tool for manipulating health and care data and a lot can be learned from sharing our experiences of using R with others. We bring to you an NHS-R profile from one of our Community members, to share their insider knowledge of using R…

    Robin Hinks
    Research and Policy Officer
    Chartered Society of Physiotherapy

    How did you first meet R?
While working as a civil service analyst, I was encouraged to undertake self-directed learning in R to conduct statistical and geographic analysis.
    What sort of things do you use R for and what do you love about R?
Through previous roles – where I have done quality assurance and validation of other research teams’ work – I know the value of well-documented analytical processes, and the dangers of poor record keeping! I love how R allows you to keep all your analysis, research notes and – through R Markdown – reporting in one place.
    What do you hate about R?
I have a qualitative research background and using R has been my first real exposure to code development. While I found the move from, say, SPSS’ ‘point and click’ environment easy enough, I have found it difficult to get my head round the wider world of code development that surrounds R: learning about pulls, commits, splits and the like has been challenging!
    What are your top tips for using R?
Start by coding up some tasks you already have a process for elsewhere – e.g. automating some data transformations you’ve previously done in a program like SPSS or Excel. Working out how to translate a task into R’s environment is a lot easier than starting from a blank slate.
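For example, a monthly summary you might previously have built with an Excel pivot table translates into a short dplyr pipeline. The data below are invented purely for illustration:

```r
library(dplyr)

# Invented referral counts, standing in for a spreadsheet extract
referrals <- data.frame(
  month  = rep(c("Jan", "Feb"), each = 3),
  clinic = c("A", "B", "C", "A", "B", "C"),
  count  = c(10, 15, 7, 12, 18, 9)
)

# The pivot-table equivalent: total referrals per month
monthly_totals <- referrals %>%
  group_by(month) %>%
  summarise(total = sum(count), .groups = "drop")

monthly_totals
```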
    Can you please name a project where you have used R? Briefly describe what this involves.
Health data and statistics are reported against a range of geographies that do not easily match up with the political geographies our members might seek to influence – e.g. parliamentary constituencies or local authorities. I’ve used R to develop lookup tables between different geographic areas, and used the leaflet package to map different geographic areas visually,
developing simple choropleth and point maps for internal insight work.
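As a stripped-down sketch of that lookup-table approach (the ward and constituency codes here are invented, not real ONS codes): join the small-area data to the lookup, then aggregate to the geography you need.

```r
# Invented lookup between wards and parliamentary constituencies
lookup <- data.frame(
  ward_code         = c("W01", "W02", "W03"),
  constituency_code = c("C01", "C01", "C02")
)

# A health metric reported at ward level
ward_data <- data.frame(
  ward_code = c("W01", "W02", "W03"),
  patients  = c(120, 80, 200)
)

# Join on the shared ward code, then sum up to constituency level
joined <- merge(ward_data, lookup, by = "ward_code")
constituency_totals <- aggregate(patients ~ constituency_code,
                                 data = joined, FUN = sum)
constituency_totals
```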

  • Annotating SPC plots using annotate with ggplot

    Annotating SPC plots using annotate with ggplot

Statistical Process Control (SPC) charts are widely used in healthcare analytics to examine how a metric varies over time and whether that variation is abnormal. Christopher Reading has already published a blog on SPC charts, and you can also read more about SPC charts here or here.

Here is a simple example of annotating points and text on SPC plots using the ggplot2 package. We won’t explain all the parameters of the annotate function; instead, see this as a short show-and-tell piece, with signposts at the end of the blog.

So let’s get started and generate some dummy data from a normal distribution with a mean of 0 and a standard deviation of 1.

library(tibble) # for tibble(); loaded automatically with the tidyverse

set.seed(2020) # set the random number seed to ensure you can replicate this example
y <- rnorm(30, 0, 1) # generate 30 random numbers for the y-axis
y <- c(y, rep(NA, 10)) # add 10 NAs to extend the plot (see later)
x <- 1:length(y) # generate the x-axis
df <- tibble(x = x, y = y) # store as a tibble data frame for convenience

Now we can plot the data using the ggplot() function.

library(ggplot2)

fig1 <-
  ggplot(df, aes(x, y)) +
  geom_point(size = 2) +
  geom_line() +
  ylim(-4, 4) # increase the y-axis range to aid visualisation
fig1 # plot the data

One of the main features of SPC charts is the pair of upper and lower control limits. We can now plot this as an SPC chart with lower and upper control limits set at 3 standard deviations from the mean. Although in practice the calculation of control limits differs from this demo, for simplicity we treat the control limits and the mean as fixed numbers. Alternatively, you could use the qicharts2 package to do the SPC calculations and then apply the following steps to the ggplot2 object it generates.

    fig1 <- 
      fig1 +
      geom_hline(yintercept = c(3,0,-3), linetype='dashed') +  # adds the upper, mean and lower lines
      annotate("label", x=c(35,35,35), y=c(3,0,-3), color='darkgrey', label= c("Upper control limit", "Average line", "Lower control limit"), size=3) # adds the annotations
    fig1 # plot the SPC

Remarkably, we see a point below the lower control limit even though the data are purely pseudo-random: a nice reminder that control limits are guidelines, not hard and fast tests of non-randomness. We can now highlight this data point, which would be flagged as special cause variation even though it is clearly a false signal.

    fig1 <- 
      fig1 +
      annotate("point", x=18, y=df$y[18], color='orange', size=4) + 
      annotate("point", x=18, y=df$y[18])
    fig1 # plot the SPC with annotations

We can now add a label for the special cause data point. You can play around with the vjust value (e.g. try -1, 0, 1) to get a feel for what it is doing to the vertical position of the label. There is also an hjust argument, which operates on the horizontal plane.

    fig1 <- 
      fig1 +
      annotate("label", x=18, y=df$y[18], vjust=1.5, label = "Special cause variation", size=3)
    fig1 # plot the SPC with more annotations

To learn more about the annotate function, see https://ggplot2.tidyverse.org/reference/annotate.html and https://www.gl-li.com/2017/08/18/place-text-at-right-location/

  • SPC Charting in R

By Christopher Reading, Divisional Information Specialist, Worcestershire Acute Hospitals NHS Trust

    For some time the Information Department at Worcestershire Acute NHS Trust has been making use of statistical process control (SPC) charts for internal reporting purposes. This has formed part of our drive toward better decision-making (as recommended in NHSI’s Making Data Count https://improvement.nhs.uk/resources/making-data-count/).

In doing so we have made extensive use of NHSI’s Excel-based SPC tool and have also sought to implement this methodology outside of the confines of MS Excel (i.e. within our SQL/SSRS based reporting suite).

    As the Department’s unofficial ‘R Champion’, I have driven efforts to increase my team’s knowledge and usage of R over the last six months. My experience with NHSI’s resources suggested that R offered a route to more efficient, consistent and quickly reproducible SPC charting. I set about developing a charting function within R which would replicate NHSI’s logic and methodology[1].

    I developed and tested a custom function in R which requires two primary arguments: a series of data points, and a set of accompanying date values. The function then creates a data frame containing the data series, its mean and moving average values, and upper and lower control limits. The series is then tested against NHSI’s methodology, and special cause variations are highlighted and colour coded. This formatting is done according to a secondary function argument which identifies whether an increase or decrease in the series indicates system improvement. This data frame is then visualised using ggplot, which displays the SPC and any additional information such as a performance trajectory or national target.
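The departmental function itself is not published here, so the sketch below is only a rough illustration of the limit calculation such a function might perform: an XmR-style chart using the common mean ± 2.66 × average moving range limits. The function name and arguments are invented.

```r
# Illustrative XmR-style limits: NOT the departmental function
# described above, just the standard moving-range calculation
spc_data <- function(values, dates) {
  mr     <- abs(diff(values)) # moving ranges between successive points
  centre <- mean(values)
  amr    <- mean(mr)          # average moving range
  data.frame(
    date  = dates,
    value = values,
    mean  = centre,
    ucl   = centre + 2.66 * amr, # upper control limit
    lcl   = centre - 2.66 * amr  # lower control limit
  )
}

spc <- spc_data(c(10, 12, 11, 14, 13, 12), as.Date("2020-01-01") + 0:5)
spc
```

The resulting data frame can then be passed to ggplot, as in the departmental workflow described above.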

    I then tested the function and compared against our existing SPC reporting. A few logical ‘gremlins’ in the function were identified and subsequently removed, and once I was happy with the function it was integrated into a growing departmental R package (currently only internally available) for use in R Markdown reporting and our expanding R Shiny dashboard repertoire.

    My next step was to use Shiny to create an SPC Wizard app, to enable colleagues without R knowledge to test and utilise the SPC function. The app allows users to supply CSV files containing multiple data series, and generate SPC charts with little or no effort. These can then be exported as image files for Trust reporting. The app allows users to make formatting changes to the chart such as customising main and axis titles, customising the frequency of axis labels and size of point and line geoms (chart objects) for lengthy data series. It also allows users to specify multiple data series at a time to create ‘small multiple’ SPC charts for simultaneous analysis.

    The project provided an excellent challenge in developing my Shiny skills, and provided an opportunity to utilise the visually impressive and professional appearance of the ShinyDashboard package. Development of this Shiny app also led to a challenging project of setting up a Linux based Shiny server, to allow hosting of the app for colleagues to use.

    A key advantage of this function-based approach is that the SPC methodology is now available for use by all analysts within the Department, and can be implemented with a minimum of coding. One of the primary difficulties with SQL based SPC logic encountered by our team was the length of code required to produce the chart data, and therefore the increased risk of error when recycling this code for different reports. The simplicity and self-contained nature of the SPC function avoids this.

    Having successfully tested and embedded the SPC function within an ad-hoc SPC wizard, I have continued to develop a Shiny Performance Dashboard for Cancer KPIs. This rapidly produces SPC charting for 2-Week-Wait Referral and 62-Day Cancer Treatment metrics from live data pulled from our SQL warehouse. I hope this will be the first of many dashboards to take advantage of an easily available and consistent SPC methodology, allowing our Department to create reports and dashboards which are better able to communicate the nature of changing time series to Trust decision-makers, and to track and evaluate the impact of operational management decisions.

    Despite the (at times steep!) learning curve involved, from creating the initial function and replicating NHSI’s SPC logic, to setting up the Shiny server and deploying apps for use, this project has been an excellent way to develop my R skills and to demonstrate the value in embedding use of R within our organisation, and making it part of our toolkit for ‘business as usual’ analysis.

    I hope that next steps for this project will be sharing our methodology with other NHS organisations, to allow further input and development of the methodology and reporting applications. Recently there have been discussions around a collaboration with other NHS Trusts and the Strategy Unit, regarding the possibility of developing an SPC package and shiny app to be available to all NHS organisations. If you would like to learn more or take part in the discussions, please join us on the NHS-R community slack channel (nhsrcommunity.slack.com) and let us know your thoughts on an SPC package, and what you might want to see as part of it!

[1] For those not familiar with the Making Data Count resources, the SPC tool is based around a moving average measurement of sigma and significant variations in data based on this value. These include the identification of any data points above or below three sigma; sequences of consecutive data points above/below the mean; runs of consecutively increasing/decreasing data points; and two out of three data points greater (or less) than 2 sigma.
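Two of those rules can be sketched in a few lines of R. This is an illustration rather than NHSI’s published logic, and the run-length threshold is left as a parameter because it varies between publications:

```r
# Rule 1: any point outside the control limits
outside_limits <- function(values, ucl, lcl) {
  values > ucl | values < lcl
}

# Rule 2: flag points sitting in a run of `run_length` or more
# consecutive values on the same side of the centre line
in_long_run <- function(values, centre, run_length = 7) {
  side <- sign(values - centre) # +1 above, -1 below, 0 on the line
  runs <- rle(side)
  rep(runs$lengths >= run_length & runs$values != 0, runs$lengths)
}

outside_limits(c(5, 2, -1), ucl = 4, lcl = 0) # TRUE FALSE TRUE
```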

  • NHSRdatasets meets runcharter

    This post was originally published on johnmackintosh.com but has kindly been re-posted to the NHS-R community blog.


    The NHSRdatasets package made it to CRAN recently, and as it is designed for use by NHS data analysts, and I am an NHS data analyst, let’s take a look at it. Thanks to Chris Mainey and Tom Jemmett for getting this together.

    Load packages and data

As above, let’s load what we need for this session. The runcharter package is built using data.table, but I’m using dplyr in this main section to show that you don’t need to know data.table to use it.

library(runcharter) # remotes::install_github("johnmackintosh/runcharter")

However – seriously, do take a look at data.table. It’s not as hard to understand as some might have you believe, and a little bit of effort pays off. You can load the runcharter package from GitHub using the remotes package. (I’ve managed to install it on Windows and Ubuntu. Mac OS? No idea; I have no experience of that.)

library(NHSRdatasets) # the package the data comes from

ae <- data("ae_attendances") # a promise
ae <- ae_attendances # a string
rm(ae_attendances) # just typing 'ae' brings it to life in the environment

    That felt a bit glitchy. There has to be a sleeker way to load and assign a built in dataset but I couldn’t find one. Cursory google here.
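For what it’s worth, two slightly sleeker alternatives (both assuming the NHSRdatasets package is installed): scope data() to the package, or skip data() entirely and use the ::-qualified object.

```r
library(NHSRdatasets)

# Option 1: load the lazy-data object, then assign and tidy up
data("ae_attendances", package = "NHSRdatasets")
ae <- ae_attendances
rm(ae_attendances)

# Option 2: one line, no extra object left in the environment
ae <- NHSRdatasets::ae_attendances
```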

    Let’s have a look at the data:

    ## Observations: 12,765
    ## Variables: 6
    ## $ period      <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01, 2...
    ## $ org_code    <fct> RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, RQM...
    ## $ type        <fct> 1, 2, other, 1, 2, other, other, other, 1, other, ...
    ## $ attendances <dbl> 21289, 813, 2850, 30210, 807, 11352, 4381, 19562, ...
    ## $ breaches    <dbl> 2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 1322...
    ## $ admissions  <dbl> 5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, 0,...

Lots of factors. I’m actually very grateful for this package, as it caused me major issues when I first tried to plot this data using an earlier version of runcharter. I hadn’t considered factors as a possible grouping variable, which was a pretty big miss, as all the facets were out of order. All sorted now.

There’s way too much data for my tiny laptop screen, so I will filter the data for type 1 departments – the package help gives us a great link explaining what this means.

library(dplyr)

type1 <- ae %>% 
  filter(type == 1)

# plot attendances
p <- runcharter(type1,
                med_rows = 13, # median of first 13 points
                runlength = 9, # find a run of 9 consecutive points
                direction = "above", # find run above the original median
                datecol = "period",
                grpvar = "org_code",
                yval = "attendances")

The runcharter function returns both a plot and a data.table/data.frame showing a summary of any runs in the desired direction. (I’m assuming folk reading this have a passing knowledge of run charts, but if not, you can look at the package vignette, which is the cause of most of my commits!!)

Don’t try loading the plot right now, because it is huge and takes ages. If we look at the summary data frame, we can see 210 rows, a fairly decent portion of which relate to significant increases above the original median value.

    ##      org_code median start_date   end_date  extend_to  run_type
    ##   1:      R0A  21430 2017-10-01 2018-10-01 2019-03-01  baseline
    ##   2:      R1F   3477 2016-04-01 2017-04-01 2017-05-01  baseline
    ##   3:      R1H  28843 2016-04-01 2017-04-01 2019-03-01  baseline
    ##   4:      R1K  11733 2016-04-01 2017-04-01 2019-03-01  baseline
    ##   5:      RA2   5854 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  ---                                                           
    ## 206:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 207:      RLT   6977 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 208:      RQ8   8456 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 209:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 210:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained

    Let’s use skimr to get a sense of this

    ## Skim summary statistics
    ##  n obs: 210 
    ##  n variables: 6 
    ## -- Variable type:character -----------------------------------------------------------
    ##  variable missing complete   n min max empty n_unique
    ##  run_type       0      210 210   8   9     0        2
    ## -- Variable type:Date ----------------------------------------------------------------
    ##    variable missing complete   n        min        max     median n_unique
    ##    end_date       0      210 210 2017-04-01 2019-03-01 2017-04-01        9
    ##   extend_to       0      210 210 2017-05-01 2019-03-01 2019-03-01        7
    ##  start_date       0      210 210 2016-04-01 2018-07-01 2016-04-01        9
    ## -- Variable type:factor --------------------------------------------------------------
    ##  variable missing complete   n n_unique                     top_counts
    ##  org_code       0      210 210      139 RA4: 3, RDD: 3, RDE: 3, RGN: 3
    ##  ordered
    ##     TRUE
    ## -- Variable type:numeric -------------------------------------------------------------
    ##  variable missing complete   n   mean      sd   p0     p25  p50      p75
    ##    median       0      210 210 9389.8 4317.54 3477 6468.25 8413 11311.25
    ##   p100     hist
    ##  29102 <U+2586><U+2587><U+2585><U+2581><U+2581><U+2581><U+2581><U+2581>

To keep this manageable, I’m going to filter for areas that have median attendances > 10000 (based on the first 13 data points).

high_admits <- p$sustained %>% 
  filter(median > 10000 & run_type == "sustained")

    Then I change the org_code from factor to character, and pull out unique values. I’m sure there is a slicker way of doing this, but it’s getting late, and I don’t get paid for this..

    I use the result to create a smaller data frame

    high_admits$org_code <- as.character(high_admits$org_code)
    type1_high <- type1 %>% 
      filter(org_code %in% high_admits$org_code)
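One slicker route for the factor-to-character step is to keep everything in a single dplyr pipeline; shown here on a small invented stand-in for high_admits:

```r
library(dplyr)

# Stand-in for the real high_admits data frame
high_admits <- data.frame(org_code = factor(c("RA1", "RA1", "RB2")))

# De-duplicate, extract the column and convert in one go
codes <- high_admits %>%
  distinct(org_code) %>%
  pull(org_code) %>%
  as.character()

codes # "RA1" "RB2"
```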

And now I can produce a plot that fits on screen. I’ve made the individual scales free along the y-axis, and added titles etc.

    p2 <- runcharter(type1_high,
                     med_rows = 13, # median of first 13 points as before
                     runlength = 9, # find a run of 9 consecutive points
                     direction = "above",
                     datecol = "period",
                     grpvar = "org_code",
                     yval = "attendances", 
                     facet_scales = "free_y",
                     facet_cols = 4,
                     chart_title = "Increased attendances in selected Type 1 AE depts",
                     chart_subtitle = "Data covers 2016/17 to 2018/19",
                     chart_caption = "Source : NHSRDatasets",
                     chart_breaks = "6 months")

    Let’s look at the sustained dataframe

    ##     org_code median start_date   end_date  extend_to  run_type
    ##  1:      RCB   9121 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  2:      RDD  11249 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  3:      RDE   7234 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  4:      RGN   7912 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  5:      RJ1  12240 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  6:      RJE  14568 2016-04-01 2017-04-01 2018-05-01  baseline
    ##  7:      RJL  11262 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  8:      RQM  16478 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  9:      RRK   9584 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 10:      RTE  11303 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 11:      RTG  11344 2016-04-01 2017-04-01 2018-07-01  baseline
    ## 12:      RTR  10362 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 13:      RVV  12700 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 14:      RW6  22114 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 15:      RWE  12275 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 16:      RWF  11939 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 17:      RWP   9976 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 18:      RXC   9396 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 19:      RXH  12494 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 20:      RXP  10727 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 21:      RYR  11578 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 22:      RCB  10062 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 23:      RDD  12093 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 24:      RDE   7637 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 25:      RGN  11896 2017-05-01 2018-01-01 2018-05-01 sustained
    ## 26:      RJ1  13570 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 27:      RJE  15183 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 28:      RJL  11972 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 29:      RQM  18560 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 30:      RRK  29102 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 31:      RTE  11772 2017-05-01 2018-01-01 2018-05-01 sustained
    ## 32:      RTG  17169 2018-07-01 2019-03-01 2019-03-01 sustained
    ## 33:      RTR  10832 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 34:      RVV  13295 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 35:      RW6  22845 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 36:      RWE  18173 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 37:      RWF  12793 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 38:      RWP  10358 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 39:      RXC  10279 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 40:      RXH  13158 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 41:      RXP  11314 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 42:      RYR  11970 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 43:      RDD  12776 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 44:      RDE  15322 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 45:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 46:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 47:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained
    ##     org_code median start_date   end_date  extend_to  run_type

    And of course, the plot itself


    I haven’t looked into the actual data too much, but there are some interesting little facets here – what’s the story with RDE, RRK and RTG for example? I don’t know which Trusts these codes represent, but they show a marked increase. Of course, generally, all areas show an increase at some point.

The RGN (top right) and RVV (mid left) facets show the reason why I worked on this package – we can see that there has been more than one run above the median. Performing this analysis in Excel is not much fun after a while.

    There is a lot more I can look at with this package, and we in the NHS-R community are always happy to receive more datasets for inclusion, so please contribute if you can.

  • How NHS-R Community do The Apprentice…

    By Zoe Turner

One of the tasks on The Apprentice a number of years ago was for the contestants to put on a corporate event, at no small cost to the people attending, I might add. It’s a tale often told because one of the contestants was gluten free and no one had accounted for dietary needs amongst the contestants, so the poor ‘gluten free lady’, as she was known, was served a fruit salad.

The reason I’m apparently going off on a tangent so early in a blog is that it struck me that The Apprentice is all about throwing people in at the deep end and seeing how they cope. It’s entertainment, but it clashes with the premise that these are potential apprentices to a ‘British business magnate’ (as Wikipedia calls him). Contrast this with NHS-R and how I came to be attending the Train the Trainer event at the end of 2019 and then helped to run the first of 3 local courses this January, having only just started learning R around 2 years ago.

    Analysts have many expectations made of them. They have to be technical, able to interpret data and communicate themselves clearly to non-analysts. Very rarely though will an analyst be expected to train others. Some may produce or present some training to support or mentor fellow analysts, and even then my experience has always been on the receiving end. Coupled with the fact that I’ve never really had a burning desire to teach, it was a surprise to find myself on a course on how to deliver the NHS-R ‘Introduction to R’ workshop.

The reason I did it is that my involvement with NHS-R has led to this natural consequence of training others. I started with attending the course myself, then presented at the conference and facilitated an Introduction Course run by NHS-R but hosted by my Trust. I then didn’t hesitate in agreeing to take on the training.

NHS-R Community held their first two-day Train the Trainer event in Birmingham, organised through AphA (Association of Professional Healthcare Analysts). I was supported to go on this by my manager, Chris Beeley, who is a huge advocate of R and Shiny. Whilst he himself has run several workshops over the years, I, notably, have run zero!

    At the TtT (got to get an acronym in there) I had the opportunity to meet lots of motivated people from around the British Isles who were as keen as I was, not only to learn how to teach R but also to talk about data – that happened quite a bit in the breaks. We had an opportunity to present to each other, and that was useful as I learn especially from watching others. Everyone has their own style and it gets perfected over time but I was hugely impressed by how engaging people were and how quickly they could read about a subject that was new to them (we looked at the RStudio presentation slides https://education.rstudio.com/teach/materials/) and then go on to express clearly what they’d just learned.

    I could go on about what I learned at the course, but the proof of its value is in what I did with it. And so on 17th January, Chris and I held a workshop for 15 people from various organisations; some had travelled as far as London and Portsmouth, such is the commitment to learn R. Chris led the workshop and I did the live coding/support role which took the edge off that ‘first time’ feeling.

    This idea of having at least two people in a workshop is a good one, even when the trainer is very experienced. Chris, for example, is very used to running workshops alone, but inevitably people get stuck or things don’t work as they should and so I did the helping while Chris moved between training and coding. It felt, to me at least, like the burden was shared. It helped to ensure that no-one in the group was allowed to fall behind so far that they just gave up.

    Chris and I had gone through the slides beforehand as he’d not gone on the TtT course, and having not written them himself, I wanted to make sure he’d know what was covered. What reassured me was that, as he presented the course, there wasn’t a part of it that I didn’t understand myself and couldn’t cover if I had to take over at that moment. And so the day wasn’t as nerve-racking as I anticipated, and I had fun – which must have been noticeable to the group, as I had an email commenting on how we are clearly a happy team!

    Whilst I haven’t actually run a workshop, I think the process I’ve gone through to get to this point has certainly built up my confidence to do it. I’ve taken every opportunity NHS-R community has offered, from doing the introduction to presenting at the conference, and so this next thing – to run the training myself – hasn’t been so scary after all. I feel like a happy and well-supported apprentice of NHS-R, and the great thing about NHS-R in this is that everyone can be an apprentice too – you just have to get involved.

    Being open-source, all the slides for the trainer course and the introduction course are available on GitHub:

    Train the Trainer course materials can be found at:


    Course materials for the underlying Intro to R course are found at:


    Zoë Turner, Senior Information Analyst, Nottinghamshire Healthcare NHS Foundation Trust.

    @Letxuga007 and curator of @DataScienceNotts (formerly known as @AppliedInfoNotts)

  • Forecasting R

By Bahman Rostami-Tabar

    Democratising forecasting project

The initiative, sponsored by the International Institute of Forecasters, provides cutting-edge training in the use of forecasting with R in developing countries. There is no doubt that many people in developing countries cannot afford the high fees to attend forecasting workshops. The project aims to make such training accessible to those people, provide up-to-date training on the principles of forecasting, and create a network of forecasters to conduct research on forecasting with social impact for less developed countries. The training has been delivered in Tunisia, Iraq, Senegal, Uganda, Nigeria, Turkey and Indonesia so far.

We have expanded the “Democratising Forecasting” initiative to deliver four forecasting workshops per year for the National Health Service (NHS) in the UK, in collaboration with the NHS-R community. We give our time for free, despite the debate over whether such organisations could afford to pay for this training.

Why forecasting for the NHS?

Ensuring life-saving services are delivered to those who need them requires far more than money, infrastructure and scientific progress. Accurate modelling and forecasting systems can assist critical decisions that drive policy making, funding, research and development in the NHS. Decision makers are making decisions every day, with or without forecasts, but those decisions are more robust and better informed in the light of what could happen in the future, and that is where forecasting becomes crucial. However, implementing forecasting principles to support the decision-making process requires significant technical expertise. To that end, we aim to organise a two-day workshop on forecasting to teach participants how to apply the principles of accurate forecasting using real healthcare data.

We are offering four workshops in 2020, as follows:

    • Monday 24th & Tuesday 25th February 2020 – Nottinghamshire – Workshop fully booked
    • Monday 22nd June 2020 – Huddersfield – Online workshop is fully booked
    • Thursday 13th & Friday 14th August 2020 – Gloucestershire
    • Wednesday 13th & Thursday 14th October 2020 – West Midlands
    • Thursday 12th & Friday 13th November 2020 – Cardiff

    Workshop instructor

Dr. Bahman Rostami-Tabar is the main instructor for the workshops. Colleagues from other universities in the UK might join him occasionally. Bahman is a Senior Lecturer (Associate Professor) in management science at Cardiff Business School, Cardiff University, UK. He is interested in the use and implications of forecasting in areas of social good such as health and humanitarian operations. He has been involved in various forecasting-related projects with NHS Wales, the Welsh Ambulance Service Trust and the International Committee of the Red Cross. He is currently working on projects that focus on developing innovative forecasting methodologies and their links to decision making in health and humanitarian operations.

    Who should attend

This workshop is for you if you are:

    • A decision maker who wants to use forecasting tools and techniques in R to empower decision making;
    • A data analyst who wants to gain an in-depth understanding of the forecasting process;
    • A forecaster who wants to learn how to use R for forecasting purposes.

    What participants will learn in the workshop

Assuming basic knowledge of statistics, participants will be able to do the following tasks when using forecasting to support the decision-making process in the real world:

    • Determine what to forecast according to a forecasting process framework;
    • Prepare and manipulate data using functions in base R and the tidyverse and lubridate packages;
    • Identify systematic patterns using the time series toolbox in ggplot2 and forecasting-related packages such as the forecast package;
    • Produce point forecasts and prediction intervals using functions in forecasting-related packages such as forecast and fable, and user-defined functions;
    • Determine the accuracy of forecasting models using statistical accuracy measures for point forecasts and prediction intervals;
    • Visualise, export and report results for interpretation and insights using R Markdown.


    Prerequisites

    • Basic knowledge of statistics is assumed, e.g. “I know what a normal distribution is”;
    • Basic knowledge of R is assumed, e.g. “I know what data types and data structures are”, “I know how to use a function”;
    • No knowledge of forecasting is assumed.


    Start: 9:30 a.m.

    End: 4:30 p.m.

    Refreshment breaks:

    • Morning: 11:00 – 11:20 a.m.
    • Afternoon: 3:00 – 3:20 p.m.

    Lunch: 12:30 p.m. – 1:30 p.m.


    Title: Essentials to do forecasting using R

    Date: Two weeks before the workshop (TBC)

    Day 1

    1.1. Forecasting and decision making: What is forecasting? How is the forecasting task different from other modelling tasks? What is the link between forecasting and decision making, and how do we identify what to forecast?

    1.2. Data preparation and manipulation: How do we prepare data for the forecasting task? How do we clean data, and how do we manipulate it to extract time series?

    1.3. Time series patterns and decomposition: What in the data can be used for the forecasting task? How do we detect systematic patterns in the data, and how do we separate out non-systematic patterns?

    1.4. Forecaster’s toolbox: How do we use time series graphics to identify patterns? What is a forecasting benchmark? Which simple forecasting methods can be used as benchmarks? How do we generate point forecasts and prediction intervals using simple forecasting methods?
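    As a hedged base-R illustration of the benchmark idea (the forecast package's naive() function wraps this more robustly), the "last observed value" method with a crude, widening prediction interval might look like:

```r
# Naive benchmark: forecast every future period as the last observed value.
# AirPassengers is a built-in monthly series, used purely for illustration.
y <- as.numeric(AirPassengers)
h <- 12                                          # forecast horizon

point_fc <- rep(tail(y, 1), h)                   # point forecasts
sigma    <- sd(diff(y))                          # rough one-step error estimate
lower    <- point_fc - 1.96 * sigma * sqrt(1:h)  # 95% interval widens with horizon
upper    <- point_fc + 1.96 * sigma * sqrt(1:h)
```

    The square-root widening mirrors how uncertainty accumulates over a random-walk horizon.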

    1.5. Forecast accuracy evaluation: How do we know whether a forecasting method captures the systematic patterns in the data? How do we judge whether a forecast is accurate? How do we evaluate the accuracy of point forecasts and prediction intervals? Why do we need to distinguish between fitting and forecasting?
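    A minimal sketch of out-of-sample evaluation in base R (the forecast package's accuracy() computes these measures and more), using a naive benchmark on a held-out test set:

```r
y     <- as.numeric(AirPassengers)   # built-in series, for illustration
train <- head(y, -12)                # hold out the last 12 observations
test  <- tail(y, 12)

fc   <- rep(tail(train, 1), 12)      # naive point forecasts from the training data
mae  <- mean(abs(test - fc))         # mean absolute error
rmse <- sqrt(mean((test - fc)^2))    # root mean squared error
```

    Comparing accuracy on the test set, rather than the fitted values, is what separates forecasting performance from mere fitting.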

    1.6. Exponential smoothing models: What is the exponential smoothing family, and which models does it include? What do these models capture? How do we generate point forecasts and prediction intervals using exponential smoothing models?
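    Base R's HoltWinters() fits one member of the exponential smoothing family; a hedged sketch of fitting and forecasting with prediction intervals (the forecast package's ets() automates selection across the whole family):

```r
fit <- HoltWinters(AirPassengers)   # additive trend and seasonality by default
fc  <- predict(fit, n.ahead = 12,
               prediction.interval = TRUE, level = 0.95)

# fc is a matrix with columns "fit" (point forecast), "upr" and "lwr"
```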

    Day 2

    2.1. ARIMA models: ARIMA is another important family of forecasting models. What is the ARIMA framework, and which models does it include? What do these models capture? How do we generate point forecasts and prediction intervals using ARIMA models?
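    A hedged base-R sketch using stats::arima() with an illustrative seasonal specification (the forecast package's auto.arima() chooses the orders automatically):

```r
fit <- arima(AirPassengers,
             order    = c(1, 1, 1),                        # (p, d, q), chosen for illustration
             seasonal = list(order = c(0, 1, 1), period = 12))
pr  <- predict(fit, n.ahead = 12)

point <- pr$pred                     # point forecasts
lower <- pr$pred - 1.96 * pr$se      # approximate 95% prediction interval
upper <- pr$pred + 1.96 * pr$se
```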

    2.2. Regression: We also look at causal techniques that consider external variables. What is the difference between regression, exponential smoothing and ARIMA? How do we build a simple regression model?

    2.3. Special events: In addition to using systematic patterns in time series, we discuss how to include deterministic future special events, such as holidays and festive days, in models.

    2.4. Forecasting by aggregation: How do we forecast when we have a high-frequency time series, e.g. daily, but need a low-frequency forecast, e.g. monthly? How do we generate forecasts for items with a hierarchical or grouped time series structure?
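    The temporal-aggregation step itself is straightforward in base R; a sketch with simulated daily counts rolled up to monthly totals (hierarchical reconciliation needs dedicated packages such as fable or hts):

```r
set.seed(1)
dates <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")
daily <- data.frame(date = dates, count = rpois(length(dates), 10))

# Roll the daily series up to monthly totals before forecasting at that level
monthly <- tapply(daily$count, format(daily$date, "%Y-%m"), sum)
```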

    2.5. Forecasting by combination: How do we use an ensemble of forecasting approaches? Under what conditions are ensemble forecasts better than individual methods?
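    A sketch of the simplest possible combination: averaging two benchmark forecasts, which often outperforms either one alone when their errors are not perfectly correlated:

```r
y <- as.numeric(AirPassengers)        # built-in series, for illustration

fc_naive <- rep(tail(y, 1), 12)       # last-value benchmark
fc_mean  <- rep(mean(tail(y, 24)), 12)  # mean of the last two years

fc_combo <- (fc_naive + fc_mean) / 2  # equal-weight ensemble
```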

    2.6. Forecasting for many time series: What is the best approach to forecasting many time series? How do we classify items for forecasting, how do we forecast them, and how do we report the accuracy?


    Workshop Booklet:

    • Materials will be provided for the workshop in RMarkdown.


    Bahman Rostami-Tabar

    Associate Professor in Management Science, Cardiff University, UK


  • NHS-R Community Conference II

    My journey to work takes me about an hour and a half, and I catch a couple of buses with Wi-Fi which means I can browse Twitter and invariably end up with hundreds of tabs open as I flit between articles and blogs. Most mornings I find it hard to concentrate on reading through entire articles, especially the really long ones, so I leave the tab open on my computer, often for days, before reading them. Given my experience of reading blogs, why would anyone want to read through mine about the NHS-R Community conference?

    If I’d gone to the conference I’d probably skim that paragraph thinking ‘yes, I went, I know how good it was’.

    If I’d not gone to the conference I’d probably skim that paragraph because I might prefer not to know just how great a conference was when I’d missed it!

    Even though the conference was moved to a bigger location to accommodate more people and around 250 people attended, I have still spoken to people who didn’t get a ticket or missed submitting an abstract to speak. People who never made the conference are talking about an event that is only in its 2nd year. What is going on? What is it that has made the event so successful?

    Organising an event of any size takes a lot of work and that is often overlooked. There were the core people who did the real work – the arrangements – and quite frankly, they made it look easy, which itself is an indication of how hard they worked. But there were others who were part of a committee that chipped in with bits they could help with: setting up a specific email; reading through abstracts; suggesting things the organisers might consider, like how to ensure diversity of questioners (https://oxfamblogs.org/fp2p/how-to-stop-men-asking-all-the-questions-in-seminars-its-really-easy/).

    That organising committee was made up from a group who have shown a particular interest in R, and as such I found myself part of that group. Now although I have submitted a few blogs to NHS-R, I only really started using R a couple of years ago. Deep down I’m still a SQL analyst and my contributions to the conference were pretty minimal, but I feel encouraged to make those small contributions (even that last one about who gets to ask the first question in seminars) and each small involvement builds up to a bigger thing. This really is feeling like an equal and inclusive group and that’s where I think this success is coming from.

    It may have been by design or it may be a happy accident but there is a crucial clue in the name of this group that gives away its success – Community. This conference wasn’t managed top-down. There are some key people, of course, but they are as much a part of this Community as the people who contribute to the blogs, those that stood up on stage and showed their work, and those that will be learning to run the R Introduction training. This is our NHS-R Community.

    If you missed this year’s conference and want to go to the next one, get involved. The more people involved, the less work there is for everyone individually. Plus, given that tickets this year ran out in just 2 hours, you’ll be more likely to secure yourself a ticket.

    Speaking of which, provisional dates for the next conference are the 2nd and 3rd November 2020 (Birmingham). Now aren’t you glad you read this blog!

    Zoë Turner, Senior Information Analyst @AppliedInfoNott @Letxuga007

  • Welcome to Leeds

    R Users Leeds

    Why create a group?

    Trying to learn R is tricky, there are so many ways to do things that I often ended up Googling everything and still not feeling like I was doing anything the right way.

    Then someone sent me a link to the NHS-R Community website where I found lots of interesting and helpful information. Plus there were R groups, but not one in Leeds. I was surprised by this given the density of health analysts and other R users in and around Leeds. There should definitely be a group in Leeds.

    How to create a group?

    So I ran my idea past a colleague to check if I was crazy. They agreed to help anyway and offered to email some contacts at the University of Leeds. I approached the Sheffield-R group and the NHS-R community (who also thought it was a good idea). We also connected with NHS Digital and NHS England and Improvement.

    Through this process we were very lucky to find some clever individuals who had already thought about starting up a group in Leeds. This meant the group already had a website (R Users Leeds), Twitter, email and logo. This gave us a great starting point with a planning team covering Leeds Teaching Hospitals NHS Trust, NHS Digital, University of Leeds, Leeds Beckett University and Sky Betting and Gaming. We are very lucky in Leeds to have access to both local and national NHS organisations, academic institutions and digital and technology companies.

    What do we want from a group?

    From the beginning it was important to think about the aims and ethos of the group which for Leeds are:

    • Create an open and inclusive community
    • Meetings should be free
    • Organise meetings within and outside office hours to ensure everyone has the opportunity to participate.
    • Include the NHS, academia and other industries using R

    Where will the group meet?

    Another consideration was how to get venues especially at low or no cost. After making contacts I was assured that I could get things for free, I wasn’t too sure but thought that I should give it a go. This is going to be an ongoing task but I have been surprised by how many organisations want to help. Following a first planning meeting we managed to organise speakers and get a date and venue for the first meeting:

    5th December 2019, NHS Digital, Leeds

    • Reproducible wRiting with RMarkdown. Ed Berry, Senior Data Scientist, Sky Betting & Gaming
    • Efficient WoRkflows getting more done with R. Robin Lovelace, University Academic Fellow, University of Leeds
    • Putting the R into Reproducible Research. Anna Krystalli, Research Software Engineer, University of Sheffield

    R we nearly there yet?

    Our first event is almost upon us so I suppose we have nearly reached the first destination. With tickets booking up quickly it is clear to see that there is a need for a group in Leeds to bring the analytical community together. Thank you to everyone who has supported us so far; hopefully this is just the beginning of the journey.

    If you would like to speak at R Users Leeds please get in touch.

    You can keep in touch with us on Twitter, GitHub, Gitter and email.

    R Users Leeds

    This blog was written by Louise Hick, Real World Data Analyst, the Leeds Teaching Hospitals NHS Trust
  • NHS Number Validation

    NHS Numbers – the backbone of the NHS

    NHS Numbers are the best. They are numbers that follow you through life, from birth to death, changing only with certain situations and are easy to check for validity from just the numbers themselves.

    What? You can check to see if a number is a genuine NHS Number without any other information?!

    Now, I’ve worked in the NHS for more years than I care to count and I never realised there was an algorithm to these numbers. I stumbled across this fact through Twitter of all places. Yes, Twitter, the place you’ll find NHS-R enthusiasts tweeting things like this:

    https://twitter.com/sellorm/status/1171858506057682944 which I saw retweeted by @HighlandR (John MacKintosh).

    I’ve only just started doing things on GitHub, and @Sellorm’s guess that there may be millions of packages wasn’t very close: in fact there are only 2 for R and, surprisingly, not a single one in SQL.

    I installed the package, had a quick play around and looked at the code on GitHub. The downside of this package, for me, was that you need to feed in your data using vectors, and I like using data tables. Plus, I didn’t necessarily want a list of outputs like TRUE, TRUE, FALSE, TRUE but rather to see the NHS numbers that aren’t valid. Still, I wouldn’t have got so far, so quickly, without @Sellorm’s code, excellent notes and thoroughly written Readme file, and so the moral of the story is: even if a package doesn’t work for your task or problem, you may still be able to use parts of the code.

    The Readme on https://github.com/sellorm/nhsnumber, which you can see when you scroll down past all of the file-directory-looking things, included a link to the Wikipedia page on NHS Numbers and its algorithm check: https://en.wikipedia.org/wiki/NHS_number. Again, I had no idea this existed!

    Wikipedia, like R, is open source and one of the links was out of date. I’d found a really useful document on NHS Numbers a while back https://www.closer.ac.uk/wp-content/uploads/CLOSER-NHS-ID-Resource-Report-Apr2018.pdf, so, in the spirit of open source, I updated the Wikipedia page. Proud moment!

    The algorithm is quite simple and another package on GitHub https://github.com/samellisq/nhsnumbergenerator generates numbers using it but I didn’t spend too long on these packages as I decided to do my own script, nothing fancy, no loops, functions or packages…

    library(dplyr) # provides the pipe (%>%), mutate() and case_when()

    Validity <- df %>% 
      mutate(length = nchar(NHSNumber),
             A = as.numeric(substring(NHSNumber,1,1)) *10,
             B = as.numeric(substring(NHSNumber,2,2)) *9,
             C = as.numeric(substring(NHSNumber,3,3)) *8,
             D = as.numeric(substring(NHSNumber,4,4)) *7,
             E = as.numeric(substring(NHSNumber,5,5)) *6,
             G = as.numeric(substring(NHSNumber,6,6)) *5,
             H = as.numeric(substring(NHSNumber,7,7)) *4,
             I = as.numeric(substring(NHSNumber,8,8)) *3,
             J = as.numeric(substring(NHSNumber,9,9)) *2,
             End = as.numeric(substring(NHSNumber,10,10)),
             Total = A+B+C+D+E+G+H+I+J,
             Remainder = Total %% 11,
             Digit = 11- Remainder,
             Summary = case_when(Digit == 10 ~ 999,
                                 Digit == 11 ~ 0,
                                 TRUE ~ Digit),
             Valid = case_when(Summary == End & length == 10 ~ TRUE,
                               TRUE ~ FALSE
                               )) %>% 
      filter(Valid == FALSE)

    Importing data

    Data is imported into R via Excel or, in my case, I used a SQL connection which worked better as a couple of our systems hold many hundreds of thousands of NHS Numbers and I wanted to check them all. Excel might have got a bit of indigestion from that. Also, it meant I wasn’t storing NHS Numbers anywhere to then run through R. My systems are secure but it’s always worrying having such large amounts of sensitive data in one place and outside of an even securer system like a clinical database.

    Data doesn’t always import into R the same way and for mine I needed to remove NULLs and make the NHSNumber column, which was a factor, into a character and then numeric format:

    df <- data %>% 
      filter(!is.na(NHSNumber)) %>% 
      mutate(NHSNumber = as.numeric(as.character(NHSNumber)))


    Factors are a new thing for me but, as I understand it, they put data into groups in the background. For example, if you had male, female and other, these would be 3 factor levels, and if you join that data to another data set which doesn’t happen to have “other” as a category, the factor level would still linger around, appearing in a count as 0 because the table still holds the factor information even though there is no data in that group.
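    That lingering behaviour is easy to demonstrate with a quick sketch, using the example categories above:

```r
# A factor keeps all of its defined levels, even levels with no data
sex <- factor(c("male", "female", "male"),
              levels = c("male", "female", "other"))

table(sex)   # "other" still appears, with a count of 0
```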

    To manipulate factor data I’ve seen others change the format to character and so that’s what I did. I’ve done similar things in SQL with integers and dates; sometimes you have to change the data to an ‘intermediate’ format.

    Validity code – explanation

    This code uses dplyr from tidyverse to add columns:

    Validity <- df %>% 
      mutate(length = nchar(NHSNumber), 

    An NHS number must be 10 characters and nchar() reminds me of LEN() in SQL which is what I would use to check the length of data.

    One thing I didn’t code, but I guess people may want to check, is that all characters are numeric and no letters have been introduced erroneously. That’s something to consider.
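    One hedged way to add that missing check is a regular expression that only passes strings of exactly ten digits (the column would need to be in character format first; the example strings here are made up):

```r
candidates <- c("4505577104", "450557710A", "123")   # hypothetical examples

# TRUE only when the string is exactly ten digits: no letters, spaces or
# wrong lengths slip through
looks_ok <- grepl("^[0-9]{10}$", candidates)
```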

         A = as.numeric(substring(NHSNumber,1,1)) *10,
         B = as.numeric(substring(NHSNumber,2,2)) *9,
         C = as.numeric(substring(NHSNumber,3,3)) *8,
         D = as.numeric(substring(NHSNumber,4,4)) *7,
         E = as.numeric(substring(NHSNumber,5,5)) *6,

    I didn’t use F as it’s short for FALSE which changed the meaning and colour of the letter. It can be used as a column name but I missed it out for aesthetics of the script!

         G = as.numeric(substring(NHSNumber,6,6)) *5,
         H = as.numeric(substring(NHSNumber,7,7)) *4,
         I = as.numeric(substring(NHSNumber,8,8)) *3,
         J = as.numeric(substring(NHSNumber,9,9)) *2,

    There is possibly a much smarter way of writing this, perhaps a loop. Samuel Ellis’ package nhsnumbergenerator creates a table using sequences:

    checkdigit_weights = data.frame(digit.position = seq(from = 1, to = 9, by = 1),
                                    weight = seq(from = 10, to = 2, by = -1))
    note he uses = rather than <-

    I like this code. It’s very concise, but it just was easier writing out the multiplications for each part of the NHS Number; the first number is multiplied by 10, the second by 9, the third by 8 and so on.
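    For completeness, a hedged vectorised sketch of the same checksum arithmetic, writing the weights 10 down to 2 as a sequence rather than as columns A to J (the ten-digit string here is made up for illustration, though it does happen to pass the checksum):

```r
nhs    <- "4505577104"                  # hypothetical ten-digit string
digits <- as.numeric(strsplit(nhs, "")[[1]])

total <- sum(digits[1:9] * 10:2)        # weighted sum of the first nine digits
check <- 11 - (total %% 11)             # methodology: 11 becomes 0; 10 means invalid
check <- ifelse(check == 11, 0, check)

valid <- nchar(nhs) == 10 && check != 10 && check == digits[10]
```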

         End = as.numeric(substring(NHSNumber,10,10)),

    I put this in because the final number in the sequence is what gets checked against the later Digit (also called the checksum).

         Total = A+B+C+D+E+G+H+I+J,

    Just adding all the multiplied columns together.

         Remainder = Total %% 11,

    This gives the remainder from a division by 11.

         Digit = 11- Remainder,

    11 take away the remainder/checksum.

         Summary = case_when(Digit == 10 ~ 999,
                             Digit == 11 ~ 0,
                             TRUE ~ Digit),

    Lots of people use if(else) in R but I like case_when because it’s like SQL. The lines run in logical order:

    • when the Digit = 10 it’s invalid, so I put in 999 as that’s so high (in the hundreds) it can never match the End digit/checksum (which is a single unit);
    • if it’s 11 then change that to 0 following the methodology,
    • else just use the digit number.

    Actually, thinking about it I probably don’t need the 10 becomes 999 as 10 could never match a single unit number. Perhaps that’s redundant code.

         Valid = case_when(Summary == End & length == 10 ~ TRUE,
                           TRUE ~ FALSE
                           )) %>% 

    Case_when again but this time to get the TRUE/FALSE validity. If the number generated is the same as the last digit of the NHS Number AND the length of the NHS Number is 10 then TRUE, else FALSE.

    I liked this line of code as it was a bit strange saying TRUE then FALSE, but it’s logical!

      filter(Valid == FALSE)

    Just bring back what isn’t valid.

    Did it work?

    I think it did. If I got anything wrong I’d love to get feedback, but I ran many hundreds of thousands of NHS Numbers through it and found…..

    No invalid numbers

    I possibly should have checked beforehand but I suspect our clinical systems don’t allow any incorrect NHS Numbers to be entered in at source. Still, it was fun and could be applied to manually entered data from data kept in spreadsheets for example.

    An interesting blog

    As there were no SQL scripts on GitHub for NHS Number validation, I did a quick search on the internet and found this: https://healthanalyst.wordpress.com/2011/08/21/nhs-number-validation/ – a blog by @HealthAnalystUK from 2011. The reason I’m referring to it here is that HealthAnalystUK not only shared SQL code that looked very similar to the code above but also R code, used as a function:

    NHSvalidation <- function(NHSnumber){
    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))
    if (
    ((Modulus==J) & (UniformNumberCheck!=1) & (NHSlength==10))|((Modulus==11) & (J==0) & (UniformNumberCheck!=1) & (NHSlength==10)))

    I hadn’t coded the check for repeating numbers:

    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))

    and I couldn’t find any reference to this in the Wiki page or the document from the University of Bristol so I’m unsure if this is a part of the methodology. If it is, then I’ve seen at least 1 NHS Number that would fail this test.

    A conclusion

    If anyone has created a package or script for NHS number checks and wants to share please feel free to write a blog. NHS-R Community also has a GitHub repository at https://github.com/nhs-r-community where code like this blog can go (I wrote this in Rmarkdown).

    Blogs can be emailed to nhs.rcommunity@nhs.net and get checked by a group of enthusiast volunteers for publishing.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • NHS-R Community datasets package released

    This post briefly introduces an R package created for the NHS-R Community to help us learn and teach R.

    Firstly, it is now available on CRAN, the major package repository for R, and can be installed like any other package, or directly from GitHub as follows:
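    A hedged sketch of those installation commands, using the package name from this post and the nhs-r-community GitHub repository mentioned below (the GitHub route assumes the remotes package):

```r
# From CRAN:
install.packages("NHSRdatasets")

# Or the development version from GitHub:
# install.packages("remotes")
remotes::install_github("nhs-r-community/NHSRdatasets")
```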



    Several community members have mentioned the difficulty of learning and teaching R using standard teaching datasets. Stock datasets like iris, mtcars and nycflights13 are all useful, but they are out of context for most NHS, Public Health and related staff. The purpose of this package is to provide example datasets related to health, or reusing real-world health data. They can be accessed easily and used to communicate examples, learn R in context, or teach R skills in a familiar setting.

    For those of us wanting to contribute to Open Source software, or practise using Git and GitHub, it also provides an opportunity to learn/practise these skills by contributing data.

    What’s in it?

    At present we have two data sets:

    • LOS_model – contributed by Chris Mainey – simulated hospital data, originally for learning regression modelling
    • ae_attendances – contributed by Tom Jemmett – NHS England’s published A&E attendances, breaches and admission data

    Both datasets have help files that explain the source, data columns, and give examples of use. The package also contains vignettes that show the datasets being put to use. You can find them in the help files, or using:

    # list available vignettes in the package:
    vignette(package = "NHSRdatasets")

    # Load the `ae_attendances` vignette:
    vignette("ae_attendances", package = "NHSRdatasets")

    How can you contribute?

    We are aiming to build a repository of different health-related datasets. We are particularly keen to include primary care, community, mental health, learning disabilities, or public health data.

    Please head to the GitHub page, and follow the submission guidelines on the README page. If you would like to add a vignette or alter an existing one, follow the same instructions: fork the repository, commit and push your changes, then make a ‘pull request.’ We will then review your submission before merging it into the package.

    Please ensure that any data you submit is in the public domain, and that it meets all obligations under GDPR and NHS/Public Health information governance rules.

    To see the package code, or contribute, visit https://github.com/nhs-r-community/NHSRdatasets .


    The NHSRdatasets package is a free, collaborative datasets package that gives context for the NHS-R community when learning or teaching R. The package is available on CRAN. We actively invite you to submit more datasets and help us build this package.

    To read the reference material, and vignettes, you can find them in the package help files or go to the pkgdown website: https://nhs-r-community.github.io/NHSRdatasets/

    Chris Mainey
    Intelligence Analyst

    Health Informatics – University Hospitals Birmingham NHS Foundation Trust

  • How R changed me as an analyst

    I suspect there are many blogs about R and why it’s so great:

    • It’s free!
    • It’s open source!
    • There’s a great community!
    • It’s reproducible!

    You can certainly read on Twitter (#rstats) about what R can do for you, but what about what R does to you, particularly as an analyst in the NHS or social care?

    The learning bit…

    Back in April 2018 when NHS-R Community was running its first introduction to R course in Leeds, my only knowledge of it had come from a free online base R course with edX that I hadn’t really understood and never finished. Online learning for me is like being back at school. I get the right answer, feel good, move on and promptly forget it all. After the NHS-R course I dabbled with the dplyr package, tried to run lots of other people’s scripts and generally failed a lot. It was a frustrating time of slow delivery of tasks and bafflement. When things did work, I had no idea why and I often wondered what all the fuss was about particularly as I could do the same things in familiar programmes.

    Hindsight is a wonderful thing and I can now see my frustrations weren’t just one lump of confusion, but could be split down into how I used these ‘familiar programmes’, namely:

    • SQL for data engineering and
    • Excel for visualisations.

    Although I still used (and use) SQL to get my data, I was copying it to Excel and then loading it into R; once loaded I’d then realise I needed to group by and count or remove something I didn’t need and it seemed too long-winded going back to SQL, copying to Excel and then loading it.

    The second frustration of visualising in R came with the inability to replicate the look of the Excel charts in R: getting the same colours, the same font size headers and so on. I’ve yet to resolve that completely but it was here that I realised the worth of R wasn’t in making it look like Excel, but rather that it could do so much more than Excel. I needed to start thinking about what I should be visualising and how to do it.  

    Sharing methodology

    Over my time in the NHS I have learned to be cautious – protective even – of data. But that has led to the misguided fear of sharing technical knowledge which was never a conscious thing, just that’s how it is. However, R has a reputation of sharing which has resulted in an expectation of sharing. And that isn’t just within your own team or organisation, it’s so much wider – it can even be the world.

    As an example of why it’s harder to share Excel methodology, I’ve previously built a benchmarking Excel spreadsheet using MATCH and INDEX so that the bar charts automatically coloured the organisation I worked for and ordered the bars in graphs from the greatest to the smallest. It was one of those tasks that took a lot of effort to automate, looked simple when done but was heavily dependent on the data being in just the right cell or it would break.

    Just updating it with a new year’s data would take great care so the thought of writing out the methodology to share never occurred to me. Writing it out would involve describing the positions of data, what the formulae did and how bits all linked. That’s a laborious task which is not necessary if you don’t plan to share – and as there was no need to share, I didn’t.

    It’s all about the data

    The entire process, from SQL to Excel, is about the data, e.g. how it joins and what it’s counting. To get the data ‘just so’, it often requires so many clever solutions to so many problems that, as I now realise, it consumes so much thinking time that there’s often little energy left for considering why I am doing this – is it the best thing to do and how can I get more meaning from the data?

    If I’d picked up someone else’s script or Excel document on ordering benchmarking data, perhaps the time I spend would be on improving it instead of building it. In a perfect world, I would feed back on what I’d done and share any improvements or corrections.

    But what can R do that SQL and Excel can’t?

    As a very simple example, consider creating a median average.

    In Excel it’s reasonably easy to use the formula MEDIAN() but to make it dynamic (such that it automatically updates if you add more data), the formula becomes much more complicated. Here is a page explaining how it’s done:


    Lots of graphics are used to describe how to do it, and you’ll note this is for AVERAGE, which is the mean rather than the median.

    In SQL, creating the MEDIAN can be solved various ways:


    There are 204 examples to go through to solve this! I didn’t go through them as that’s too much needless thinking when R can do this…
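    The snippet elided here is presumably nothing more than R’s built-in function; a minimal example:

```r
# One call, dynamic by construction: add data and rerun
median(c(1, 3, 7, 100))   # → 5
```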


    Something this simple removes all the thinking otherwise needed to figure out the best way to get the median… and then having to check that it is right. Although that may be easy enough in Excel, I know I make mistakes and will have to repeat the exercise more times than I care to admit, and doing so in Excel will involve so many steps that each require checking. All of this uses up that precious resource of focused thinking. With R doing the job so quickly and reliably I now have time to consider if median is actually the right thing to use, particularly if it’s a case of “we’ve always done that”. Then I can ponder what it is telling me; is it increasing/decreasing and why is that? Is there any other data that could help explain this change?

    Like any great piece of analysis or coding, time invested at the beginning pays off. With the benchmarking example of reordered bar charts, spending a couple of days getting it right made something interactive that was engaging for the users. But Excel continues to require thinking and time checking whereas R doesn’t; once it’s coded, that’s it. And that’s where the analyst should be, that space after the work is done. That’s where you can think about the very analysis itself; was it enough? What does it tell you? What else can you look at? Although I’ve had analyst in my job title for a long time, this is what analysis is all about and it’s something, I now realise, that I’ve not been doing because my “thinking time” has been used up elsewhere.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • Dygraphs

    Dygraphs for mortality surveillance

    I recently presented some of the mortality surveillance charts we use to @RLadiesLondon (a very welcoming group!) and one that got some interest was a chart of Nottinghamshire Healthcare NHS Foundation Trust deaths compared to ONS Provisionally Registered deaths. The chart looks good because it’s interactive but this type of chart can be confusing because of the 2 y axes.

    When I show this report I make it clear that the two axes units are very different and that its purpose is to show that the pattern of the deaths in the wider population matches that of the deaths recorded by the Trust. It’s well known within Public Health that the pattern of deaths is seasonal, with a peak around January in the UK. However, this Public Health knowledge is not necessarily common knowledge in Secondary Care Trusts and it was one of the great benefits of having @IantheBee both create and present this report.

    Getting ONS Provisional Data

    I wrote about getting and formatting the spreadsheets from ONS for the East Midlands Provisionally Registered deaths:

    but for the purposes of the mortality surveillance report I’ve used several years’ data, and I normally keep the spreadsheets, save the merged data and then load that each time I need to run the RMarkdown report.

    For the purposes of this blog I’ve amended the formatting code by adding functions so each year can be transformed and is available to plot:


    # Download ONS spreadsheets function --------------------------------------

    GetData_function <- function(www, file){
      download.file(url = www,
                    destfile = file,
                    method = "wininet", # use "curl" for OS X / Linux, "wininet" for Windows
                    mode = "wb")        # wb means "write binary"
    }




    # 2016
    GetData_function("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2016/publishedweek522016.xls",
                     "DeathsDownload2016.xls")

# Import correct sheets ---------------------------------------------------
Deaths_Now <- read_excel("DeathsDownload2019.xls",
                         sheet = 4,
                         skip = 2)
Deaths_2018 <- read_excel("DeathsDownload2018.xls",
                          sheet = 4,
                          skip = 2)
Deaths_2017 <- read_excel("DeathsDownload2017.xls",
                          sheet = 4,
                          skip = 2)
Deaths_2016 <- read_excel("DeathsDownload2016.xls",
                          sheet = 4,
                          skip = 2)

# Look up code to remove excess rows --------------------------------------
LookupList <- c("Week ended",
                "Total deaths, all ages",
                "Total deaths: average of corresponding",
                "E12000004")

# Function for formatting data --------------------------------------------
Transform_function <- function(dataframe){

  #Format data frames
  df <- dataframe %>%
    clean_names() %>%
    remove_empty(c("rows", "cols")) %>%
    filter(week_number %in% LookupList)

  #Transpose table
  df <- t(df)

  #Whilst this is a matrix make the top row the header
  colnames(df) <- df[1, ]

  #Make this object a data.frame
  df <- as.data.frame(df)

  #Function to find 'not in'
  '%!ni%' <- function(x, y) !('%in%'(x, y))

  #List of things to remove to tidy the data.frame
  remove <- c("E12000004", "East Midlands")

  #Remove the rows and ensure dates are in the correct format
  df <- df %>%
    filter(E12000004 %!ni% remove) %>%
    mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))

  df$`Week ended` <- as.Date(df$`Week ended`, format = '%Y-%m-%d')

  df <- df %>%
    mutate(date = if_else(is.na(`Week ended`), serialdate, `Week ended`))

  #Final transformation of data
  df %>%
    select(`Total deaths, all ages`, date) %>%
    filter(!is.na(`Total deaths, all ages`)) %>%
    mutate(`Total deaths, all ages` = as.numeric(as.character(`Total deaths, all ages`))) #To match other data.frames
}

    # Run functions -----------------------------------------------------------
    DeathsNow <- Transform_function(Deaths_Now)
    Deaths2018 <- Transform_function(Deaths_2018)
    Deaths2017 <- Transform_function(Deaths_2017)
    Deaths2016 <- Transform_function(Deaths_2016)

    # Merge data -----------------------------------------------------
Deaths <- bind_rows(DeathsNow,
                    Deaths2018,
                    Deaths2017,
                    Deaths2016) %>%
  mutate(date = as.Date(date),
         `Total deaths, all ages` = as.numeric(`Total deaths, all ages`))

This code may give a few warnings saying that NAs have been introduced by coercion, which is because many cells in the spreadsheets have no data in them at all. It’s a good thing they are empty (effectively NAs), as 0s could confuse the analysis: it wouldn’t be clear whether a 0 is a real 0 or missing data.
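The warning is easy to reproduce in miniature; coercing anything non-numeric (including the empty strings that blank cells become) yields NA:

```r
#Blank spreadsheet cells arrive as empty strings; coercing them to
#numeric gives NA with the "NAs introduced by coercion" warning
as.numeric(c("150", "", "Week ended"))
#> [1] 150  NA  NA
```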

To suppress warnings in R Markdown add warning=FALSE to the chunk header; however, I like to keep the warnings just in case.
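As a sketch (assuming the report is knitted with knitr), warnings can be switched off per chunk with warning=FALSE in the chunk header, or once for the whole document:

```r
#Set once, near the top of the R Markdown file, to apply to every chunk
knitr::opts_chunk$set(warning = FALSE)
```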

    If you want to keep all the data after merging it together use:

    library(openxlsx) # To write to xls if required. 

    #Export complete list to excel for future
    write.xlsx(Deaths, 'ImportProvisionalDeaths.xlsx')

    If you’ve saved the combined file, to call it again in a script use the following code:

library(readxl)
Deaths <- read_excel("ImportProvisionalDeaths.xlsx")

    Dygraph chart

    The following data is randomly generated as an example:


#Fix the randomly generated numbers
set.seed(178)

    Alldeaths <- Deaths %>%
    select(date) %>%
    mutate(n = rnorm(n(), mean=150))

    Merge the two data frames together:

ONSAlldeaths <- Deaths %>%
  left_join(Alldeaths, by = "date") %>%
  mutate(ds = as.POSIXct(date)) %>%
  select(ds, y2 = n, y = `Total deaths, all ages`)

    Dygraphs require dates to be in a time series format and the package xts can convert it:

    ONSAlldeaths_series <- xts(ONSAlldeaths, order.by = ONSAlldeaths$ds) 

The date column is no longer needed so it can be removed, but this needs to be done using base R rather than dplyr:

#Remove duplicate date column
ONSAlldeaths_series <- ONSAlldeaths_series[, -1]

    Finally, the xts can be plotted:

dygraph(ONSAlldeaths_series, main = "East Midlands Weekly Deaths/Randomly generated numbers") %>%
  dySeries("y", axis = "y", label = "East Midlands") %>%
  dySeries("y2", axis = "y2", label = "Random numbers") %>%
  dyAxis("x", drawGrid = FALSE) %>%
  dyAxis("y", drawGrid = FALSE, label = "East Midlands") %>%
  dyAxis("y2", independentTicks = TRUE, drawGrid = FALSE, label = "Random numbers") %>%
  dyOptions(stepPlot = TRUE) %>%
  dyRangeSelector()

    Note this chart is not interactive in this blog.

When you’ve plotted the chart, hovering the cursor over the points shows information about them; you can also zoom in using the scrolling bar at the bottom of the chart (this comes from the dyRangeSelector() code).

    Other options are detailed here: https://rstudio.github.io/dygraphs/index.html

    ONS Provisional data

One of the things that may stand out in the chart are the big dips around 29-31 December each year. We presume these relate to weeks where registrations from GP practices to ONS may be delayed because of the public holidays around Christmas.

Unfortunately, only provisionally recorded deaths are available by week; confirmed deaths are published monthly: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/monthlyfiguresondeathsregisteredbyareaofusualresidence


    The case of the mysteriously disappearing interactive graph in R Markdown html output

    I’d rendered (knit or run) the html reports with the interactive graphs and it had all worked so I emailed the report to people as promised and then the emails came back: “Some of the graphs are missing, can you resend?”. Perplexed, I opened the saved file from the server and, yes, indeed some of the charts had disappeared! Where there should be lovely interactive charts were vast swathes of blank screen. What had happened? The code ran fine, looked fine and how do you even google this mystery?

It turns out my default browser is Microsoft Internet Explorer, and I suspect it is throughout most of the NHS because lots of NHS systems depend on it. Whilst I have the latest version, these reports have never opened properly in Internet Explorer.

The solution: Chrome (or another browser). I ask people to open the report from the email, copy the link from the Internet Explorer address bar and simply paste it into Chrome.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust. 

    @AppliedInfoNott and @Letxuga007

  • Count of working days function

It’s at this time of year I need to renew my season ticket and I usually get one for the year. Out of interest, I wanted to find out how much the ticket cost per day, taking into account I don’t use it on weekends or my paid holidays. I started my workings out initially in Excel but got as far as typing the formula =NETWORKDAYS() before I realised it was going to take some working out and perhaps I should give it a go in R as a function…

@ChrisBeeley had recently shown me functions in R and I was surprised how familiar they were, as I’d seen them on Stack Overflow (usually skimmed over) and they are similar to functions in SQL, which I’ve used (though not written), where you feed in parameters. When I write code I try to work out how each part works and build it up, but writing a function requires running the whole thing and then checking the result: the objects created inside the function never materialise, so they are never available to check. Not having objects building up in the Environment pane is one of the benefits of using a function; that, and not repeating scripts which then ALL need updating if something changes.

    Bus ticket function

This is the final function which, if you run it, you’ll see just creates a function.

#Week starts on Sunday (1)
DailyBusFare_function <- function(StartDate, EmployHoliday, Cost, wfh){

  startDate <- dmy(StartDate)
  endDate <- as.Date(startDate) %m+% months(12)

  #Now build a sequence between the dates:
  myDates <- seq(from = startDate, to = endDate, by = "days")

  working_days <- sum(wday(myDates) > 1 & wday(myDates) < 7) -
    length(holidayLONDON(year = lubridate::year(startDate))) - EmployHoliday - wfh

  per_day <- Cost/working_days

  per_day
}

    Running the function you feed in parameters which don’t create their own objects:


    [1] 2.707965

    Going through each line:

To make sure each part within the function works, it’s best to write it in another script or move the bit between the curly brackets {} out.

Firstly, the startDate is self-explanatory, but within the function I’ve set the endDate to be dependent upon the startDate, automatically one year later.

Originally, when I was trying to find the “year after” a date, I found some documentation about lubridate’s function dyears:

    #Next couple of lines needed to run the endDate line!
    startDate <- dmy("11/07/2019")
    endDate <- startDate + dyears(1)

but this adds an exact duration after a given date and doesn’t take leap years into account. I only noticed because 2020 is a leap year, so the end date I got was a day out!

    Instead, @ChrisBeeley suggested the following:

    endDate <- as.Date(startDate) %m+% months(12)
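To illustrate the difference (a small sketch using lubridate), the duration-based and period-based approaches land a day apart across the 2020 leap year:

```r
library(lubridate)

startDate <- dmy("11/07/2019")

#dyears(1) adds a fixed duration, so the 2020 leap day pushes the result back a day
startDate + dyears(1)

#%m+% months(12) moves by calendar months and lands on the same day next year
as.Date(startDate) %m+% months(12)
#> [1] "2020-07-11"
```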

    Next, the code builds a sequence of days. I got this idea of building up the days from the blogs on getting days between two dates but it has also come in use when plotting over time in things like SPCs when some of the time periods are not in the dataset but would make sense appearing as 0 count.
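As an aside, that trick looks something like this (a hypothetical sketch with made-up counts): joining real data onto a full sequence of days makes the missing days appear as 0 rather than being silently absent:

```r
library(dplyr)

#Hypothetical daily counts with gaps
counts <- data.frame(date = as.Date(c("2019-07-11", "2019-07-14")),
                     n = c(2, 5))

#Full sequence of days over the same period
all_days <- data.frame(date = seq(from = as.Date("2019-07-11"),
                                  to = as.Date("2019-07-15"),
                                  by = "days"))

#Left join then replace the NAs with 0 so every day appears in a plot
all_days %>%
  left_join(counts, by = "date") %>%
  mutate(n = coalesce(n, 0))
```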


    #To run so that the sequencing works
    #using as.Date() returns incorrect date formats 0011-07-20 so use dmy from
    #lubridate to transform the date
      startDate <- dmy("11/07/2019")
      endDate <- as.Date(startDate) %m+% months(12)
    #Now build a sequence between the dates:
      myDates <- seq(from = startDate, to = endDate, by = "days")

All of these return values, so trying to open them from the Global Environment won’t do anything. It is possible to view the first parts of the values by typing:

#compactly displays the structure of an object, including the format (date in this case)
str(myDates)

Date[1:367], format: "2019-07-11" "2019-07-12" "2019-07-13" "2019-07-14" "2019-07-15" ...

#gives a summary of the structure
summary(myDates)

        Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
"2019-07-11" "2019-10-10" "2020-01-10" "2020-01-10" "2020-04-10" "2020-07-11"

    To find out what a function does type ?str or ?summary in the console. The help file will then appear in the bottom right Help screen.

    Next I worked out working_days. I got the idea from a blog which said to use length and which:

    working_days <- length(which((wday(myDates)>1&wday(myDates)<7)))

    Note that the value appears as 262L which is a count of a logical vector. Typing ?logical into the Console gives this in the Help:

    Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer._

I was familiar with “length”, which essentially counts factors or vectors (type ?length into the Console for information), but “which” wasn’t something I knew about. @ChrisBeeley explained what which does with the following example:

#Generate a list of random logical values
a <- sample(c(TRUE, FALSE), 10, replace = TRUE)

#Look at the list
a

#using which against the list gives the positions in the list where the logic = TRUE
which(a)

[1] 4 5 8

#length counts how many logical arguments are in the list (should be 10)
length(a)

[1] 10

#length(which()) counts the number of TRUE logical arguments
length(which(a))

[1] 3

Then @ChrisBeeley suggested just using “sum” instead of length(which()), as sum counts a logical vector directly:

sum(a)

[1] 3

It seems this has been discussed on Stack Overflow before: https://stackoverflow.com/questions/2190756/how-to-count-true-values-in-a-logical-vector

It’s worth noting that sum will return NA if the vector contains any NAs, so the example on Stack Overflow suggests adding:

    sum(a, na.rm = TRUE)

    [1] 3

    This won’t affect the objects created in this blog as there were no NAs in them but it’s just something that could cause a problem if used in a different context.

    working_days <- sum(wday(myDates)>1&wday(myDates)<7)
    #Just to check adding na.rm = TRUE gives the same result
      working_days <- sum(wday(myDates)>1&wday(myDates)<7, na.rm = TRUE)

    I then wanted to take into account bank/public holidays as I wouldn’t use the ticket on those days so I used the function holidayLONDON from the package timeDate:

    length(holidayLONDON(year = lubridate::year(startDate)))

    [1] 8

    I used lubridate::year because the package timeDate uses a parameter called year so the code would read year = year(startDate) which is confusing to me let alone the function!
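The :: prefix is worth knowing generally; it pins a function to a package, which is useful whenever two loaded packages (or, as here, a function and a parameter) share a name:

```r
#Explicitly use lubridate's year() rather than anything else called year
lubridate::year(as.Date("2019-07-11"))
#> [1] 2019
```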

    Again, I counted the vectors using length. This is a crude way of getting bank/public holidays as it is looking at a calendar year and not a period (July to July in this case).
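One way round the calendar-year crudeness (a sketch, still using timeDate’s holidayLONDON) is to fetch the holidays for both calendar years the ticket spans and then keep only those inside the ticket period:

```r
library(lubridate)
library(timeDate)

startDate <- dmy("11/07/2019")
endDate <- as.Date(startDate) %m+% months(12)

#Bank/public holidays for both calendar years the ticket touches
hols <- as.Date(holidayLONDON(year = c(lubridate::year(startDate),
                                       lubridate::year(endDate))))

#Count only the holidays that fall within the ticket period itself
sum(hols >= startDate & hols <= endDate)
```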

    I did look at a package called bizdays but whilst that seemed to be good for building a calendar I couldn’t work out how to make it work so I just stuck with the timeDate package. I think as I get more confident in R it might be something I could look into the actual code for because all packages are open source and available to view through CRAN or GitHub.

    Back to the function…

    I then added “- EmployHoliday” so I could reduce the days by my paid holidays and “- wfh” to take into account days I’ve worked from home and therefore not travelled into work.

The next part of the code takes the entered Cost and divides it by working_days, printing the output to the screen:

per_day <- Cost/working_days

per_day

And so the answer to the cost per day is printed in the Console:

[1] 2.707965

    A conclusion… of sorts

    Whilst this isn’t really related to the NHS it’s been useful to go through the process of producing a function to solve a problem and then to explain it, line by line, for the benefit of others.

    I’d recommend doing this to further your knowledge of R at whatever level you are and particularly if you are just learning or consider yourself a novice as sometimes blogs don’t always detail the reasons why things were done (or why they were not done because it all went wrong!).

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @AppliedInfoNott and @Letxuga007

    The R Markdown file is available here:

  • rforhealthcare.org – A free online resource for all your R related healthcare needs

Down in the South West of England we at the PenCHORD team of the NIHR CLAHRC South West Peninsula have been busy developing some online treats for all of you interested in using the statistical programming language R. The www.rforhealthcare.org website is a free resource full of R guides and tutorials designed to get you started with R and RStudio. But we do not stop at simply introducing you to the R language, oh no, we have begun to produce tutorials on more complex topics such as linear regression, performing t-tests, statistical process control and principal component analysis.

    The rationale behind developing this site has been to create an easily accessible resource for those working in and around the healthcare sector to support you in using R on a day-to-day basis as an alternative to Excel and SQL. The training topics have been selected so they are relevant to the types of tasks carried out by healthcare analysts while explaining things in such a way that you do not need a PhD in computer science to understand what is going on!

    The website has been designed to be simple and user friendly. If you are completely new to R there is an introductory presentation that you can download on the homepage. This describes what R and RStudio are, how to install them and some basic information about how to go about using R.

    Once you have R and RStudio installed then it is time to start learning about R functionality. There are currently three main sections to the tutorials: Basic functionality, Plotting and Statistics. The ‘Basic functionality’ section covers everything from introducing data types and data structures to transforming and manipulating data through to writing your own functions. The plotting section introduces you to creating all of the basic types of graphs that you need from histograms to box plots. The statistics section then introduces some useful statistical techniques such as t-tests and linear regression.

A good way to get started using R in the healthcare environment is to take a simple but repetitive task that you currently do in Excel or SQL and translate it into a script that can be used time and time again. This might be something such as data cleaning and transformation or producing a series of graphs and tables. The list of tutorials will keep growing as we find out which topics you would like to learn to implement in R. So please do get in touch through the website or twitter (@PenCLAHRC) and let us know if there is a tutorial you would like us to write for you.

    Over the last 18 months we have also created a large amount of training materials on a variety of topics for use on our innovative Health Service Modelling Associates programme (www.health-modelling.org/). One example is the www.pythonhealthcare.org website which is a free resource for using the programming language python for healthcare related applications. There are topics to help you go from a complete beginner to advanced level programmer in no time! Over the coming months we will be making all of our training materials available for free online in the areas of: introducing simulation modelling in healthcare, problem structuring for simulation modelling, system dynamics modelling, discrete event simulation, network analysis and machine learning, so watch this space!

    This blog was written by Dr Sean Manzi, Research Fellow at NIHR CLAHRC South West Peninsula.

  • Format ONS spreadsheet


    A Public Health consultant colleague Ian Bowns (@IantheBee) created a report to monitor mortality within the Trust and he used the ONS weekly provisional data for the East Midlands to compare the pattern and trends of deaths over time. This involves downloading a file from:


which is updated weekly. Once a month I manually add numbers from this to another spreadsheet, which is then imported into R for the overall analysis.

    Downloaded file formats

You may be familiar with the format of ONS and other NHS data spreadsheets, but if you are not, here are some of the issues:

    • data is presented in wide form, not long form (so columns are rows and vice versa)
    • the sheets are formatted for look rather than function with numerous blank rows and blank columns
• there are multiple sheets, with information about methodology usually coming first. This means a default upload to programmes like R is not possible, as they pick up the first sheet/tab
    • the file name changes each week and includes the date which means any code to pick up a file needs to be changed accordingly for each load
• being Excel, when this is imported into R there can be problems with the date formats, which can get lost to the Excel serial number and
    • they include a lot of information and often only need a fraction of it

Given these difficulties there is great temptation, as happened with this, to just copy and paste what you need. This isn’t ideal for these reasons:

    • it increases the likelihood of data entry input error
    • it takes time and
    • it is just so very very tedious

The solution, always, is to automate. Tools like Power Pivot in Excel or SSIS could work, but as the final report is in R it makes sense to tackle this formatting in R, and this is the result…

    Import file

For this you can either save the file manually or use the following code within R. Save it to the same place where the code is running and you should see the files in the bottom right window under the tab ‘Files’. The best way to do this is to use a project and open the script within that project.

download.file(www, #object holding the URL of the weekly file, which changes each release
              destfile = "DeathsDownload.xls",
              method = "wininet", #use "curl" for OS X / Linux, "wininet" for Windows
              mode = "wb") #wb means "write binary"

Note that this file’s name and URL change each week, so the code needs changing each time it is run.

Once the file is saved, use readxl to import it, which means the file doesn’t need its format changing from the original .xls.

    When I upload this file I get warnings which are related, I think, to the Excel serial numbers appearing where dates are expected.

• sheet = : refers to the sheet I want. I think this has to be numeric and doesn’t use the tab’s title.
• skip = : is the number of top rows to ignore.

DeathsImport <- read_excel("DeathsDownload.xls",
                           sheet = 4,
                           skip = 2)
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in C5 / R5C3: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in F5 / R5C6: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in G5 / R5C7: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in H5 / R5C8: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in I5 / R5C9: got a date
    ## New names:
    ## * `` -> `..2`

    Formatting the data

The next code creates a list that is used later in a similar way to SQL’s IN, but without typing the list out within the code, for example:

    • SQL : WHERE city IN (‘Paris’,‘London’,‘Hull’)
    • R : filter(week_number %in% LookupList)

    These lines of code are base R code and so don’t rely on any packages.

LookupList <- c("Week ended",
                "Total deaths, all ages",
                "Total deaths: average of corresponding",
                "E12000004")
The next bit uses the dplyr package, which is loaded as part of the tidyverse, as well as the janitor package. Not all packages are compatible with the tidyverse, but many are, as it is often the go-to data manipulation toolkit.

    As an aside the %>% is called a pipe and the shortcut is Shift + Ctrl + m. Worth learning as you’ll be typing a lot more if you type out those pipes each time.

    Janitor commands

    • Clean names: removes spaces in column headers and replaces with _
    • remove_empty: gets rid of rows and columns – this dataset has a lot of those!

    Dplyr command

    • filter: is looking just for the rows with the words from the list ‘LookupList’. These will become the column names later.
    DeathsImport2 <- DeathsImport %>% 
      clean_names %>% 
      remove_empty(c("rows","cols")) %>% 
      filter(week_number %in% LookupList) 

    There are great commands called gather and spread which can be used to move wide form data to long and vice versa but with this I noticed that I just needed to turn it on its side so I used t() which is also useful as it turns the data frame to a matrix. You can see this by looking in the ‘Environment’ window in the top right of R Studio; there is no blue circle with an arrow next to t_DeathsImport.

    t_DeathsImport <- t(DeathsImport2)

    Being a matrix is useful as the next line of code makes the first row into column headers and this only works on a matrix.

    colnames(t_DeathsImport) <- t_DeathsImport[1, ]

Dplyr gives an error on matrices:

    t_DeathsImport %>%
      mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))

    Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "c('matrix', 'character')"

    As later code will need dplyr turn the matrix into a dataframe using some base R code:

    t_DeathsImport <- as.data.frame(t_DeathsImport)

    Previous dplyr code filtered on an %in% bit of code and it’s natural to want a %not in% but it doesn’t exist! However, cleverer minds have worked out a function:

    '%!ni%' <- function(x,y)!('%in%'(x,y))

The text between the quotation marks can be anything, but I like ‘%!ni%’ as it’s reminiscent of Monty Python.

Because of the moving around of rows to columns, the data frame now has an unnecessary row of column names, as well as a row with just ‘East Midlands’ in one of the columns, so the following ‘remove’ list is a workaround to get rid of these two lines.

    remove <- c("E12000004", "East Midlands")

The next code uses the above list, followed by a mutate, which is followed by the janitor command ‘excel_numeric_to_date’. The function name tells you exactly what it does but, as often happens, the data needs to be changed to character and then to numeric first. The date_system = “modern” isn’t needed for this data, but I took this from the internet and it worked, so I left it.
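In isolation the conversion looks like this (the serial number below is just an illustrative value):

```r
library(janitor)

#43656 is an illustrative Excel serial number (days counted from 1900)
excel_numeric_to_date(43656, date_system = "modern")
```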

A warning will appear about NAs (nulls).

    t_DeathsImport <- t_DeathsImport %>% 
      filter(E12000004 %!ni% remove) %>% 
      mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))
    ## Warning in excel_numeric_to_date(as.numeric(as.character(`Week ended`)), :
    ## NAs introduced by coercion

    Now to deal with this mixing of real dates with Excel serial numbers.

    Firstly, the following code uses base R to confirm real dates are real dates which conveniently wipes the serial numbers and makes them NAs.

    t_DeathsImport$`Week ended` <- as.Date(t_DeathsImport$`Week ended`, format = '%Y-%m-%d')

    This results in two columns:

    • Week ended which starts off with NAs then becomes real dates and
    • serialdate which starts off with real dates and then NAs.

    The human eye and brain can see that these two follow on from each other and just, somehow, need to be squished together and the code to do it is as follows:

    t_DeathsImport <- t_DeathsImport %>% 
      mutate(date = if_else(is.na(`Week ended`),serialdate,`Week ended`))

    To translate the mutate, this creates a new column called date which, if the Week ended is null then takes the serial date, otherwise it takes the Week ended.

Interestingly, if ‘ifelse’ without the underscore is used it converts the dates to integers, and these are not the same as the Excel serial numbers, so use ‘if_else’!
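A minimal illustration of the difference (assuming dplyr is loaded):

```r
library(dplyr)

d <- as.Date("2019-07-11")

#Base ifelse() strips the Date class and returns the underlying day count
ifelse(TRUE, d, as.Date(NA))

#dplyr's if_else() is type-stable, so the result is still a Date
if_else(TRUE, d, as.Date(NA))
#> [1] "2019-07-11"
```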

    And that’s it.

    Or is it?

    You might want to spit out the data frame back into excel and that’s where a different package called openxlsx can help. As with many things with R, “other packages are available”.

library(openxlsx)

write.xlsx(DeathsImport, 'ImportProvisionalDeaths.xlsx')

    If you haven’t used a project (which is really the best way to work) this will probably save in some obscure C: drive that you’ll see in the bottom left ‘Console’ just under the tab names for ‘Console’ and ‘Terminal’. Using projects means you set the pathway and that will mean the file saves in the same place and will also appear in the bottom right panel under ‘Files’.


    I’m pretty early on in my journey in R and many of my colleagues still haven’t started yet so I’m throwing this out there so everyone can see it, newbies and old hands alike. If you spot anything, can explain anything further, need more explanation or can offer any alternatives to what I’ve done please please feel free to comment.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

  • Using R to track NHS winter pressures

    Every Thursday during winter, roughly from December to March, NHS Digital releases a week’s worth of hospital performance data, known as the Daily Situation Reports. This data often receives media attention because cold weather and seasonal illnesses can lead to higher demand for hospital care, meaning that services might be under increased pressure. At the Health Foundation, one of our aims is to provide new insight into the quality of health care through in-depth analysis of policy and data. So, to understand more about the current demand for hospital care and the challenges the NHS is facing, we keep a close eye on the latest seasonal trends.

    Keeping on top of NHS winter indicators has the potential to keep us analysts busy. The raw data is published in a fairly complex spreadsheet, requires a decent amount of cleaning and needs to be reanalysed after every release. In addition to monitoring national figures, this winter our team at the Health Foundation also wanted to see if there might be any variation between different areas of England. Sustainability and transformation partnerships (STPs) are areas where health and care leaders develop shared proposals for local services. Therefore, we enriched the raw data with information about where hospitals are located, and which STP they belong to. But with a similar analytical approach, more fine-grained local structures (such as Clinical Commissioning Groups) could be used.

For a more efficient and reproducible way of tracking NHS winter indicators this year, we moved our whole analysis pipeline to R. We then used the clean data for visualisations in R and other applications, like Tableau. This blog takes you through our analysis workflow and describes how we got through some tricky steps. The complete R script is also available on GitHub if you want to give it a go yourself. You can also read a blog on the Health Foundation’s website to find out why we looked at local areas this year and what we found.

    Why write this blog on data analysis?

    Analysts at many other local and national organisations are interested in NHS winter performance data. In order for this blog to be a good resource for them, we plan to:

    • share our own analytical approach and R code
    • illustrate how and why we made certain analytical decisions and
    • discuss what we learned along the way, both about the data and R.

    We hope that this blog will inspire others to do the same, and to share their code too. Here, we won’t try to interpret any regional differences in winter performance. We know that there are several factors involved, so we will leave this up to others with more local knowledge.

    What does this blog cover?

      1. R packages we used
      2. Data download from NHS Digital
      3. Data import from R-unfriendly spreadsheets: how we tackled multi-line headers with merged cells
      4. Data cleaning: how we defined and dealt with missing data
      5. Feature engineering: how we added organisational information on sustainability and transformation partnerships (STPs)
      6. Aggregation: monthly averages within local STPs
      7. Visualisation: how we mapped STP-level data in R
      8. What have we learned?

    1. R packages we used

The tidyverse collection of R packages is a useful set of tools for data analysis and visualisation that are designed to work together. It contains the ggplot2 package, which we use for visualisations. (Need help getting started with R and the tidyverse? Try the R for Data Science website).

    We use a function from the readxl package to import Excel files and the lubridate package, which makes working with dates a lot easier. Both are part of the tidyverse, but not loaded automatically along with the tidyverse package.

    The geojsonio package is one option that helps us import geographic data structures, such as STP boundaries. The broom and maptools packages are then needed to prepare these for visualisation with ggplot.
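Putting that together, the setup chunk presumably looks something like this (package names as described above):

```r
library(tidyverse)  #data manipulation and ggplot2 visualisation
library(readxl)     #importing Excel files
library(lubridate)  #working with dates
library(geojsonio)  #importing geographic data structures (STP boundaries)
library(broom)      #tidying spatial objects for ggplot
library(maptools)   #spatial helpers needed alongside broom
```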


    Note: if you would like to run this code in your own R session and you haven’t installed these packages already, you will need to do so (once) using the install.packages() function.
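
    The loading step itself is short; a sketch of what it might look like (package availability may differ in your setup):

```r
# Packages used in this analysis (install each once with
# install.packages() if you don't have it already).
library(tidyverse)   # includes ggplot2 for visualisation
library(readxl)      # importing Excel files
library(lubridate)   # working with dates
library(geojsonio)   # importing geographic data (STP boundaries)
library(broom)       # tidy() to prepare spatial objects for ggplot2
library(maptools)    # spatial data helpers
```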

    2.  Data download from NHS Digital

    After setting up the tools, we downloaded Winter Daily SitRep 2018–19 Data from the NHS Digital website. Rather than opting for the spreadsheets that contain one week of data each, we went for the more convenient time series file, which contains all data from this winter up to the latest release. One drawback is that the name of this file changes weekly as new data is added (so, should you try to use this code at a later date, you will probably have to adapt the file name).
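
    As a sketch of the download step (the URL below is a placeholder, not the real link, which changes with each weekly release):

```r
# Hypothetical sketch: download the time series spreadsheet.
# The URL is elided here - check the NHS Digital page for the
# current link before running this.
url <- "https://.../Winter-data-timeseries.xlsx"                  # placeholder
download.file(url, destfile = "winter_sitrep.xlsx", mode = "wb")  # "wb" for Excel files
```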


    3. Data import from R-unfriendly spreadsheets

    How we tackled multi-line headers with merged cells

    Once we opened the file in Excel, we saw a number of sheets containing different indicators. With a few exceptions, most have a similar table structure. As we were interested in patient flow through hospitals, we focused on ‘General and acute beds’ and ‘Beds occupied by long-stay patients’ for now.

    What the sheets with these indicators had in common was that there was metadata in the first 13 lines followed by a two-line header. Several columns containing variables (second header line) were grouped within dates (first header line) and the cells around the dates were merged. There were also some empty columns and rows, which we addressed later on.

    Example of the Excel layout of Winter SitRep data to be imported into R. Source: NHS Digital.

    Unfortunately, this was not yet a tidy table, as the dates that served as column headers were values, not variable names. All this made importing the file into R slightly more challenging. We fixed this by creating a custom import_sitrep() function that would:

    1. read and store the data and the two lines of the header separately,
    2. combine the two lines of the header,
    3. add the combined header to the data and tidy up column names by removing special characters,
    4. and finally convert the table into a tidy table ready for the next step.

    But wait, there was one more tricky bit. What happened when we tried to read in the headers one by one?


    *Sigh*… they ended up not having the same length. The first header line (containing the dates) was 8 elements shorter. Looking at its left and right side (see above) gave us a hint as to why this might have happened:

    • In the Excel sheet, the first few cells in this line were empty and, when the line was read in, they were converted to NA. The read_xlsx() function then discarded these empty columns, probably because they were at the beginning.
    • There were also some merged cells. During import they were separated and, if empty, converted to NA. Empty columns at the end of the header line also seem to be discarded by read_xlsx().

    So, we needed to find a way to preserve the length of the first header line in our import_sitrep() function. This is how we solved it:
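
    The original import_sitrep() code isn't reproduced here, but one way to think about the fix can be sketched in base R: forward-fill the dates lost to merged cells so the first header line matches the length of the second, then paste the two lines into one set of column names (the values below are made up):

```r
# Sketch: combine a two-line header where the first line (dates)
# has gaps left by merged cells.
fill_forward <- function(x) {
  # replace each NA with the last non-NA value before it
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}
header_dates <- c(NA, "2018-12-01", NA, "2018-12-02", NA)
header_vars  <- c("name", "core.beds.open", "escalation.beds.open",
                  "core.beds.open", "escalation.beds.open")
header_dates <- fill_forward(header_dates)
col_names <- ifelse(is.na(header_dates), header_vars,
                    paste(header_dates, header_vars, sep = "_"))
```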


    Now that we had our import function, we were ready to read and combine the sheets containing the winter indicators ‘General and acute beds’ and ‘Beds occupied by long-stay patients’.


    The data was now tidy, as each variable formed a column and each observation formed a row. Note that the observations for England in the table were followed by values for individual hospital trusts.

    4.  Data cleaning

    Trust exclusions

    We excluded the three children’s hospitals when calculating aggregate measures, such as the average bed occupancy within STPs. Our reasoning was that their patient profiles would be different from other acute trusts and this might skew the averages. Nevertheless, we kept track of them at a trust level.

    This applies to Birmingham Women’s and Children’s NHS Foundation Trust (code RQ3), Alder Hey Children’s NHS Foundation Trust (RBS) and Sheffield Children’s NHS Foundation Trust (RCU).
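
    A minimal sketch of the exclusion step (the trust codes come from the text above; the data are made up):

```r
# Exclude the three children's hospitals from aggregate calculations,
# but keep the full table for trust-level tracking.
childrens_trusts <- c("RQ3", "RBS", "RCU")
trusts <- data.frame(code = c("RQ3", "RBS", "RCU", "RTH"),
                     beds = c(200, 180, 150, 1000))
trusts_for_aggregates <- subset(trusts, !code %in% childrens_trusts)
```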


    How we approached defining ‘missingness’: when is a zero not a zero?

    Data collection and validation errors do happen, so finding suspicious data points is an important step during data cleaning.

    While this is easy if a value is missing (or NA), it’s much harder to decide whether a zero truly represents ‘zero events’ or a missing value (in fact, it could even be both within the same data set). To distinguish between the two, at the Health Foundation we came up with the following criteria:

    • How likely is a ‘zero event’ for an indicator? For example, when counting beds in a large hospital the likelihood of having zero open beds seems small, but when counting long-stay patients having none seems possible.
    • How consistent is the zero value, in that trust, over time? Or in plain English: does the value jump from a higher number to zero (and back), or does it hover somewhere close to zero?

    The next two sections describe how we found and dealt with these missing values.

    Finding longer periods of missing data

    If any hospital trust had missing values, in any indicator, on 4 or more consecutive days during the reporting period, it was excluded from the analysis. We were only looking for these periods in variables where we would not expect any zeros (the list is shown as cols_to_check).

    Why this particular cut-off? We wanted to aggregate the data and calculating weekly averages did not seem justified if we were missing more than half of a week for any trust.

    Here’s how we summed up how many consecutive days were zero or NA within each trust/variable combination:
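
    The original code isn't shown here, but the core idea can be sketched with base R's rle() on a made-up series for one trust/variable combination:

```r
# Find the longest stretch of consecutive days that are zero or NA.
x <- c(5, 4, 0, 0, 0, 0, NA, 6, 7)      # daily values for one variable
gap <- is.na(x) | x %in% 0              # TRUE where missing or zero
runs <- rle(gap)                        # lengths of consecutive TRUE/FALSE stretches
longest_gap <- max(c(0, runs$lengths[runs$values]))
longest_gap                             # 5 -> this trust would be excluded (>= 4 days)
```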



    When we filtered for 4 or more consecutive days, we found that:

    • The trust with the code RTD reported zero long-stay patients (of any length of stay) for the whole reporting period to date, which seemed unrealistic for a general and acute hospital.
    • Trust RQW had a gap of 7–8 days, that coincided for the indicators shown (we checked this separately in the raw data).
    • Trust RAX reported zero long-stay patients (of any length of stay) for 6 days during January, but reported a high number before and after.

    Based on this, all variables from the trusts RTD, RQW and RAX were excluded from the analysis of this year’s (2018/19) winter data. This left us with 128 out of 134 trusts.

    It’s worth noting that with this data-driven approach different trusts might be excluded each year and the number of excluded trusts could change over the winter period as new ‘gaps’ appear. Keep this in mind when making comparisons, both throughout the winter period and between years.


    Dealing with shorter data gaps

    Next, we checked how many missing or zero values were left:


    Most of the remaining gaps (42 out of 54) consisted of only a single day and they were mostly found in variables relating to long-stay patients. To judge whether these looked like real ‘zero events’ or were more likely to be reporting errors, we had a closer look at the data:


    Before cleaning: plots showing the subset of trusts reporting ‘0’ in any (non-derived) variable.

    Based on the data before and after the zeroes, these were unlikely to be true values. It would have been possible to impute these gaps in some way, for example by taking the mean of the day before and the day after. Instead, we took the approach that required the fewest assumptions and simply replaced the gaps with NA:
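
    With made-up numbers, the replacement step amounts to:

```r
# A single-day zero between plausible counts is treated as missing,
# not imputed.
long_stay <- c(31, 29, 0, 33, 30)
long_stay[long_stay == 0] <- NA
```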


    After cleaning: plots showing the same trusts after replacing zeros with NA.

    Validating derived variables

    Some variables present in the data set were derived from others: for example, total.beds.open should be the sum of core.beds.open and escalation.beds.open.

    We could have discarded the derived variables straight away and computed them ourselves, to be completely sure about how they had been derived and what they meant. Since we were curious about their quality, we first checked whether total.beds.open and occupancy.rate had been calculated as expected, so we could decide whether or not to replace them (spoiler: yes, we did).
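
    The check itself reduces to comparing each derived column with its rederived version; a base-R sketch with made-up values:

```r
# Verify that the derived total matches the sum of its components.
core.beds.open <- c(500, 480, 510)
escalation.beds.open <- c(20, 35, 25)
total.beds.open <- c(520, 515, 535)
all(total.beds.open == core.beds.open + escalation.beds.open)  # TRUE if derived as expected
```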


    Similarly, if it had been the focus of our analysis, we would also have rederived national aggregates for England.

    5.  Feature engineering

    Adding organisational information on sustainability and transformation partnerships (STPs)

    As we wanted to compare indicators between local areas, we decided to calculate averages of winter indicators over the hospital trusts within each STP. To do this, we needed to add variables to the raw data indicating which hospital trust belonged to which STP. Unfortunately, as far as we were aware, this information was not available in a format convenient for data analysis.

    So, we manually created and validated a lookup table to map hospital trusts to STPs, using information related to the 2017/18 formation of 44 STPs from NHS England. While some STPs have since changed (for example, three STPs in the north of England merged in August 2018), this was the latest and most comprehensive information available, as far as we are aware.

    The allocation of most hospital trusts to STPs was straightforward using this list, but there were a few instances where we had to choose:

    • If a trust did not appear in any STP list, it was matched according to the location of its main acute site. This was the case for four trusts in Northumberland, Tyne and Wear and North Durham STP.
    • If a trust was mentioned in more than one STP plan, it was allocated according to the location of its main acute site. This applied to both Chesterfield Royal Hospital NHS Foundation Trust and Epsom And St Helier University Hospitals NHS Trust.

    We think this is a reasonable approach when analysing winter indicators, which mainly come from acute trusts, but we would be keen to hear your feedback.

    Once we had this lookup table, we imported it into R and merged it with the winter indicators:
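
    With made-up codes, the merge can be sketched using base R's merge() (the original analysis may well have used a dplyr join instead):

```r
# Attach the STP that each trust belongs to.
winter <- data.frame(trust.code = c("RTH", "RXF"), occupancy = c(0.93, 0.88))
lookup <- data.frame(trust.code = c("RTH", "RXF"), stp.code = c("QF7", "QE1"))
winter <- merge(winter, lookup, by = "trust.code", all.x = TRUE)  # keep all trusts
```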


    The lookup table is also available on Github. Please note that STPs change and develop over time, so if you would like to use it, it’s worth checking that the information is up to date.

    Making read-outs comparable between trusts: from raw counts to rates

    As hospital trusts come in different sizes and serve different numbers of patients, raw patient counts are not suitable for comparisons between trusts or regions. Percentages or fractions, such as bed occupancy rates, are more comparable.

    Therefore, we derived the fraction of occupied beds that were occupied by long-stay patients of over 7, 14 or 21 days:
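
    With made-up counts (and illustrative variable names), the derivation is a simple element-wise division:

```r
# Fraction of occupied beds taken up by patients staying > 7 days.
total.beds.occd <- c(800, 900)
beds.occ.7.days <- c(240, 300)
frac.beds.occ.7.days <- beds.occ.7.days / total.beds.occd
```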


    6.  Aggregation: monthly averages by STP

    In some cases it might be useful to visualise daily changes, but weekly or even monthly aggregates have the advantage of being less noisy and free of weekday–weekend variation, and they can be more useful for monitoring longer-term trends.

    First, we created a new column containing the corresponding month. The month was then used as a grouping variable, along with the trust or STP code, to calculate monthly averages of bed occupancy and long-stay patient rates.

    For weekly averages, an alternative would have been to create a new column containing the date of the respective beginning of the week using the cut() function (also shown below).

    We know it’s also good practice to keep track of the number of valid observations (as in, not NA) that we average over within each group used. In this instance, for trust-level aggregates, this represented the number of days within a week. For STP-level aggregates, it corresponded to the number of trusts within the STP.
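
    A base-R sketch of the grouping step (the original likely used lubridate and dplyr; the data here are made up):

```r
# Derive month and week grouping variables from daily dates,
# then average within groups and count valid observations.
dates     <- as.Date("2018-12-28") + 0:9          # spans December and January
occupancy <- c(0.91, 0.92, NA, 0.93, 0.95, 0.94, 0.96, 0.93, 0.92, 0.94)
month     <- format(dates, "%Y-%m")
week      <- cut(dates, breaks = "week")          # labels are each week's first day
monthly_mean <- tapply(occupancy, month, mean, na.rm = TRUE)
n_valid      <- tapply(!is.na(occupancy), month, sum)   # observations behind each mean
```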


    At this point, we also saved the tidy and aggregated data as a CSV file for visualisation in other programs, such as Tableau.

    7.  Visualisation: how we mapped STP-level data in R

    There are endless ways to creatively visualise aspects of this data (the R Graph Gallery is a great place to get inspired). We wanted to plot a map of STP boundaries and colour-shade them according to the average bed occupancy in each winter month.

    STP boundaries are available as a GeoJSON file from the Office for National Statistics (ONS). We downloaded and imported the digital vector file and then created a plot to get a first idea of what was in the file:
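
    A hypothetical sketch of the import (the file name is an assumption):

```r
# Import the ONS boundary file as a spatial object and plot it
# for a first look.
library(geojsonio)
stp_shapes <- geojson_read("stp_boundaries_2017.geojson", what = "sp")
plot(stp_shapes)   # quick base-graphics view of the boundaries
```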


    Boundaries of Sustainability and Transformation Partnerships in England (2017). Data source: ONS.

    Before spatial objects can be used with ggplot, they have to be converted to a data frame. This can be done using the tidy() function from the broom package. To add our data to the map, we then merged the resulting data frame with the winter data.

    The cut() function provided a convenient way to divide the variable bed.occupancy into meaningful intervals and to add labels that could be displayed on the map. Converting variables into factors and ordering the factor levels using factor() ensured that the intervals and months were in the right order. We were then ready to plot and save the map:
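
    The binning step can be sketched in base R (the breaks and labels here are illustrative, not necessarily those used on the published map):

```r
# Divide average occupancy into labelled bands for the map legend.
bed.occupancy  <- c(0.82, 0.88, 0.93, 0.97)
occupancy.band <- cut(bed.occupancy,
                      breaks = c(0, 0.85, 0.90, 0.95, 1),
                      labels = c("85% or less", "85-90%", "90-95%", "over 95%"))
```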


    How could we have improved this map further? One option might have been to interactively display STP averages when hovering over the map using the plotly package, for example.

    8.  What have we learned?

    Lots! No, seriously… Using this workflow, rather than copy and paste, has saved us a lot of time this winter. But beyond that, creating (then revisiting and improving) this workflow turned out to be a great way to work together and to make our analysis more transparent, more reproducible and easier to understand.

    Key points we are taking away:

    • With trusts, CCGs, STPs, ICSs and more to choose from when looking at local variation within health care systems, it can be challenging to pick the right organisational level. Local health care structures are also changing and evolving rapidly. We need to do more thinking about which level is most informative and compare different options; suggestions are welcome.
    • Some cleaning is a must. Winter data has a quick turnaround and NHS Digital only perform minimal validation. However, compared to previous years, the 2018/19 winter data quality has markedly improved (you can use our code to check the improvement for yourself).
    • But cleaning decisions impact interpretability. If the list of excluded trusts changes between years (based on how well they are reporting in any given year), this makes the data less comparable year-on-year.
    • We’ll play it safe with derived variables. The effort of rederiving them ourselves was worth it for the peace of mind we got from knowing that they were calculated consistently and meant exactly what we thought they meant.
    • Next winter, our future selves will be very happy to have our analysis nicely documented with code that is ready to go. It should even be easy to adapt the same workflow for other open NHS data sets (we’ll save this for another time).

    This blog was written by Fiona Grimm, Senior Data Analyst in the Data Analytics team at the Health Foundation. @fiona_grimm

  • Animating a Graph Over Time in Shiny

    I was interested in creating an animated graph in Shiny, which allows for an interactive input. As a mock example, I created an app which shows the distribution of a waitlist’s volume over time.

    Excel makes it easy to create graphs, but animating them can be tricky and often involves complicated VBA. In my opinion, R is far better suited to this task, thanks to its object-oriented nature.

    Simple, right?

    When I first had a go with making a simple animated plot, I thought it would be as simple as:

    • Create vector of distribution
    • Plot this vector as a bar chart
    • Update the new vector every time there is a trigger

    Unfortunately, that’s when I ran into reactivity.


    In short (and to my understanding), anything in R Shiny that will be updated should sit within a reactive environment. Flows of information in R work by pull, rather than push, mechanisms. This means the data is “looked at” only when needed by the output (i.e. when the output is created); the data doesn’t push any change forward to the output when it updates.



    Above, case 1 represents the normal situation in R once data has been updated. In case 2, a reactive environment is used, which effectively reverses the direction of the flow of information in R.

    Another, perhaps more relatable, way of interpreting reactivity is to imagine reading a csv file into a table in R called, for example, my_table. If you were to update values in the csv file, the object my_table would not be updated; the data would need to be re-read into R.

    Reactive objects can live inside something called reactiveValues, which acts like a list. There are 2 ways to create / update this:

    • Create the reactive list using my_reactive_list <- reactiveValues(), then update values in that list using $ (eg my_reactive_list$somedata <- c(1,20,23,42,98) )
    • Create the items while creating the list (eg my_reactive_list <- reactiveValues(somedata = c(1,20,23,42,98)) )

    To then use this data, simply call the object from the output you would like to use it for. For example, ggplot(my_reactive_list$somedata). When somedata changes, so will the plot output.

    It is worth noting that, just like lists, items within the reactiveValues object are not limited to vectors.

    More about reactivity can be found here:



    Here is the code I used to create this app:


    Though the code is long, a large proportion of it handles inputs and layout; a basic animated graph can be created with simpler code:

    • Data in reactiveValues() form
    • A plot output
    • A trigger – for example an actionButton() or a reactiveTimer()

    Shiny consists of a ui and a server element. Any visuals (e.g. a button or a plot output) should be held within the ui, while any functions or calculations that happen behind the scenes should live within the server function.
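
    Putting those three elements together, a minimal sketch (with made-up waitlist data) might look like this:

```r
# A minimal sketch: reactiveValues() data, a plot output, and an
# actionButton() trigger.
library(shiny)

ui <- fluidPage(
  actionButton("step", "Advance one period"),
  plotOutput("waitlist")
)

server <- function(input, output) {
  rv <- reactiveValues(volumes = c(5, 12, 20, 14, 6))   # waitlist distribution
  observeEvent(input$step, {
    # shuffle the distribution to mimic the waitlist changing over time
    rv$volumes <- sample(rv$volumes)
  })
  output$waitlist <- renderPlot({
    barplot(rv$volumes, main = "Waitlist volume by wait band")
  })
}

# Run interactively with: shinyApp(ui, server)
```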

    I generally find it useful to use a series of small functions in R, and especially in Shiny. Instead of putting all the code inside an actionButton’s event handler, calling a function leaves scope to reuse it with different triggers, such as a reactiveTimer() or looping. It also makes for easier reading, in my opinion.

    When initially attempting to create the app, I was taken aback by the complexity of Shiny and faced a steep learning curve. It took a few days to learn the basics, and after some practice I feel I know what is available in the Shiny toolbox, and some ways to work around problems.

    This blog was written by Dan Mohamed, Undergraduate Analyst at the NHS Wales Delivery Unit.

  • A run chart is not a run chart is not a run chart

    Understanding variation using runs analysis

    Run charts are simple and powerful tools that help discriminate between random and non-random variation in data over time – for example, measures of healthcare quality.

    Random variation is present in all natural processes. In a random process we cannot know the exact value of the next outcome, but from studying previous data we may predict the probability of future outcomes. So, a random process is predictable. Non-random variation, on the other hand, appears when something new, sometimes unexpected, starts to influence the process. This may be the result of intended changes made to improve the process or unintended process deterioration. The ability to tell random from non-random is crucial in quality improvement. One way of achieving this is runs analysis.

    In general, a run is defined as a sequence of like items that is preceded and followed by one or more items of another type or by no item. Items can be heads and tails, odd and even numbers, numbers above and below a certain value, etc.

    Runs analysis is based on knowledge of the natural distributions and limits of run lengths and the number of runs in random processes. For example, if we toss a coin 10 times and get all heads (1 run of length 10), we would think that something non-random is affecting the game. Likewise, if a run chart includes a run of 10 data points on the same side of the centre line, we would start looking for an explanation.

    Figure 1: Run charts from random numbers. A: random variation. B: non-random variation in the form of an upwards shift introduced after data point number 16 and identified by unusually long and few runs and signalled by a red, dashed centre line. See text for details on how to identify non-random variation.

    Specifically, a run chart may be regarded as a coin tossing game where the data points represent heads and tails depending on their position above and below the centre line (ignoring data points that fall directly on the centre line). If the process is random, the data points will be randomly distributed around the centre (Figure 1A). A shift in process location will affect the distribution of data points and will eventually present itself by non-random patterns in data, which can be identified by statistical tests (Figure 1B).

    Swed and Eisenhart studied the expected number of runs in random sequences. If the number of runs is too small or too large, it is an indication that the sequence is not random (1). To perform Swed and Eisenhart’s runs test one must either do rather complicated calculations or look up the limits for the expected number of runs in tables based on the total number of runs in the sequence and the number of items of each kind. Simplified tables for use with run charts have been developed for up to 60 data points (2). For more than 60 data points, the limits can be calculated using the normal approximation of the runs distribution function (3). For example, in a run chart with 24 data points, the expected number of runs (95% prediction limits) is 8-18.

    Chen proposed an alternative to Swed and Eisenhart’s method. Instead of counting the number of runs, Chen counts the number of shifts in the sequence, i.e. when a value of one kind is followed by a value of another kind, which is one less than the number of runs (4). To avoid confusing Chen’s shifts in sequence with shifts in process location, I use the term crossings.

    In run charts, crossings are easily counted by counting the number of times the graph crosses the median line. If the process is random, the chance of crossing or not crossing the median line between two adjacent data points is fifty-fifty. Thus, the total number of crossings has a binomial distribution, b(n − 1, 0.5), where n is the number of data points and 0.5 is the success probability.

    We should consider whether we are interested in both sides of the distribution, too few and/or too many crossings. By nature, a shift in process location following process improvement or deterioration will result in fewer crossings than expected. But unusually many crossings (oscillation) is also a sign of non-random variation, which will appear if data are negatively autocorrelated, that is, if any high number tends to be followed by a low number and vice versa. However, oscillation is not an effect of the process shifting location, but most likely a result of a poorly designed measure or sampling issues (5 p175). Chen recommends using one-sided tests suited for the purpose of the analysis, i.e. whether one is interested in detecting shifts or oscillations (4).

    For example, for a run chart with 24 data points we could choose the lower fifth percentile of the cumulative binomial distribution of 23 trials with a success probability of 0.5 as our critical value for the lower limits of crossings. This is easily calculated in R using the qbinom() function, qbinom(p = 0.05, size = 24 - 1, prob = 0.5) = 8, i.e. fewer than 8 crossings would be unusual and suggest that the process is shifting. In Figure 1B non-random variation in the form of a shift is identified by the fact that the chart has only 6 crossings when at least 8 would be expected from 24 random numbers.

    The number of crossings (and runs) is inversely related to the lengths of runs. All things being equal, fewer crossings give longer runs and vice versa. Therefore, a test for unusually long runs is also commonly used as a means to identify shifts. A simple example is the “classic” rule of thumb of a run of 8 or more data points on the same side of the centre line. But just like the expected number of crossings, the expected length of the longest run depends on the total number of data points. In a run chart with, say, 100 data points, we should not be surprised to find a run of 8.

    The distribution of longest runs has been described in detail by Schilling (6–8). The expected length of the longest run either above or below the median is log2(n), where n is the total number of data points, excluding data points that fall directly on the centre line. Approximately 95% of the longest runs are predicted to be within ±3 of the expected value. For the purpose of detecting a shift, we are interested in the upper prediction limit for longest run, which is log2(n)+3 (rounded to the nearest integer). For example, in a run chart of 24 data points, the upper 95% prediction limit for the longest run is round(log2(24) + 3) = 8, i.e. a run of more than 8 indicates a shift. Figure 1B has an unusually long run of 9 consecutive data points on the same side of the centre line.
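
    Both limits from the examples above can be computed in one go:

```r
# Prediction limits for a run chart with 24 useful observations.
n <- 24
longest_run_limit <- round(log2(n) + 3)   # 8: a longer run signals a shift
fewest_crossings  <- qbinom(p = 0.05, size = n - 1, prob = 0.5)  # 8: fewer crossings signal a shift
```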

    A trend is a special form of a run, where like items are defined as data points that are bigger or smaller than the preceding one. The trend test was developed by Olmstead who provided tables and formulas for the probabilities of trends of different lengths depending on the total number of data points (9). For example, with less than 27 data points in total, the chance of having a trend of 6 or more data points going up or down is less than 5%. Note that Olmstead defines a trend as the number of jumps rather than the number of data points that surround the jumps.

    In summary, there are (at least) four unusual run patterns that may be used to identify non-random variation in run charts:

    • Too many runs
    • Too few runs
    • Too long runs
    • Too long trends

    The selection of rules and the choice of critical values to define too many, too few and too long have significant influence on the statistical properties of run charts. This is the subject of the following sections.

    Signal confusion

    Compared to control charts, surprisingly little has been published on the statistical properties of run charts (2, 10–13).

    Carey introduced four run chart rules (10):

    1. too much variability: more runs than expected,
    2. too little variability: fewer runs than expected,
    3. a shift: a run of 8 or more data points on the same side of the median (if less than 20 data points in total, a run of 7 suggests a shift),
    4. a trend: a run of 6 (or 7) data points going up or down.

    Carey provided a probability table showing the expected range of number of runs based on Swed’s work (1).

    Perla, Provost, and Murray suggested a different set of run chart rules (2):

    1. a shift: a run of 6 or more data points on the same side of the median,
    2. a trend: a run of 5 data points going up or down,
    3. too much variability: more runs than expected,
    4. too little variability: fewer runs than expected,
    5. an astronomical data point: an unusually large or small number.

    Perla’s rules differ from Carey’s mainly in that the critical values used to detect shifts and trends are relaxed. Perla notes that rules 1-4 are based on an α error of p<0.05. However, the source of this information is unclear. Also, Perla does not say whether the α error applies to each individual rule or to the set as a whole.

    Finally, Olesen and I have proposed yet another set of run chart rules, which – for lack of a better term – I refer to as the Anhøj rules (11):

    1. Shift rule: A shift is present if any run of consecutive data points on the same side of the median is longer than its prediction limit.
    2. Crossings rule: A crossings signal is present if the number of times the graph crosses the median is smaller than its prediction limit.

    The prediction limits for longest run and number of crossings both depend on the number of useful observations (data points not on the median) and are based on the works of Schilling (8) and Chen (4) as described in the previous section. Prediction limits for 10-60 data points are tabulated in the appendix.

    In general, the Anhøj rules are more conservative (less sensitive, more specific) than the Carey and Perla rules. For example, with 24 data points the Anhøj rules take a run of 9 to signal a shift, while Perla and Carey only take 6 and 8, respectively (Figure 2).

    Figure 2: Run chart with random variation according to the Carey and Anhøj rules but with both a shift and a trend according to the Perla rules. The chart has 30 observations of which 27 are not on the median. The longest run has 6 data points below the median, there are 11 crossings, and the longest trend has 6 data points (excluding a tie) going down.

    Which rules to rule?

    So, which run chart rules are best? To answer that question, we need to agree on what we mean by “best”. The purpose of any data display is to help us make good decisions. The link between data and decisions is prediction. As W Edwards Deming put it: “A curve fitted to a set of points is of interest, not on account of the data fitted, but because of data not yet fitted. How will this curve fit the next batch of data?” Specifically, regarding run charts, the question of interest is: how well does the centre (e.g. median) of historical data represent the centre of future data? In a random process, historical data reliably predicts future data. In a non-random process, this may or may not be the case. In practice, however, we cannot know in advance whether a process is random or non-random, which is why we need statistical tests (rules) to help us make the distinction. But even the most reliable statistical tests make mistakes, sometimes overlooking the condition being tested for (false negative) and sometimes identifying a condition that is not there (false positive).

    The questions of interest for run chart users are what is the chance that a chart with a positive test really represents non-random variation? and what is the chance that a chart with a negative test really represents random variation?

    Likelihood ratios are diagnostic measures designed to answer such questions. The positive likelihood ratio, LR+, is the true positive rate divided by the false positive rate. The negative likelihood ratio, LR–, is the false negative rate divided by the true negative rate:
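
    Written out as code (the sensitivity and specificity values here are purely illustrative), the two ratios are:

```r
# LR+ = true positive rate / false positive rate
# LR- = false negative rate / true negative rate
sensitivity <- 0.80                                # true positive rate
specificity <- 0.92                                # true negative rate
LR_pos <- sensitivity / (1 - specificity)          # 10: shift ~10x more likely given a signal
LR_neg <- (1 - sensitivity) / specificity          # ~0.22: shift less likely given no signal
```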


    A likelihood ratio greater than 1 speaks in favour of the condition being tested for, which in our case is non-random variation, while a likelihood ratio less than 1 speaks against the presence of non-random variation. The further a likelihood ratio is from 1, the more or less likely is the presence of non-random variation.

    Thus, likelihood ratios allow us to quantify the probability of non-random variation in data sequences and are useful quality characteristics of run (and control) chart rules.

    Figure 3: Likelihood ratios of different sets of run chart rules. See text for details.

    In Run chart diagnostics I studied and compared likelihood ratios for the Perla, Carey and Anhøj rules (12) using simulated random data series to mimic run charts.

    In half the simulations, a shift in process mean of 2 SD was introduced. So, for each simulated run chart, the true presence or absence of a shift, together with the presence or absence of signals from the runs analyses, was known, allowing the program to calculate likelihood ratios for each set of run chart rules.

    For each simulated data series, the median was calculated using the first 6, 12 or 18 data points as baseline. And the shifts (when present) were introduced in the post-baseline period of 6, 12 or 18 data points. Consequently, there were nine combinations of baseline and post-baseline periods of different lengths allowing me to study the influence of these parameters on the diagnostic value of the tests.

    For each of the nine possible combinations of baseline and post-baseline length, 1,000 simulations were performed with and 1,000 without post-baseline shift. In total, 18,000 run charts were simulated.

    The results are summarised in Figure 3. Overall, the Anhøj and Carey rules perform “better” than the Perla rules – the Anhøj rules slightly but consistently better than the Carey rules. For run charts with 12 or more data points in the post-baseline period, the Anhøj and Carey rules perform very well, with positive LRs around 10 and negative LRs around 0.1. The interpretation is that, given a positive test based on the Anhøj or Carey rules, the presence of a shift of 2 SD is about 10 times more likely than no shift; and given a negative test, a shift is about 10 times less likely than no shift.

    In the previous paragraph, I use the term “better” to describe likelihood ratios with a wide and symmetric range centered around 1. Symmetry is important if we want equal protection from false positive and false negative results. The Perla rules have very low negative LRs meaning that negative tests with great certainty rule out shifts. However, they also have low positive LRs suggesting that positive tests are only 2–4 times more likely to be associated with true shifts in process performance than not.

    The main reason for the low value of positive tests signalled by the Perla rules is the trend rule, which has been shown to add little more than false signals to runs analysis (14). In particular, the Perla version of the trend rule (5 points in a row) is flawed, having a false positive signal rate around 20%. Perla’s shift rule, which signals when only 6 data points in a row are on the same side of the median, is also responsible for a significant number of false alarms.

    qicharts2: Quality improvement charts with R

    The qicharts2 package for R(15) employs the Anhøj rules as standalone rules with run charts and as sensitising rules with control charts (Figure 3).

    The latest stable version of qicharts2 can be installed from CRAN:

    install.packages('qicharts2')
    The current development version can be installed from GitHub:

    devtools::install_github('anhoej/qicharts2', build_vignettes = TRUE)

    Along with the package comes a vignette that explains its use and provides several examples using actual healthcare data:

    vignette('qicharts2')
    Run charts or control charts?

    It is a common misunderstanding that run charts are inferior to control charts. As Figure 4 clearly demonstrates, the runs analysis is more sensitive to minor persistent shifts in data than the control limits, which only react to larger shifts. In a recent paper, Tore Wentzel-Larsen and I showed that runs analysis with the Anhøj rules is actually comparable to, or even better than, traditional control chart rules at identifying minor to moderate persistent shifts in data (13).

    Figure 4: Control chart of random normal variables. In the last 8 data points a shift of 2 SD was introduced. All data points are within the control limits, but the shift is signalled by an unusually long run of 9 data points.

    In my view, run and control charts are two sides of the same coin, and I refer to them collectively as statistical process control charts. I recommend to always use the Anhøj rules first together with a visual inspection of the run chart for other signs of non-random variation. If – and only if – the run chart finds random variation and the extra sensitivity of control limits to identify larger, possibly transient, shifts in data is needed, a control chart may be useful.

    For improvement work, run charts are usually all we need, because we are expecting persistent shifts in data. For quality control – when a process has been stabilised at a satisfactory level – control limits are useful to quickly identify sudden larger unwanted shifts in data.
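    The choice between the two can be sketched in a few lines of qicharts2 code (simulated data; the seed and shift size are arbitrary):

    ```r
    library(qicharts2)

    # Simulate 12 in-control points followed by 12 points shifted 2 SD upwards
    set.seed(1)
    y <- c(rnorm(12), rnorm(12, mean = 2))

    qic(y)              # run chart: runs analysis (Anhøj rules) applied by default
    qic(y, chart = 'i') # individuals control chart: adds control limits
    ```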


    In conclusion, several sets of run chart rules with different diagnostic properties are currently available. The Perla rules, recommended by the Institute for Healthcare Improvement (US) and widely used within the National Health Service (UK), have poor diagnostic value, mainly due to a high false signal rate. The Anhøj and the Carey rules have better diagnostic properties, reliably distinguishing random from non-random variation and balancing the risk of false positive and false negative signals.

    Appendix: Critical values for longest run and number of crossings
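    The table itself did not survive formatting, but the critical values can be computed directly from the formulas behind the Anhøj rules (11): the upper limit for the longest run is log2(n) + 3 rounded to the nearest whole number, and the lower limit for the number of crossings is the 5th percentile of a binomial distribution with n − 1 trials and probability 0.5. A minimal R sketch reproducing the table:

    ```r
    # Critical values for runs analysis based on the Anhøj rules (11).
    # n = number of useful observations (data points not on the median).
    n <- 10:30

    critical_values <- data.frame(
      n               = n,
      longest_run_max = round(log2(n) + 3),          # upper limit for longest run
      crossings_min   = qbinom(0.05, n - 1, 0.5)     # lower limit for crossings
    )

    critical_values
    ```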


    1. Swed FS, Eisenhart C. Tables for testing randomness of grouping in a sequence of alternatives. The Annals of Mathematical Statistics [Internet]. 1943 [cited 2018 Oct 7]; 14:66–87. Available from: http://www.jstor.org/stable/2236004.

    2. Perla RJ, Provost LP, Murray SK. The run chart: A simple analytical tool for learning from variation in healthcare processes. BMJ Qual Saf [Internet]. 2011 [cited 2018 Oct 7]; 20:46–51. Available from: http://qualitysafety.bmj.com/content/20/1/46.

    3. Provost LP, Murray SK. The health care data guide: Learning from data for improvement. San Francisco: John Wiley & Sons Inc.; 2011.

    4. Chen Z. A note on the runs test. Model Assisted Statistics and Applications [Internet]. 2010 [cited 2018 Oct 7]; 5:73–7. Available from: https://content.iospress.com/articles/model-assisted-statistics-and-applications/mas00142.

    5. Western Electric Company. Statistical quality control handbook. New York: Western Electric Company inc.; 1956.

    6. Schilling MF. The longest run of heads. The College Mathematics Journal [Internet]. 1990 [cited 2018 Oct 7]; 21:196–207. Available from: https://www.jstor.org/stable/2686886.

    7. Schilling MF. Long run predictions. Math Horizons [Internet]. 1994 [cited 2018 Oct 7]; 1:10–2. Available from: http://www.jstor.org/stable/25677964.

    8. Schilling MF. The surprising predictability of long runs. Mathematics Magazine [Internet]. 2012 [cited 2018 Oct 7]; 85:141–9. Available from: http://www.jstor.org/stable/10.4169/math.mag.85.2.141.

    9. Olmstead PS. Distribution of sample arrangements for runs up and down. The Annals of Mathematical Statistics [Internet]. 1946 [cited 2018 Oct 7]; 17:24–33. Available from: http://www.jstor.org/stable/2235901.

    10. Carey RG. How do you know that your care is improving? Part 1: Basic concepts in statistical thinking. J Ambulatory Care Manage. 2002; 25(1):80–7.

    11. Anhøj J, Olesen AV. Run charts revisited: A simulation study of run chart rules for detection of non-random variation in health care processes. PLoS ONE [Internet]. 2014 [cited 2018 Oct 7]. Available from: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0113825.

    12. Anhøj J. Diagnostic value of run chart analysis: Using likelihood ratios to compare run chart rules on simulated data series. PLoS ONE [Internet]. 2015 [cited 2018 Oct 7]. Available from: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0121349.

    13. Anhøj J, Wentzel-Larsen T. Sense and sensibility: On the diagnostic value of control chart rules for detection of shifts in time series data. BMC Medical Research Methodology [Internet]. 2018 [cited 2018 Oct 7]. Available from: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-018-0564-0.

    14. Davis RB, Woodall WH. Performance of the control chart trend rule under linear shift. Journal of Quality Technology. 1988; 20:260–2.

    15. Anhøj J. Qicharts2: Quality improvement charts for r. JOSS [Internet]. 2018 [cited 2018 Oct 7]. Available from: https://joss.theoj.org/papers/10.21105/joss.00699.

    16. Anhøj J, Hellesøe A-MB. The problem with red, amber, green: The need to avoid distraction by random variation in organisational performance measures. BMJ Qual Saf [Internet]. 2016 [cited 2018 Oct 7]. Available from: https://qualitysafety.bmj.com/content/26/1/81.

    This article is based on previously published, peer reviewed articles, and large portions of the text have been copied from their sources.

    Jacob Anhøj

  • A simple function to create nice correlation plots

    The problem

    I was working with a dataset where I wanted to assess the correlation of different variables in R. As much as I like R – the outputs from the console window leave something to be desired (in terms of data visualisation). Therefore, I wanted a way to visualise these correlations in a nicer / cleaner / crisper way. The solution to this is to use a correlation plot.

    Loading the correlation plot package

    The package I used for creating my correlation plots was the corrplot package, which can be installed and loaded into the R workspace using the syntax below:

    install.packages("corrplot")
    library(corrplot)
    At this point I would encourage you to check out help for the corrplot function, as it allows you to pass a multitude of parameters to the function.

    Deconstructing the function

    As mentioned previously, this plotting function has a multitude of uses, but all the parameters can be off-putting to a newbie! This was me 6 years ago, vigorously typing ‘how to do this with R relating to x’ and bombarding Stack Overflow and other useful websites with questions. Shout out to R-bloggers as well!

    The function I have created uses the functionality of the corrplot packages, but it simplifies the inputs. I will include the function in stages to explain each step, however, if you just want to use the function and are not bothered with the underpinnings then skip the following section:

    Step 1 – Function parameters

    Parameters of the function are as below:

    create_gh_style_corrplot <- function(df_numeric_vals,
                                         method_corrplot,
                                         colour_min,
                                         colour_middle,
                                         colour_max = "green") {

    The parameters to pass to the function are:

    1. df_numeric_vals – a data frame of numeric values only, so any categorical (factor) data needs to be stripped out before passing the data frame to the function;
    2. method_corrplot – a numeric value in the range 1–5 that selects the plot type. For a shaded correlation plot you would use 1; the various options are discussed below, where I describe how the if statement works;
    3. colour_min – the gradient colour for the strongest negative correlations. An example input here would be “green”;
    4. colour_middle – the middle-range colour; normally I set this equal to (=) “white”;
    5. colour_max – the colour of the strongest positive correlations.

    For information on the strength of correlations, refer to this simple guide.

    Step 2 – Creating the conditional (IF statement) to select correlation plot type

    The below conditional statement uses the input of the function e.g. 1-5 to select the type of chart to display. This is included in the code block below:

    if (method_corrplot == 1) {
      type_var <- "shade"
    } else if (method_corrplot == 2) {
      type_var <- "number"
    } else if (method_corrplot == 3) {
      type_var <- "pie"
    } else if (method_corrplot == 4) {
      type_var <- "ellipse"
    } else if (method_corrplot == 5) {
      type_var <- "circle"
    } else {
      type_var <- "shade"
    }
    method_corrplot <- type_var

    What does this do then? Well firstly nested in the function I make sure that the corrplot library is referenced to allow for the correlation plot functionality to be used. The next series of steps repeat this method:

    • Basically, this says that if the method_corrplot parameter of the function equals input 1, 2, 3, etc – then select the relevant type of correlation plot.
    • The type_var is a variable that sets the value of the variable equal to the string stated. These strings link directly back to the parameters of the corrplot function,  as I know a type of correlation plot is equal to shade or number, etc.
    • Finally, the last step is to convert method_corrplot equal to the textual type specified in the preceding bullet.

    In essence, the numeric value passed into the parameter (e.g. 1) sets type_var to a text string that matches something corrplot is expecting, and method_corrplot is then set equal to that type variable – turning the integer passed into the parameter into a string / character value.
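    As an aside, the same integer-to-string mapping could be written more compactly with R’s switch(), which returns the nth unnamed element when given an integer (a sketch of an alternative, not part of the original function):

    ```r
    # Map the numeric input 1-5 to the corrplot method string
    method_corrplot <- switch(method_corrplot,
                              "shade", "number", "pie", "ellipse", "circle")

    # switch() returns NULL for inputs outside 1-5, so fall back to "shade"
    if (is.null(method_corrplot)) method_corrplot <- "shade"
    ```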

    Step 3 – Hacking the corrplot function

    As specified in the previous sections, this function has a lot of inputs and is in need of simplifying, so that is exactly what I have tried to do. The corrplot function is the last step in my more simple function to take lots of parameters and simplify down to just 5 input parameters:

    corrplot(cor(df_numeric_vals, use = 'all.obs'), method = method_corrplot,
             order = "AOE",
             addCoef.col = 'black',
             number.cex = 0.5,
             tl.cex = 0.6,
             tl.col = 'black',
             col= colorRampPalette(c(colour_min, colour_middle, colour_max))(200),
             cl.cex = 0.3)

    Let’s explain this function.

    So, the corrplot function is the main driver here, and the nested cor call is just as important, as this is the command that creates the correlation matrix. The settings are: the df_numeric_vals data frame is the data for the function; use = 'all.obs' tells the function to use all observations in the data frame; and method = method_corrplot uses the if statement created in step 2 to select the relevant chart type from the input. The order = "AOE" setting uses the angular ordering method, and addCoef.col = 'black' sets the coefficient values to always show in black, as does tl.col for the colour of the labels. The background colour of the correlation plot uses the colorRampPalette function to create a gradient scale, and the parameters of each of the colour settings link to those inputs I explained in step 1.
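    It is worth noting that colorRampPalette() does not return colours directly; it returns a function that generates any number of interpolated colours between the anchors you give it:

    ```r
    # Build a palette function from three anchor colours,
    # then draw 200 interpolated colours from it (as the corrplot call does)
    pal <- colorRampPalette(c("green", "white", "red"))
    colours_200 <- pal(200)

    length(colours_200)   # 200 hex colour codes from green through white to red
    head(colours_200)
    ```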

    The full function is detailed here:

    create_gh_style_corrplot <- function(df_numeric_vals,
                                         method_corrplot,
                                         colour_min,
                                         colour_middle,
                                         colour_max = "green") {
      if (method_corrplot == 1) {
        type_var <- "shade"
      } else if (method_corrplot == 2) {
        type_var <- "number"
      } else if (method_corrplot == 3) {
        type_var <- "pie"
      } else if (method_corrplot == 4) {
        type_var <- "ellipse"
      } else if (method_corrplot == 5) {
        type_var <- "circle"
      } else {
        type_var <- "shade"
      }
      method_corrplot <- type_var

      corrplot(cor(df_numeric_vals, use = 'all.obs'), method = method_corrplot,
               order = "AOE",
               addCoef.col = 'black',
               number.cex = 0.5,
               tl.cex = 0.6,
               tl.col = 'black',
               col = colorRampPalette(c(colour_min, colour_middle, colour_max))(200),
               cl.cex = 0.3)
    }

    If you want to use the function, just copy and paste this code into an R script file and this will create the function for you. Please remember to install the corrplot package first, using install.packages("corrplot").

    Utilising the function

    The example dataset I will use here is the mpg sample dataset provided by the ggplot2 package. Run the R script provided towards the end of the last section first, as this will create the function in R’s environment. Next, add this code to the end to look at the various iterations and charts that can be created from the data:

    ##------------------CREATE DATASET---------------------------------------
    library(ggplot2) # provides the mpg sample dataset
    numeric_df <- data.frame(mpg[c(3,5,8,9)])
    #These are the numeric variables in the data frame to use with my function
    ##------------------USE FUNCTION-----------------------------------------
    create_gh_style_corrplot(numeric_df,1, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,2, "steelblue2","black", "black")
    create_gh_style_corrplot(numeric_df,3, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,4, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,5, "steelblue2","white", "whitesmoke")

    The outputs of the charts depend on the correlation plot type selected (1–5) and the colour ranges chosen. You can choose any colour, and I would recommend using the command colours() in the R console or a script to pull up the list of colours native to R.

    How about these visualisations:


    Each plot can be tailored to suit your needs. I tend to like blue shades, but go all out and choose whatever colours you like. The R source code is accessible here.

    I do hope you will use this function to maximise your correlation plots – it’s all about relationships!

    This blog was written by Gary Hutson, Principal Analyst, Activity & Access Team, Information & Insight at Nottingham University Hospitals NHS Trust, and was originally posted here.

  • From script-based development to function-based development and onwards to package-based development: part 2

    At the NHS R Conference, I suggested to people that they should embrace the idea of package-based development rather than script-based work.

    In the first part of this tutorial, in the fictional NHS-R Community greeting room, our humble analyst was tasked with greeting people. Rather than writing a script and needing to repeat themselves all the time with different variations of greetings and senders, they wrote a rather nifty little function to do this:

    greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {
      if (!is.character(greeting)) {
        stop("greeting must be a string")
      }
      if (!is.character(sender)) {
        stop("sender must be a string")
      }
      if (length(sender) > 1) {
        warning("greet_person isn't very good at handling more than one sender. It is better to use just one sender at a time.")
      }
      message(greeting, " from ", sender)
    }

    As we know, R is awesome and many people took up R on the background of some excellent publicity and training work by the NHS-R community. Our poor greeting team got overwhelmed by work: it is decided that the team of greeters needs to expand. There will now be a team of three greeters. Every other bit of work output from our NHS-R community team will involve greeting someone before we present our other awesome analysis to them.

    This is going to be a nightmare! How can we scale our work to cope with multiple users, and multiple other pieces of work using our greeting function?

    If we rely upon the scripts, we have to trust that others will use the scripts appropriately and not edit or alter them (accidentally or on purpose). Furthermore, if someone wants to greet someone at the beginning of their piece of analysis, they’ll either have to copy the code and paste it somewhere, or link to our script containing the function – which in turn means they need to keep a common pathname for everything and hope no-one breaks the function. Nightmare!

    Fortunately, someone attended the last NHS-R conference and remembered that package-based development is a really handy way of managing to scale your R code in a sustainable way. So after a team meeting with copious caffeine, it is decided that greet_person needs to go into a new package, cunningly named NHSRGreetings. And here’s how we’re going to do it.

    In R Studio, go to File and then to New Project. Click on New Directory, and then click on R Package. I am using RStudio 1.2 Preview for this tutorial which is downloadable from the R website. I would recommend doing this as some of the package management has been greatly simplified and some frustrating steps removed.

    Click on ‘Open in new session’ (so we can see the original code), and set the Package name as NHSRGreetings. We could just pull our old source files into the package – but for this tutorial I’m going to do things the longer way so you also know how to create new functions within an existing package.

    Set the project directory to somewhere memorable.

    For now don’t worry about the git or packrat options – those are tutorials within themselves!

    You are greeted with a package more or less configured up for you. A single source file, ‘hello.R’ is set up for you within an ‘R’ directory within the package. It’s not as cool as our function of course, but it’s not bad! It comes with some very helpful commented text:

    # Hello, world!
    # This is an example function named 'hello'
    # which prints 'Hello, world!'.
    # You can learn more about package authoring with RStudio at:
    #   http://r-pkgs.had.co.nz/
    # Some useful keyboard shortcuts for package authoring:
    #   Install Package:           'Cmd + Shift + B'
    #   Check Package:             'Cmd + Shift + E'
    #   Test Package:              'Cmd + Shift + T'

    So let’s check if the comments are right – hit Cmd + Shift + B on a Mac (on Windows and Linux you should see slightly different shortcuts). You can also access these options from the Build menu in the top right pane.

    You will see the package build. R will then be restarted, and you’ll see it immediately performs the command library(NHSRGreetings), which loads our newly built package.

    If you type hello() at the command line, it will do as you may expect it to do!

    [1] "Hello, world!"

    So – time to customise our blank canvas and introduce our much more refined greeter.

    In the root of our project you will see a file called DESCRIPTION. This contains all the information we need to customise our R project. Let’s customise the Title, Author, Maintainer and Descriptions for the package.

    We can now create a new R file, and save it in the R subdirectory as greet_person.R. Copy over our greet_person function. We should be able to run install and our new function will be built in as part of the package.

    We can now get individual team members to open the package, run the build once on their machine, and the package will be installed onto their machine. When they want to use any of the functions, they simply use the command library(NHSRGreetings) and the package will be ready to go with all the functions available to them. When you change the package, the authors will have to rebuild the package just the once to get access to the new features.

    When writing packages it is useful to be very wary about namespaces. One of the nice things about R is that there are thousands of packages available. The downside is that it makes it very likely that two individuals can choose the same name for a function. This makes it doubly important to pick appropriate names for things within a package.

    For example, what if instead of the NHSRGreetings package someone wrote a CheeseLoversRCommunity package with a similarly named greet_person, but it did something totally different?

    In a script, you have full control over the order you load your packages, so R will happily let you call functions from packages and trust that you know what order you loaded things in.

    If you are a package author however, it’s assumed you may be installed on many machines, each with a potentially infinite set of combinations of different packages with names that may clash (or if they don’t already they might do in the future).

    So within the package, any function which doesn’t come from R itself needs to have clearly defined which package it has come from.

    Within DESCRIPTION you must define which package you use, and the minimum version. You do this with the Imports keyword. Attached is the Imports section of one of the SSNAP packages:

    Imports:
        methods (>= 3.4.0),
        lubridate (>= 1.7.4),
        tibble (>= 1.4.2),
        dplyr (>= 0.7.5),
        tibbletime (>= 0.1.1),
        glue (>= 1.2.0),
        purrr (>= 0.2.5),
        rlang (>= 0.2.0),
        readr (>= 1.1.1),
        stringr (>= 1.3.1),
        ssnapinterface (>= 0.0.1)

    Next within your functions, rather than just calling the functions use the package name next to the function. For example instead of calling mutate() from the dplyr package, refer to it as dplyr::mutate() which tells R you mean the mutate function from the dplyr package rather than potentially any other package. There are ways to declare functions you are using a lot within an individual file – but this method makes things pretty foolproof.
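    A small sketch of what this looks like in practice (the data frame and column names here are hypothetical):

    ```r
    # Inside a package, qualify every non-base function with its package name.
    # 'df' is assumed to be a data frame with numeric columns x and y.
    summarise_measurements <- function(df) {
      df <- dplyr::mutate(df, z = x + y)    # unambiguously dplyr's mutate()
      dplyr::arrange(df, dplyr::desc(z))    # likewise for arrange() and desc()
    }
    ```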

    Another tip is to avoid the magrittr pipe within package functions. Whilst magrittr makes analysis scripts nice and clean, firstly you still have the namespace issue to deal with (%>% is actually a function, just one with a funny name – it is really called magrittr::`%>%`()!). Secondly, the way magrittr works can make debugging difficult. You don’t tend to see that from a script. But if you’re writing code in a package, which calls a function in another package, which calls code in another package, which uses magrittr – you end up with a really horrid nest of debugging errors: it is better to specify each step with a single variable which is reused.
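    The contrast between the two styles can be sketched as follows (df and its columns are hypothetical):

    ```r
    # Pipeline style - concise in a script, but harder to step through
    # in a package's call stack:
    # result <- df %>% dplyr::filter(x > 0) %>% dplyr::mutate(y = x * 2)

    # Step-by-step style - each intermediate can be inspected in the debugger:
    filtered <- dplyr::filter(df, x > 0)
    result   <- dplyr::mutate(filtered, y = x * 2)
    ```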

    When you’ve got your code in, the next important thing to do is check your package. Build simply makes sure your code works. Check makes sure that you follow a lot of ‘rules’ of package making – including making sure R can safely and clearly know where every R function has come from. Check also demands that all R functions are documented: something which is outside of the scope of this tutorial and is probably the subject for another blog post – a documented function means if you type ?greet_person that you should be able to see how to use the function appropriately. It can help you create your own website for your package using the pkgdown package.

    If your package both builds and checks completely and without errors or warnings, you might want to think about allowing the wider public to use your project. To do this, you should consider submitting your project to CRAN. This involves a fairly rigorous checking process but means anyone can download and use your package.

    If we can get enough people to develop, share their code and upload their packages to CRAN we can work together to improve the use of R across our community.

    Feedback and responses to @drewhill79.

    This blog was written by Dr. Andrew Hill, Clinical Lead for Stroke at St Helens and Knowsley Teaching Hospitals.

  • Roadmap to collaborative working using R in the NHS: Part I- Workflows

    We finally have a tool that is data independent. R allows the scripting of a data science methodology that we can share and develop and which does not necessarily depend on carrying patient data with it. This means that the door to working collaboratively on solutions is now wide open. But before this happens in a truly robust way there are several hurdles that need to be overcome.  This is the first part of a series of articles in which I hope to outline the roadmap to networking our solutions in R so we can work collaboratively between Trusts to accelerate data science solutions for healthcare data.

    Standardising R Workflows in the NHS

    In the throes of a new(ish) tool such as R, everybody has a way of working which is slightly different from everyone else’s. This is how best practice evolves, but in the end best practice has to be standardised so that developers can work together. This article outlines my best practice for R script organisation and is an invitation for comment, to see if there is any interest in developing an NHS standard for organising our solutions.

    Principles of R script development

    As the R development environment is so methods based, it makes sense to have a standardised way to develop scripts, so that different developers understand the basic workflow of the data and can focus on the specific methodology for the problem at hand, rather than disentangling endless subsets of data and how each is cleaned and merged. The principles below form my approach to R script development.

    a) An R script should focus on a specific problem.

    A solution is only as good as the question. Questions come in a variety of scopes and shapes and there is an art to asking a solvable question which is beyond the limits of this article.

    Having defined a question, a script should set out to be the solution to that question and not be a generic answer. Generic answers belong as functions, or collections of functions called packages. An R script should tackle a specific problem such as ‘How many endoscopies did we perform last year?’. If you find that this kind of question is asked a lot (‘How many x did we perform in the last y years?’), then the script might become a function, and a collection of functions might become a package.

    b) The R script should access the minimal dataset needed and avoid intermediate datasets.

    There is a real danger with data analysis that the dataset used is huge when you only need a part of it. With R you have the ability to specify, from within the script, the data used, so you should use the minimum dataset that is pertinent to the question. In fact the whole script should be specific to the question being asked. The data access should, as far as possible, also be from the data repository rather than an intermediate dataset. For example, you can issue a SQL query from within R to an electronic patient record (EPR), rather than get a data dump from the EPR into, say, an Excel spreadsheet and then import the spreadsheet. It is more secure and avoids versioning issues with the data dump.
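    Querying the repository directly might look like the following sketch using the DBI and odbc packages; the DSN, table and column names are hypothetical and will differ for your EPR:

    ```r
    # Connect to the EPR database via an ODBC data source.
    # "EPR", "endoscopy", and the column names are illustrative only.
    con <- DBI::dbConnect(odbc::odbc(), dsn = "EPR")

    # Pull only the minimal, question-specific dataset
    colonoscopies <- DBI::dbGetQuery(con, "
      SELECT patient_id, procedure_date
      FROM endoscopy
      WHERE procedure_type = 'colonoscopy'
    ")

    DBI::dbDisconnect(con)
    ```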

    c) An R script should be organised according to a standardised template

    All analysis I perform adheres to a specific workflow for each script so that the script is separated into specific subsections that perform types of actions on the data. This also incorporates the important aspect of commenting on each part of the workflow. This is important so that developers can understand the code further down the line. The workflow I use is as follows:

    ## Title ##

    ## Aim ##

    ## Libraries ##

    ## Data access ##

    ## Data cleaning & Data merging ##

    ## Data mutating (creating columns from pre-existing data) ##

    ## Data forking (filtering and subsetting) ##

    ## Documenting the sets (usually creation of consort type diagrams with diagrammeR) ##

    ## Documenting the code with CodeDepends ##

    ## Data analysis ##

    Data access

     The title of the script including the author and date is written at the top. The aim of the script is then stated along with any explanation (why am I doing this etc.). The workflow makes sure that all libraries are loaded at the beginning. Data access is also maintained at the top so anyone can immediately see the starting point for the analysis. Data access should specify the minimal dataset needed to answer the specific question of the script as explained above. For example there is no point using a dataset of all endoscopies between 2001 and 2011 when your script is only looking at colonoscopy specifically. I also try to avoid functions such as file.choose() as I like to keep the path to the source file documented, whether it is a local or remote repository.

    Data cleaning & Data merging

     The difficult task of data cleaning and merging with other datasets is then performed. One of my data principles is that when working with datasets you should start with the smallest dataset that answers all of your questions and then filter down for each question rather than build up a dataset throughout a script, so I like to merge external datasets early when possible. This could be called early binding technically but given the data access part of the code specifies a data set that is question-driven, I am early binding to a late-bound set (if that makes sense).

    Data mutating

     Once cleaning and merging are performed, subsets of the data can be created for specific subsections of the question, and the statistical analysis then performed on each subset as needed.

    I find it very useful to keep track of the subsets of data being used. This allows for sanity checking but also gives a visual overview of any data that may have been ‘lost’ along the way. I routinely use DiagrammeR for this purpose, which gives me consort-type diagrams of my dataset flows.
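
    A consort-type flow can be sketched with DiagrammeR’s grViz() function, which takes a Graphviz DOT description; the node labels and counts below are invented:

    ```r
    library(DiagrammeR)

    # Consort-type overview of how the dataset shrinks through the script
    grViz("
    digraph consort {
      node [shape = box]
      all    [label = 'All endoscopies\\nn = 5000']
      colon  [label = 'Colonoscopies only\\nn = 2100']
      merged [label = 'Merged with endoscopist data\\nn = 2080']
      all -> colon -> merged
    }
    ")
    ```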

    The other aspect is to examine the code documentation, and for this I use CodeDepends, which allows me to create a flow diagram of my code (rather than of the datasets). Using DiagrammeR and CodeDepends gives me an overview of my script, rather than trying to debug line by line.
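
    A minimal sketch of the CodeDepends step, assuming its readScript() and getInputs() functions (the script file name is hypothetical):

    ```r
    library(CodeDepends)

    # Parse the script and extract, for each expression, which variables
    # it reads and which it creates -- the raw material for a code flow diagram.
    sc   <- readScript("Colonoscopies_DxRate.R")  # hypothetical script file
    info <- getInputs(sc)
    info
    ```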

    d) R scripts should exist with a standardised folder structure for each script

    R scripts often exist within a project. You may be outputting image files you want access to later, as well as needing other input files. RStudio maintains R scripts as a project and creates a file system around each script, and several packages will also set up a file dependency system for a script, so that at the very least the organisation around the R script is easy to navigate.

    e) R files should have a standard naming convention.

    This is the most frustrating problem when developing R scripts. I have a few scripts that extract text from medical reports, a few that do time series analysis on patients coming to have endoscopies, and again a few that draw Circos plots of patient flows through the department. In the end that is a lot of scripts. There is a danger of creating a load of folders with names like ‘endoscopyScripts’ and ‘timeSeries’ that don’t categorise my scripts according to any particular system. The inevitable result is lost scripts and repeated development. Categorisation and labelling systems are important so you can avoid re-inventing the same script. As the entire thrust of what I do with R is ultimately to develop open source packages, I choose to name scripts and their folders according to the questions I am asking. The naming convention I use is as follows:

    Top level folder: Name according to question domains (defined by generic dataset)

    Script name: Defined by question in the script dataset_FinalAnalysis

    The R developer will soon come to realise which question domains are most frequently asked, and I would suggest these are used as the naming convention for top-level folders. I would avoid categorising files according to the method of analysis. As an example, I develop a lot of scripts for the extraction of data from endoscopies. In general I either do time series analysis on them or I do quality analysis. The questions I ask of the data are things like ‘How many colonoscopies did we do last year?’ or ‘How are all the endoscopists performing when measured by their diagnostic detection rates for colonoscopy?’. I could name the files ‘endoscopyTimeSeries’ or ‘endoscopyQualityAssess’, but this mechanistic labelling doesn’t tell me much. By using question-based labelling I can start to see patterns when looking over my files. According to my naming convention I should create a folder called ‘Endoscopy’, and the script names should be ‘Colonoscopies_DxRate’ and ‘Colonoscopies_ByYear’. The next time I want to analyse a diagnostic rate, maybe for a different dataset like gastroscopies, I can look through my scripts, see I have done a similar thing already, and re-use it.

    In the end, the role of categorizing scripts in this way allows me to see a common pattern of questions. The re-usability of already answered questions is really the whole point of scripting solutions. Furthermore it allows the deeply satisfying creation of generic solutions which can be compiled into functions and then into packages. This has already been expanded upon here.

    f) Always use a versioning system for your R scripts

    R scripts need to be versioned as scripts may change over time. A versioning system is essential to any serious attempt at providing solutions. There are two aspects to versioning: firstly the scripts may change, and secondly the packages may change. Dependency versioning can be dealt with by using the checkpoint package within the scripts. This essentially freezes the dependency versions, so that the package versions that worked with the current version of the script continue to be used. I have found this very useful for avoiding script errors that are out of my control.
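
    A sketch of dependency freezing with the checkpoint package (the snapshot date is illustrative):

    ```r
    ## Libraries ##
    library(checkpoint)

    # Freeze all package dependencies to the versions available on the
    # chosen snapshot date, so the script keeps working as packages evolve.
    checkpoint("2019-01-01")

    library(dplyr)  # now loaded from the frozen snapshot
    ```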

    The other issue is that of versioning between developers. I routinely use GitHub as my versioning system. This is not always available from within Trusts, but there are other versioning systems that can be used in-house only. Whichever is used, versioning to keep track of the latest workable scripts is essential to prevent chaos and insanity from descending on the development environment. A further plea for open source versioning is the greater aspiration of opening up R scripts in healthcare to the wider public, so that everyone can have a go at developing solutions and perhaps we can start to network solutions between trusts.


    There is a greater aim here, which I hope to expand on in a later article: the development of a solutions network. In all healthcare trusts, even outside of the UK, we have similar questions, albeit with different datasets. The building blocks I have outlined above are really a way of standardising in-house development using R, so that we can start to share solutions between trusts and understand each other’s solutions. A major step forward in sharing solutions would be to develop a way of sharing (and by necessity categorising) the questions we have in each of our departments. Such a network of solution sharing in the NHS (and beyond) would require a CRAN (or rOpenSci) type pathway of open source building and validation, as well as standardised documentation, but once this is done it would represent a step change in analytics in the NHS. The steps I have shared above certainly help me in developing solutions, and in re-using what I have already built rather than re-creating solutions from scratch.

    This blog was written by Sebastian Zeki, Consultant Gastroenterologist at Guy’s and St Thomas’ NHS Foundation Trust.

  • From script-based development to function-based development and onwards to package-based development

    At the NHS R Conference, I suggested to people that they should embrace the idea of package-based development rather than script-based work.

    I’m going to talk you through that process, using the simplest of scripts – ‘Hello World’. I’m going to assume that you’re using the freely available RStudio Desktop Edition as the editor for this: other versions of RStudio are likely to be essentially identical. Non R-Studio users may need to revert to more basic principles.

    First let’s write ‘Hello World’ – the simplest R script in the world. Open a new R script file and get your script underway:

    message("Hello World from the NHS-R Community!")

    Save it (for posterity).

    In the conference, we discussed that generally writing functions is more helpful than writing scripts as it gives you greater re-usability. This example is a very trivial one (in fact so trivial as to be nearly redundant).

    So we consider our script carefully and determine – what does it DO? Clearly it’s a way of greeting a person. What if we wanted to greet the person in a different way? What if we wanted to greet a different person?

    So we have defined its purpose, and the parameters that are likely to be useful to others.

    Let’s re-write our script to be more useable.

    We define a function using the function function. You can see a much more detailed tutorial of how to do this here: http://stat545.com/block011_write-your-own-function-01.html

    A function is defined by assigning the result of the function() function to a variable which is the function name. The parameters of function() are our new parameter names in our function.

    It is really important to name your function clearly so people know what it does. Generally use active verbs to describe intent, and a consistent naming scheme. Also choose appropriate and clear names for the parameters. So let’s call our new function greet_person, and we will call our parameters greeting and sender.

    Our new code will look like this. Stick this into a new R script for now and run it:

    greet_person <- function(greeting, sender) {
      message(greeting, " from ", sender)
    }

    Once you’ve run your script you can now call your function from the console:

    greet_person("Hello World", "the NHS-R Community!")

    And of course if you want to use a different greeting we can now change our parameter value:

    greet_person("Welcome", "the NHS-R Community!")

    So far so good.

    But – we’ve had to repeat our sender parameter. What if we know we’re usually going to use that first Hello World greeting; but we just want the option of doing something different if the situation arises?

    We can get around that by supplying default values. In the function() function we can set a value to both greeting and sender using =. Let’s set default values for greet_person:

    greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {
      message(greeting, " from ", sender)
    }

    Now if you want our ‘default’ message you can just call:

    greet_person()


    But you can customise either parameter without having to specify anything you don’t want to change:

    greet_person(sender = "Drew Hill")

    Notice how I’ve specified which parameter I’m supplying? If I don’t do that, R doesn’t know whether we want to specify the greeting or the sender (and assumes you mean the first parameter unless told otherwise). I’d strongly recommend specifying every parameter name to every function you call, unless you have a function which only has one parameter. That means if you (or someone else) changes the function later you’ve been explicit and it won’t get misunderstood.

    OK – so far so good.

    Any sort of software is best to be robust though: so what happens if we abuse our newly created function? The golden rule of creating robust code is ‘fail early, fail cleanly and fail clearly’.

    Our function is clearly designed to use strings. The good news for us is that many things can be converted to strings: so if you provide a number for one of those parameters, it will still work:

    greet_person(sender = 1)

    Instead of “Drew Hill” from our previous example, you’ll see the sender is “1”.

    What if you accidentally sent a vector of names? R will turn those into a concatenated string of names without spaces:

    greet_person(sender = c("Bob", "Jim"))

    Some things however certainly could break this process – so it is really important to check that you can handle the inputs you receive within a function before trying to use them.

    The first thing we need to do is to make sure we are dealing with a character value. We can check that by using the is.character function, which returns TRUE if a given value is a character vector and FALSE if it is not.

    If is.character is false, we want to stop with an error:

    greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {
      if (!is.character(greeting)) {
        stop("greeting must be a string")
      }
      if (!is.character(sender)) {
        stop("sender must be a string")
      }
      message(greeting, " from ", sender)
    }

    We can test how this works by using NULL as a parameter: in real life this happens quite a lot as you try to pass a variable to your new function but forget to set the variable earlier on!

    greet_person(sender = NULL)

    We also know that our function actually isn’t very good at handling vectors of strings (ie where there is more than one name): it will simply shove them all together without spaces. However it works and is perfectly functional. So we have a design decision: do we want to allow that, or not? A third way might be to allow it but to use a warning – perhaps a little over the top in our example, but for complex examples that may make more sense. Whereas stop will halt the code and force you to fix your bugs, the warning() function lets the code continue but tells you to go back and do it better later. Let’s add a warning if there was more than one sender:

    greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {
      if (!is.character(greeting)) {
        stop("greeting must be a string")
      }
      if (!is.character(sender)) {
        stop("sender must be a string")
      }
      if (length(sender) > 1) {
        warning("greet_person isn't very good at handling more than one sender. It is better to use just one sender at a time.")
      }
      message(greeting, " from ", sender)
    }

    If we now called the function with two senders we’d be able to do so but would get politely told that it’s not a good idea:

    greet_person(sender = c("Jim", "Bob"))

    So – hopefully from this you’ve moved from having a script which would only do precisely what you wanted in a single set of circumstances, to having a natty little function which will greet whoever you want, with the type of greeting that you want.

    As an exercise to complete: imagine you work in the NHS-R community welcoming team. You are tasked with sending greetings from the team on a regular basis.

    You used to use a script to do this and had to remember to get the style right every time – but now you sit at your console, run the script containing your function, and call greet_person() on demand.

    Your boss has come to you and urgently wants you to change your way of working. Rather than sending a greeting from the team using just a single team name, he wants you to send the individual names in the greeting from both Jim and Bob.

    Have a think about how you could change the function so that we can cope with multiple senders.

    The greetings will continue as we then think about scaling up the NHS R Community Greetings division in our next installment.

    This blog was written by:

    Dr. Andrew Hill

    Clinical Lead for Stroke, St Helens and Knowsley Teaching Hospitals

  • Installing R and RStudio

    Installation Instructions (Part 1 of 2)

    Windows Users

    To Install R:

    1. Open an internet browser and go to r-project.org.
    2. Click the “download R” link in the middle of the page under “Getting Started.”
    3. Select a CRAN location (a mirror site) and click the corresponding link.
    4. Click on the “Download R for Windows” link at the top of the page.
    5. Click on the “install R for the first time” link at the top of the page.
    6. Click “Download R for Windows” and save the executable file somewhere on your computer.  Run the .exe file and follow the installation instructions.
    7. Now that R is installed, you need to download and install RStudio.

    To Install RStudio

    1. Go to rstudio.com and click on the “Download RStudio” button.
    2. Click on “Download RStudio Desktop.”
    3. Click on the version recommended for your system, or the latest Windows version, and save the executable file.  Run the .exe file and follow the installation instructions.

    Mac Users

    To Install R

    1. Open an internet browser and go to r-project.org.
    2. Click the “download R” link in the middle of the page under “Getting Started.”
    3. Select a CRAN location (a mirror site) and click the corresponding link.
    4. Click on the “Download R for (Mac) OS X” link at the top of the page.
    5. Click on the file containing the latest version of R under “Files.”
    6. Save the .pkg file, double-click it to open, and follow the installation instructions.
    7. Now that R is installed, you need to download and install RStudio.

    To Install RStudio

    1. Go to rstudio.com and click on the “Download RStudio” button.
    2. Click on “Download RStudio Desktop.”
    3. Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.


    Installation Instructions (Part 2 of 2)

    Installing Required Packages

    1. Open RStudio (not R) and go to the “Tools” option on the menu bar (Mac users: choose “Not Now” when asked if you want to install developer tools).
    2. Select the option “Install Packages…”. In the text box for packages, type the following (each package name separated by a space): tidyverse readxl lubridate zoo gridExtra padr

    3. Click the Install button (Mac users: again, choose “Not Now” to dev. tools).

    Text and progress bars should start appearing on your screen. If you are prompted about installing packages which need compilation, type y and press Enter.
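
    The same installation can also be run from the console, as a single install.packages() call:

    ```r
    install.packages(c("tidyverse", "readxl", "lubridate",
                       "zoo", "gridExtra", "padr"))
    ```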

     Note: If the warning and error messages (similar to below) appear, it may be that you do not have administrator privileges enabled. If you have administrator privileges on your machine then you may be able to address this issue with the following steps: Close RStudio, find the RStudio icon again and right click it. Select “Run as administrator”. Confirm that you want this program to make changes to the computer. Repeat the instructions above (Tools –> Install Packages… )

    Warning in install.packages :
    'lib = "C:/Program Files/R/R-3.4.0/library"' is not writable

    Error in install.packages : unable to install packages

    Once the installation has finished, click beside the command prompt in the Console window (it’s the arrow symbol, > , without text afterward). The cursor should be flashing.

    Type (rather than copy and paste) the following, carefully (it is case sensitive):

    tidyverse_logo

    At this point the autocomplete feature should provide the option “tidyverse_logo” beneath the text you are typing:

    Press Enter to select this option.

    Then press Enter to execute the line of code.

    You should see the logo (as below), meaning you have successfully installed the tidyverse!