• NHS-R Member Profile – Robin Hinks

    R is a powerful tool for manipulating health and care data and a lot can be learned from sharing our experiences of using R with others. We bring to you an NHS-R profile from one of our Community members, to share their insider knowledge of using R…

    Robin Hinks
    Research and Policy Officer
    Chartered Society of Physiotherapy

    How did you first meet R?
    While working as a civil service analyst, I was encouraged to undertake self-directed learning in R to conduct statistical and geographic analysis.
    What sort of things do you use R for and what do you love about R?
    Through previous roles – where I have done quality assurance and validation of other research teams’ work – I know the value of a well-documented analytical process, and the dangers of poor record keeping! I love how R allows you to keep all your analysis, research notes and – through R Markdown – reporting in one place.
    What do you hate about R?
    I have a qualitative research background and using R has been my first real exposure to code development. While I found the move from, say, SPSS’ ‘point and click’ environment easy enough, I have found it difficult to get my head round the wider world of code development that surrounds R: learning about pulls, commits, splits and the like has been challenging!
    What are your top tips for using R?
    Start by coding up some tasks you already have a process for elsewhere – e.g. automating some data transformations you’ve previously done in a programme like SPSS or Excel. Working out how to translate a task into R’s environment is a lot easier than starting from a blank slate.
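
    For instance, a calculated column you might previously have built with spreadsheet formulas takes only a few lines in R (a toy sketch with invented data):

    ```r
    library(dplyr)
    library(tibble)

    # A toy illustration: a month-on-month percentage change you might
    # previously have computed in Excel (the data are invented)
    referrals <- tibble(month = c("Jan", "Feb", "Mar"), n = c(120, 150, 135))

    referrals %>%
      mutate(pct_change = round(100 * (n - lag(n)) / lag(n), 1))
    # pct_change: NA, 25, -10
    ```
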
    Can you please name a project where you have used R? Briefly describe what this involves.
    Health data and statistics are reported against a range of geographies that do not easily match up with the political geographies our members might seek to influence – e.g. parliamentary constituencies or local authorities. I’ve used R to develop look-up tables between different geographic areas, and used the leaflet package to map different geographic areas visually, developing simple choropleth and point maps for internal insight work.
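
    A point map of the sort described can be sketched in a few lines with leaflet (the sites and coordinates below are invented for illustration):

    ```r
    library(leaflet)

    # A minimal point-map sketch (sites and coordinates are invented)
    sites <- data.frame(
      name  = c("Site A", "Site B"),
      lat   = c(52.48, 51.50),
      lng   = c(-1.90, -0.12),
      value = c(10, 25)
    )

    leaflet(sites) %>%
      addTiles() %>% # add an OpenStreetMap base layer
      addCircleMarkers(~lng, ~lat, radius = ~sqrt(value), label = ~name)
    ```
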

  • Annotating SPC plots using annotate with ggplot

    Annotating SPC plots using annotate with ggplot

    Statistical Process Control (SPC) charts are widely used in healthcare analytics to examine how a metric varies over time and whether this variation is abnormal. Christopher Reading has already published a blog on SPC charts, and you can also read more about SPC charts here or here.

    Here is a simple example of annotating points and text on SPC plots using the ggplot2 package. We won’t explain all the parameters in the annotate function. Instead, we see this as a short show-and-tell piece with signposts at the end of the blog.

    So let’s get started and generate some dummy data from a normal distribution with a mean of 0 and a standard deviation of 1.

    library(tibble) # provides the tibble() function used below

    set.seed(2020) # set the random number seed to ensure you can replicate this example
    y <- rnorm(30, 0, 1) # generate 30 random numbers for the y-axis
    y <- c(y, rep(NA, 10)) # add 10 NAs to extend the plot (see later)
    x <- 1:length(y) # generate the x-axis
    df <- tibble(x = x, y = y) # store as a tibble data frame for convenience

    Now we can plot the data using the ggplot() function.

    library(ggplot2)

    fig1 <-
      ggplot(df, aes(x, y)) +
      geom_point(size = 2) +
      geom_line() +
      ylim(-4, 4) # increase the y-axis range to aid visualisation
    fig1 # plot the data

    One of the main features of SPC charts is the upper and lower control limits. We can now plot this as an SPC chart with lower and upper control limits set at 3 standard deviations from the mean. Although in practice the calculation of control limits differs from this demo, for simplicity we set the control limits and the mean as fixed numbers. Alternatively, you could use the qicharts2 package to do the SPC calculations and then take the generated ggplot2 object and keep following our steps.

    fig1 <- 
      fig1 +
      geom_hline(yintercept = c(3,0,-3), linetype='dashed') +  # adds the upper, mean and lower lines
      annotate("label", x=c(35,35,35), y=c(3,0,-3), color='darkgrey', label= c("Upper control limit", "Average line", "Lower control limit"), size=3) # adds the annotations
    fig1 # plot the SPC

    Remarkably, we see a point below the lower control limit even though the data are purely pseudo-random: a nice reminder that control limits are guidelines, not hard and fast tests of non-randomness. We can now highlight this data point, which would be flagged as special cause variation but is clearly a false signal.

    fig1 <- 
      fig1 +
      annotate("point", x=18, y=df$y[18], color='orange', size=4) + 
      annotate("point", x=18, y=df$y[18])
    fig1 # plot the SPC with annotations

    We can now add a label for the special cause data point. You can play around with the vjust value (e.g. try -1, 0, 1) to get a feel for what it is doing to the vertical position of the label. There is also an hjust argument, which operates in the horizontal plane.

    fig1 <- 
      fig1 +
      annotate("label", x=18, y=df$y[18], vjust=1.5, label = "Special cause variation", size=3)
    fig1 # plot the SPC with more annotations

    To learn more about the annotate function see:
    https://ggplot2.tidyverse.org/reference/annotate.html
    https://www.gl-li.com/2017/08/18/place-text-at-right-location/

  • Forecasting R (Virtual) Workshop – the pupil’s perspective

    I saw on the NHS-R Community slack channel that the much sought after Forecasting R Workshop led by Bahman Rostami-Tabar was moving to a virtual environment whilst most of the country was shut due to the COVID pandemic. I had previously read the “Forecasting: Principles and Practice” online textbook from cover to cover and learnt the ‘motions’ of time series forecasting but hadn’t really understood the ‘why’. I had previously used black-box-style models (Prophet), and understood how to make them work rather than why they work. I was looking forward to the course cementing some of the fundamentals in theory, and then refining the practice in the software.

    It delivered.

    Looking back through the 12 pages of notes I took to go alongside the excellent content produced by Bahman, we covered a huge amount in six short, two-hour sessions.

    Before even picking up a dataset however, we talked through some of the key factors behind forecasting – the high level process, when we might want to or when we might not want to do it, and in which context it works best.

    The time series forecasting process – prep data, visualise, specify a model, model estimation, fit model, evaluate, forecast output

    Another key item before diving in that we covered were some of the key ‘glossary’ terms that are used, such as the difference between frequency (the time intervals between points in your prediction) and horizon (how far into the future you are going to make your prediction). For some, this might be bread and butter, but adding structure to the language of forecasting really helped me get it clear in my head.

    We spent a good share of the sessions looking at the theory of forecasting, and intermingled this with sessions in R. We moved through baseline predictions using simple average, naïve, seasonal naïve and drift methods (and their corresponding functions) through to exponential smoothing, auto-regressive integrated moving average (ARIMA) and regression.

    Baseline models, what they do, and how to do it in R
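
    As a sketch of what those baseline functions look like in practice, here is a minimal example using the fable package (which accompanies Forecasting: Principles and Practice; the data are simulated, not from the course):

    ```r
    library(fable)
    library(tsibble)
    library(dplyr)

    # Simulated monthly series with a seasonal pattern (invented data)
    set.seed(42)
    ae <- tsibble(
      month       = yearmonth("2018 Jan") + 0:35,
      attendances = 100 + 10 * sin(2 * pi * (0:35) / 12) + rnorm(36, 0, 3),
      index       = month
    )

    ae %>%
      model(
        mean   = MEAN(attendances),        # simple average of the series
        naive  = NAIVE(attendances),       # last observed value
        snaive = SNAIVE(attendances),      # same month last year
        drift  = RW(attendances ~ drift()) # line from first to last point
      ) %>%
      forecast(h = "6 months")
    ```
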

    We also looked at several measures within the forecasting toolbox, such as residual diagnostics and forecasting errors. It was going through these, and being able to ask Bahman to pause and explain some of the concepts in more detail, that really set an instructor-led course in a league above trying to work through a textbook, re-reading the same cryptic passage and still being none the wiser.

    My key takeaway from the learning was that R makes it really easy to have a hands-off approach to modelling (without a calculator in sight!), but picking apart some of the automation meant I was able to convey what was happening back to colleagues from a much better-informed perspective.

    My mind map summary of my key takeaways

    The virtual environment worked well from a pupil’s perspective, with some caveats. First, I found that reading the pre-session work was vital. Although Bahman was very open to going back over content, the pace at which we had to move was quick due to time constraints, so having a sense of the content beforehand really helped slot things in to place.

    Alongside the reading, it was so important to have the lab scripts prepped and ready to go from day one. The NHS-R Community team did a great job of getting the projects set up on RStudio Cloud, which meant that everything was ready to go. But if (like me) you wanted to get it running locally on your own machine, spending the time making sure your setup was working as expected prior to the first session was vital.

    On the whole, I found the course both hugely enjoyable and informative. I am looking forward to integrating all that I learnt into my role as a demand and capacity analyst, finally feeling like I am coming from a perspective of understanding as to what methods to use and how they work. The instructor-led approach meant that I could finally get to grips with ideas that had been mystifying from just reading a textbook, for which I would like to extend my thanks to Bahman, Tom and Alysia for their work on running the course.

  • We need you: share your COVID-19 work on NHSE/I’s regular mini-huddle series

    COVID-19 has placed the value of data and analytics at the fore. This has meant an unprecedented focus on the work of health and care analysts, but also huge demand for local analytical services. The need to share and share alike is paramount.  To support this, NHSE/I are holding a series of huddles and mini-huddles to allow analysts to share their COVID-19 work with a wider community.

    Weekly huddles showcase work on a specific topic, and the back catalogue of recordings can be found here. The mini-huddles are a spin-off series where speakers get into the detail behind a single topic, like geo-mapping or a dashboard. The team curating these huddles is keen to have a mini-huddle on COVID-19 analysis in R. If you are interested in sharing your analysis relating to COVID-19 using R, please get in touch with Sophie Hodges via email at sophie.hodges5@nhs.net.

     If you haven’t already, please sign up to the COVID-19 data and analytics workspace here. The community is currently 6,500-strong with members across the NHS, public health, local authorities, charities and commercial organisations. Mini-huddles usually draw an audience of between 30 and 80 attendees, so it’s a great opportunity to share your work, discuss findings and network with others working on similar issues.

    Please get in touch if this sounds of interest. And a huge thank you from NHSE/I. The work being done by the analytical community is a key plank of the COVID-19 effort.

  • phsmethods: an R package for Public Health Scotland

    by Jack Hannah


    On Wednesday 1st April 2020, Public Health Scotland (PHS) came into being. It was formed as a result of a three-way merger between NHS Health Scotland and the Information Services Division (ISD) and Health Protection Scotland (HPS) sections of Public Health and Intelligence (PHI) (which was in itself a strategic business unit of NHS National Services Scotland (NSS)). It’s fewer acronyms to remember at least.

    The Transforming Publishing Programme (TPP) was devised in 2017 in an attempt to modernise the way in which ISD produced and presented its statistics. Traditionally, work within ISD had been undertaken using proprietary software such as Excel and SPSS, with output presented predominantly in the form of PDF reports. It required a great deal of manual formatting, caused an even greater deal of frustration for analysts with programming skills, and resulted in more copy and paste errors than anyone would care to admit.

    Over time, ISD gradually (and at times begrudgingly) came to embrace open source software – predominantly R and, to a lesser extent, Python. TPP were at the forefront of much of this: creating style guides; working with producers of official and national statistics to convert PDF reports into Shiny applications; producing R packages and Reproducible Analytical Pipelines; and introducing version control, among other things. Now, with the move to PHS complete, TPP’s purview has broadened: not only to further the adoption of R, but to ensure it’s adopted in a consistent manner by the hundreds of analysts PHS employs.

    Introducing phsmethods

    Analysts working across a multitude of teams within PHS have to perform many of the same, repetitive tasks, not all of which are catered for by existing R packages: assigning dates to financial years in YYYY/YY format (e.g. 2016/17); formatting improperly recorded postcodes; returning quarters in plain English (e.g. January to March 2020 or Jan-Mar 2020) rather than the slightly restrictive formats offered by lubridate::quarter() and zoo::yearqtr(). The list is endless, and every analyst has their own workaround, which quickly becomes a problem. Time is wasted by multiple people devising an alternative way of doing the same thing – and how can anyone be sure that everyone’s method actually does the same thing?
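
    As an illustration of the kind of task described (not the phsmethods implementation itself), a financial year in YYYY/YY format can be derived from a date in base R, assuming UK financial years run April to March:

    ```r
    # A minimal sketch, assuming an April-to-March financial year
    # (an illustration, not phsmethods' own code)
    fin_year <- function(date) {
      start <- as.numeric(format(date, "%Y")) - (format(date, "%m") < "04")
      paste0(start, "/", substr(start + 1, 3, 4))
    }

    fin_year(as.Date(c("2016-04-01", "2017-03-31", "2017-04-01")))
    # "2016/17" "2016/17" "2017/18"
    ```
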

    The phsmethods package was conceived to address (some of) these concerns. At the time of writing, it contains eleven functions to facilitate some of the more laborious tasks analysts regularly face, with more in the works. None deal with any statistical methodology, nor are they likely to provoke any controversy or consternation over their methods of implementation; they are simple functions designed to make routine data manipulation easier.

    phsmethods isn’t on CRAN. Perhaps it will be at some point, but there hasn’t been any need to make it available more widely thus far. phsmethods does, however, come with many of the features one would expect from a CRAN package: function documentation; unit tests; continuous integration; code coverage; and releases. No hex sticker yet, but maybe one day.

    Using phsmethods

    Practical examples featuring all of phsmethods’ functions are available in the package’s README and no one will make it to the end of this blogpost if they’re all regurgitated here. But hopefully one demonstration is okay. Consider the following fictitious dataset containing postcodes:

    df <- tibble::tribble(
      ~patient_id, ~postcode,
      1,           "G26QE",
      2,           "KA8 9NB",
      3,           "PA152TY ",
      4,           "G 4 2 9 B A",
      5,           "g207al",
      6,           "Dg98bS",
      7,           "DD37J    y",
      8,           "make",
      9,           "tiny",
      10,          "changes"
    ); df
    ## # A tibble: 10 x 2
    ##    patient_id postcode     
    ##         <dbl> <chr>        
    ##  1          1 "G26QE"      
    ##  2          2 "KA8 9NB"    
    ##  3          3 "PA152TY "   
    ##  4          4 "G 4 2 9 B A"
    ##  5          5 "g207al"     
    ##  6          6 "Dg98bS"     
    ##  7          7 "DD37J    y" 
    ##  8          8 "make"       
    ##  9          9 "tiny"       
    ## 10         10 "changes"

    This is a problem that analysts across the NHS and beyond face regularly: a dataset containing postcodes arrives, but the postcodes are recorded inconsistently. Some have spaces; some don’t. Some are all upper case; some are all lower case; some are a combination of both. And some aren’t real postcodes. Often this dataset is to be joined with a larger postcode directory file to obtain some other piece of information (such as a measure of deprivation). Joining these datasets tends not to be a trivial task; sometimes a fuzzyjoin may suffice, but often it requires an arduous process of formatting the postcode variable in one dataset (or both datasets) until they look the same, before combining using a dplyr join.
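
    Once the postcode columns share a format, the join itself is straightforward (a toy example; the directory and its deprivation column are invented for illustration):

    ```r
    library(dplyr)
    library(tibble)

    # A toy join to a postcode directory (data and the simd_quintile
    # column are invented for illustration)
    patients  <- tibble(patient_id = 1:2, postcode = c("G2 6QE", "KA8 9NB"))
    directory <- tibble(postcode = c("G2 6QE", "KA8 9NB"), simd_quintile = c(3, 1))

    left_join(patients, directory, by = "postcode")
    ```
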

    In PHS, postcodes typically follow one of two formats: pc7 format (all values have a string length of seven, with zero, one or two spaces before the last three digits as necessary); or pc8 format (all values have one space before the last three digits, regardless of the total number of digits). phsmethods::postcode() is designed to format any valid postcode, regardless of how it was originally recorded, into either of these formats. Consider the earlier dataset:

    df %>%
      dplyr::mutate(postcode = phsmethods::postcode(postcode, format = "pc8"))
    ## # A tibble: 10 x 2
    ##    patient_id postcode
    ##         <dbl> <chr>   
    ##  1          1 G2 6QE  
    ##  2          2 KA8 9NB 
    ##  3          3 PA15 2TY
    ##  4          4 G42 9BA 
    ##  5          5 G20 7AL 
    ##  6          6 DG9 8BS 
    ##  7          7 DD3 7JY 
    ##  8          8 <NA>    
    ##  9          9 <NA>    
    ## 10         10 <NA>
    df %>%
      # The format is pc7 by default
      dplyr::mutate(postcode = phsmethods::postcode(postcode)) %>%
      dplyr::pull(postcode) %>%
      nchar() # pc7 values all have a string length of seven
    ##  [1]  7  7  7  7  7  7  7 NA NA NA

    In hindsight, maybe it should have been called something other than postcode() to avoid the whole postcode = postcode(postcode) bit, but everyone makes mistakes when it comes to naming functions.

    phsmethods::postcode() isn’t designed to check whether a postcode actually exists; it just checks whether the input provided is a sequence of letters and numbers in the right order and of the right length to theoretically be one and, if so, formats it appropriately. If not, it returns an NA. Handily, phsmethods::postcode() comes with some warning messages to explain how many values were recorded as NA, why they were recorded as NA, and what happens to lowercase letters:

    warnings <- testthat::capture_warnings(df %>% dplyr::mutate(postcode = phsmethods::postcode(postcode)))
    ## 3 non-NA input values do not adhere to the standard UK postcode format (with or without spaces) and will be coded as NA. The standard format is:
    ## • 1 or 2 letters, followed by
    ## • 1 number, followed by
    ## • 1 optional letter or number, followed by
    ## • 1 number, followed by
    ## • 2 letters
    ## Lower case letters in any input value(s) adhering to the standard UK postcode format will be converted to upper case
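
    The format described in that warning corresponds roughly to the following regular expression (a sketch, not necessarily the exact pattern phsmethods uses), applied after spaces are removed:

    ```r
    # A sketch of the standard UK postcode shape described in the warning:
    # 1-2 letters, 1 number, 1 optional letter or number, 1 number, 2 letters
    uk_postcode <- "^[A-Za-z]{1,2}[0-9][A-Za-z0-9]?[0-9][A-Za-z]{2}$"

    grepl(uk_postcode, c("G26QE", "KA89NB", "make"))
    # TRUE TRUE FALSE
    ```
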

    Next steps

    The seven functions which comprised the 0.1.0 release of phsmethods were all written by members of TPP. However, the package is not intended to be a vanity exercise for the team. Those initial functions were designed to get the package off the ground, but now contributions are open to be made by anyone in PHS. With that in mind, the 0.2.0 release contains four functions written by PHS analysts not part of TPP.

    Unlike many R packages, phsmethods will, at any given time, have two or three maintainers, each with an equitable say in the direction the package takes. When one maintainer moves on or gets fed up, someone else will be sourced from the pool of analysts, and the show will go on. In theory at least.

    There are moderately extensive guidelines for anyone who wishes to contribute to phsmethods to follow. Proposed contributions should be submitted as GitHub issues for the maintainers to consider and discuss. If approved, the contributor makes a branch and submits a pull request for one or more of the maintainers to review. The hope is that, by creating an issue prior to making a contribution, no one will devote time and energy to something which can’t be accepted, and that no duplication of effort will occur in the case of multiple people having the same idea.

    It’s not expected that everyone who wishes to contribute to phsmethods will understand, or have experience of, all of the nuances of package development. Many won’t have used version control; most won’t have written unit tests or used continuous integration before. That’s okay though. The only requirement for someone to contribute code to phsmethods is that they should know how to write an R function. The package maintainers are there to help them learn the rest.

    Hopefully phsmethods will provide a safe incubator for analysts who wish to develop their skills in writing, documenting and testing R functions prior to making their own open source contributions elsewhere. And hopefully the knock-on effect of more analysts developing these skills will be improved coding standards permeating throughout PHS. If not, it’ll at least look good having a package pinned on my GitHub profile.

  • SPC Charting in R

    Divisional Information Specialist, Worcestershire Acute Hospitals NHS Trust

    By Christopher Reading

    For some time the Information Department at Worcestershire Acute NHS Trust has been making use of statistical process control (SPC) charts for internal reporting purposes. This has formed part of our drive toward better decision-making (as recommended in NHSI’s Making Data Count https://improvement.nhs.uk/resources/making-data-count/).

    In doing so we have made extensive use of NHSI’s Excel-based SPC tool and have also sought to implement this methodology outside of the confines of MS Excel (i.e. within our SQL/SSRS-based reporting suite).

    As the Department’s unofficial ‘R Champion’, I have driven efforts to increase my team’s knowledge and usage of R over the last six months. My experience with NHSI’s resources suggested that R offered a route to more efficient, consistent and quickly reproducible SPC charting. I set about developing a charting function within R which would replicate NHSI’s logic and methodology[1].

    I developed and tested a custom function in R which requires two primary arguments: a series of data points, and a set of accompanying date values. The function then creates a data frame containing the data series, its mean and moving average values, and upper and lower control limits. The series is then tested against NHSI’s methodology, and special cause variations are highlighted and colour coded. This formatting is done according to a secondary function argument which identifies whether an increase or decrease in the series indicates system improvement. This data frame is then visualised using ggplot, which displays the SPC and any additional information such as a performance trajectory or national target.
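
    A heavily simplified sketch of that structure might look as follows (this is not the Trust's actual function; the 2.66 moving-range multiplier is the standard XmR convention, and the data are simulated):

    ```r
    library(dplyr)
    library(tibble)
    library(ggplot2)

    # A heavily simplified XmR-style sketch (not the Trust's function):
    # limits are the mean +/- 2.66 times the average moving range
    spc_frame <- function(values, dates) {
      mu  <- mean(values)
      amr <- mean(abs(diff(values))) # average moving range
      tibble(
        date   = dates,
        value  = values,
        centre = mu,
        ucl    = mu + 2.66 * amr, # upper control limit
        lcl    = mu - 2.66 * amr  # lower control limit
      ) %>%
        mutate(special = value > ucl | value < lcl) # simplest special cause test
    }

    set.seed(1)
    df <- spc_frame(rnorm(24, 10, 1),
                    seq.Date(as.Date("2020-01-01"), by = "month", length.out = 24))

    ggplot(df, aes(date, value)) +
      geom_line() +
      geom_point(aes(colour = special), size = 2) +
      geom_hline(aes(yintercept = centre), linetype = "dashed") +
      geom_hline(aes(yintercept = ucl)) +
      geom_hline(aes(yintercept = lcl))
    ```
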

    I then tested the function and compared against our existing SPC reporting. A few logical ‘gremlins’ in the function were identified and subsequently removed, and once I was happy with the function it was integrated into a growing departmental R package (currently only internally available) for use in R Markdown reporting and our expanding R Shiny dashboard repertoire.

    My next step was to use Shiny to create an SPC Wizard app, to enable colleagues without R knowledge to test and utilise the SPC function. The app allows users to supply CSV files containing multiple data series, and generate SPC charts with little or no effort. These can then be exported as image files for Trust reporting. The app allows users to make formatting changes to the chart such as customising main and axis titles, customising the frequency of axis labels and size of point and line geoms (chart objects) for lengthy data series. It also allows users to specify multiple data series at a time to create ‘small multiple’ SPC charts for simultaneous analysis.

    The project provided an excellent challenge in developing my Shiny skills, and provided an opportunity to utilise the visually impressive and professional appearance of the ShinyDashboard package. Development of this Shiny app also led to a challenging project of setting up a Linux based Shiny server, to allow hosting of the app for colleagues to use.

    A key advantage of this function-based approach is that the SPC methodology is now available for use by all analysts within the Department, and can be implemented with a minimum of coding. One of the primary difficulties with SQL based SPC logic encountered by our team was the length of code required to produce the chart data, and therefore the increased risk of error when recycling this code for different reports. The simplicity and self-contained nature of the SPC function avoids this.

    Having successfully tested and embedded the SPC function within an ad-hoc SPC wizard, I have continued to develop a Shiny Performance Dashboard for Cancer KPIs. This rapidly produces SPC charting for 2-Week-Wait Referral and 62-Day Cancer Treatment metrics from live data pulled from our SQL warehouse. I hope this will be the first of many dashboards to take advantage of an easily available and consistent SPC methodology, allowing our Department to create reports and dashboards which are better able to communicate the nature of changing time series to Trust decision-makers, and to track and evaluate the impact of operational management decisions.

    Despite the (at times steep!) learning curve involved, from creating the initial function and replicating NHSI’s SPC logic, to setting up the Shiny server and deploying apps for use, this project has been an excellent way to develop my R skills and to demonstrate the value in embedding use of R within our organisation, and making it part of our toolkit for ‘business as usual’ analysis.

    I hope that next steps for this project will be sharing our methodology with other NHS organisations, to allow further input and development of the methodology and reporting applications. Recently there have been discussions around a collaboration with other NHS Trusts and the Strategy Unit, regarding the possibility of developing an SPC package and shiny app to be available to all NHS organisations. If you would like to learn more or take part in the discussions, please join us on the NHS-R community slack channel (nhsrcommunity.slack.com) and let us know your thoughts on an SPC package, and what you might want to see as part of it!

    [1] For those not familiar with the Making Data Count resources, the SPC tool is based around a moving average measurement of sigma and significant variations in data based on this value. These include the identification of any data points above or below three sigma; sequences of consecutive data points above/below the mean; runs of consecutively increasing/decreasing data points; and two out of three data points greater (or less) than 2 sigma.
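
    Two of those tests can be sketched in a few lines of base R (an illustration, not NHSI's implementation; the moving-range estimate of sigma and the run length of seven are common conventions):

    ```r
    # A sketch of two of the tests described (not NHSI's implementation)
    flag_special_cause <- function(y, run_length = 7) {
      mu    <- mean(y)
      sigma <- mean(abs(diff(y))) / 1.128 # moving-range estimate of sigma
      runs  <- rle(y > mu)                # consecutive points on one side of the mean
      data.frame(
        y             = y,
        beyond_limits = abs(y - mu) > 3 * sigma,
        in_long_run   = rep(runs$lengths >= run_length, runs$lengths)
      )
    }

    flag_special_cause(c(rep(10, 3), rep(11, 8), 10, 25))
    ```
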

  • Our first ever NHS-R webinar!

    We ran our first ever NHS-R webinar on Wednesday 19th February and it went down a storm! Chris Mainey facilitated a session on Database Connections in R which was attended by a peak audience of 72 individuals.

    The webinar began with some Mentimeter questions to get to know more about who was on the line. After an icebreaker question revealed that the majority of people’s favourite TV series is Breaking Bad, we found out that (of those who answered) approximately 25% were Analysts, 40% were Senior Analysts, 15% were Intelligence/Analytical Leads and 20% were in some other role. There was also a broad selection of organisations represented on the line, including approximately 30% from provider trusts, 20% from CSUs, 15% from local authorities and 5% from CCGs. Just over 30% were from other organisations and it would be interesting to delve more deeply into where these individuals are from.

    We also asked about people’s experience of the topic and what they were hoping to get out of the session. When asked what the current level of understanding around database connections in R on the line was, the average score of those who answered was 1.4/5, suggesting that this was a relatively new topic for most individuals. Moreover, regarding what individuals wanted to get out the session, people wanted to: gain a basic understanding of how to make database connections in R, how to make SQL connections, write temp tables and learn tips, tricks and best practice on how database connections can be made.

    Chris then began, explaining the fundamental elements of SQL (Structured Query Language) before highlighting the two common methods for creating database connections – the RODBC package and DBI system. Both can be used to create a connection object which can be used to manipulate or transfer data into R.

    Chris firstly went into more detail about the RODBC package, showing code for creating connections. He then explored DBI in more detail, including: making connections, how SQL can be used with DBI, writing to databases, using tables in the database, constructing dplyr queries and using SQL and returning data into R. He ended his webinar by taking us through an example script in R, which was a great way of putting the learning in context, before giving participants the opportunity to ask more specific questions about database connections in R.
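
    The DBI workflow described can be sketched with an in-memory SQLite database (the webinar covered the same DBI calls against other back ends; RSQLite is used here only so the example is self-contained):

    ```r
    library(DBI)

    # An in-memory SQLite database keeps the sketch self-contained
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "mtcars", mtcars)

    # Run SQL directly and return a data frame into R
    dbGetQuery(con, "SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl")

    # Or build a lazy dplyr query that is translated to SQL for us
    library(dplyr)
    tbl(con, "mtcars") %>%
      count(cyl) %>%
      collect()

    dbDisconnect(con)
    ```
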

    We obtained some fantastic feedback about the webinar. The top 3 words that participants used to describe the webinar were “useful”, “interesting” and “clear”. Moreover, the average rank that participants gave for: satisfaction with the webinar, whether they would recommend to others, relevance in helping them to achieve work goals and increasing their understanding of database connections in R was between 4-5 out of 5.

    We also wanted to understand what other webinar topics participants may be interested in, in future. Shiny, RMarkdown, ggplot2 and time series analysis were some of the most popular suggestions.

    Finally, we would like to thank Chris Mainey for doing a fantastic webinar and to our participants for tuning in, being engaged and asking some great questions! The material from the webinar is available on GitHub here and a recording of the webinar can be found on our NHS-R Community website here.

    Moreover, we are planning to run topical webinars on the third Wednesday of each month between 1-2pm. Our next webinar is taking place on Wednesday 18th March from 1-2pm on “Functional Programming with purrr” led by Tom Jemmett. Keep your eyes peeled on our Twitter page and website for the link to sign up.

     If you are interested in being part of the NHS-R community, please join our Slack channel here to keep up to date with the goings-on, tips and tricks that people have for using R in health and care. Use the #webinars channel to suggest topics for future webinar sessions, or to put your name forward to run your own!

  • Towards open health analytics: our guide to sharing code safely on GitHub

    This post was originally published on medium.com but has kindly been re-posted to the NHS-R community blog.

    In September 2019, our team at the Health Foundation took a big step towards making our analytics more transparent and reproducible: we launched our GitHub Page and by default, made all our code repositories public. Since then, we’ve uploaded more code, welcomed contributions from external collaborators, and added new projects in progress. And another team at the Health Foundation, the Improvement Analytics Unit, has joined us by opening up their GitHub profile too. Along the journey we’ve learned a great deal and so, we’d like to share our experiences to inspire others.

    Launching our GitHub page is part of a wider strategy to make our work more open and collaborative, with the twin aims of tackling bigger issues in analytics and sharing learning. We hope this will help accelerate innovation in analytics and data-driven technologies across the health and care ecosystem.

    Building on others’ work

    As a team, we are lucky to stand on the shoulders of open data science giants. What we’ve achieved so far has only been possible because we can access an incredible range of open-source tools and the support of individuals and communities, who have done lots of hard work already. We owe much of our progress to resources and advice from the NHS-R community, the Turing Way and the wider open data science community on Twitter and Stack Overflow.

    We know this is only the start of our journey towards open analytics, but we’ve been encouraged by the positivity of the data community. We’ve had many inspiring conversations with health care analysts around the UK and have lots of ideas about where we might head next.

    Sharing our insights

    Sharing code is rapidly becoming the norm in many research fields. For example, NHSX — the unit leading on digital transformation in the NHS — has made open code and open-source technology part of its core strategy to create better technology.

    Yet, organisational and cultural barriers still exist for many analysts in health and care, and many struggle to access open-source tools and training. This might keep the NHS from realising the full potential of its data and its analytical workforce.

    To encourage others, we want to share how we got started on GitHub and now routinely share our code. This blog will cover:

    The advantages of sharing code

    Working transparently increasingly enables others to scrutinise, validate and use our work, which might also reduce duplicated efforts (‘analytical waste’). Though some of our projects still contain code for closed-source software and might not be reusable as-is, sharing it creates a permanent record of the analysis and can be a useful starting point for others. And while it has been hard work to get to this point, we have also noticed some unexpected benefits to us as analysts and to how we work as a team.

    It is now much easier to share our work internally, keep up with each other’s progress, and celebrate milestones and coding successes. In addition, knowing that our code is likely to be seen by others has given us fresh motivation to improve suboptimal coding and project management habits (we all have them!) and fired our enthusiasm to present our work and share our skills and knowledge.

    Another great feature of our public ‘code shop window’ is that both internal and external contributors are now automatically and visibly credited for their work.

    Setting up a GitHub profile: a checklist

    When you work with sensitive health care data, sharing code has particular challenges, which we will explore later. However, it’s too easy to get stuck figuring out the basics before you even get to that point.

    To help you get up and running, here’s a checklist similar to ours:

    ✔️ Figure out what type of GitHub account you need. As a team, we needed a paid organisation account. By default, these come with five members included and on top of this, you can add external collaborators to public repositories. Remember, there are discounts for non-profit organisations and academic institutions.

    ✔️ Get buy-in from stakeholders and decision-makers. If you need advice, check out the arguments for collaboration and pooling of analytical resources from The Strategy Unit’s draft Analysts’ Manifesto and our recent report on investing in health and care analytics. What also helped us build a case was gathering advice and agreement on the open-source licensing options, the account budget and set-up, any website copy and our plans to share the news about the launch.

    ✔️ Decide on an open-source code licence. All code shared on GitHub is protected by copyright law by default, meaning that the creator retains exclusive rights. In order for others to use or modify the code, you must add a licence specifying what they can or cannot do with it. Ideally, this will also protect developers and contributors from litigation by limiting liability and warranty. We found Karl Broman’s R package primer, Hadley Wickham’s R Packages book and tldrlegal.com useful to learn about the different licensing options. We chose the MIT Licence for our current projects because we want our code to be widely and easily used by others. However, a different licence might be a better fit for you and your team and might depend on the needs of your organisation.

    ✔️ Create a GitHub account.

    ✔️ Build navigable and user-friendly project templates. The README is the first file in the project someone will read and should cover what the code does, why it is worth looking at, and serve as a manual for the code. README files are written in markdown syntax, which is very easy to pick up if you don’t already use it. Stick to a consistent README structure across projects to make things easier for yourself and others. This README template has worked well for us so far.

    ✔️ Create repositories and start committing code. Before we went public, we found it easiest to start by adding existing code to private repositories. Once we were comfortable, we shared work-in-progress code. For R users, Jenny Bryan’s Happy Git and GitHub for the useR is a good place to get started.

    Optional: Give your account a public face with a GitHub page

    GitHub also provides organisations with the option to host a free public website, which can introduce your team, give an overview of the work or even describe and link to individual projects. Depending on the level of customisation, setting up a GitHub page might require basic knowledge of CSS and HTML (or alternatively, advanced googling skills), but the results are certainly worth it. For us, this involved the following steps:

    ✔️ Design the content and write the copy.

    ✔️ Decide on a theme. There are a number of supported themes to choose from. We used Steve Smith’s Minimal theme and customised it according to our needs.

    ✔️ Create a repository. Call it ‘username.github.io’ and follow GitHub’s guidance to add a theme and publish the website.

    ✔️ Add contents to the README.md file in markdown syntax.

    For inspiration, visit our team’s GitHub page and the repository containing the source code.

    Once we had the basic infrastructure in place, we also needed content for our repositories — the code! As we often work with confidential health care information, a priority was to develop processes that enabled us to share code while maintaining data security.

    Keeping sensitive data safe

    Much of our in-depth analytics work relies on pseudonymised patient data, which we hold and access exclusively on our secure server. Any outputs, for example statistical results or code, are checked before leaving the secure server to ensure they do not contain any confidential information or could be used to re-identify patients.

    This process, also known as Statistical Disclosure Control (SDC), is performed by colleagues who are not involved in the project themselves and who have additional training in data safety. There are comprehensive guidelines on how to perform disclosure checks on statistical results, which were developed by a group of Safe Data Access Professionals. Depending on the complexity of the analysis, this process can take a fair amount of time.

    Principles (and our favourite R tools) for sharing code safely

    In addition to our robust guidelines on preparing statistical results for disclosure checks, we have also developed the following best practice methods to safely open up our code:

    1. Keep server architecture confidential: don’t release absolute file paths

    To minimise any security risks to our secure data server, we make sure our code does not contain any absolute file paths, for example to access raw data. First, we figured these wouldn’t be useful to anyone else and second, it meant we could avoid disclosing the layout of our secure server.

    In practice, we had to find a convenient way to refer to file locations in our scripts without using explicit paths. Unfortunately, this wasn’t as easy as setting a default working directory through an RStudio project (although we would recommend using them anyway), as raw data is often stored in a separate location. We are now using a combination of two approaches:

    • The here package 📦 to refer to files within the project directory. This is preferable to using relative file paths, as it works from within subdirectories and because the file paths will be generated correctly on other operating systems.
    • A separate script containing absolute paths to locations outside the project directory, which can be sourced but is never released from our secure server.
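    The two approaches above can be sketched as follows. This is a minimal illustration rather than our actual code: the file names, the filepaths.R script and the raw_data_dir variable are all hypothetical, and the {here} calls are left as comments in case that package is not installed.

    ```r
    # Project-relative paths with {here} (hypothetical file names):
    # library(here)
    # lookup <- read.csv(here("data", "lookup.csv"))

    # Base R's file.path() builds separators portably in the same spirit:
    rel <- file.path("data", "lookup.csv")  # "data/lookup.csv" on every OS

    # Absolute paths live in a separate script that is sourced locally
    # but never released from the secure server:
    # source("filepaths.R")                 # defines e.g. raw_data_dir
    # raw <- read.csv(file.path(raw_data_dir, "extract.csv"))
    ```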

    2. Keep code clean: don’t refer to raw data or statistical results

    Potentially the biggest confidentiality risk when releasing code is referring to either the raw data or aggregate results in the comments. Although generally considered bad practice, it is easy to fall into the habit of copy-pasting outputs into a comment while working on a particularly tricky bit of analysis. Once disclosive comments are in the code, it’s easy to overlook them later.

    We believe that sticking to good habits right from the start is the best way to manage this risk. Knowing that the code will definitely go through Statistical Disclosure Control has helped us to think twice about what we put in the comments. This, in turn, helps our output checkers as they don’t have to spend as much time going through the code.

    As a result, it’s also motivated us to write more concise, readable and self-explanatory code. A tool that we can highly recommend here is the tidylog package 📦. Simply loading the package will print feedback on dplyr and tidyr operations in the R console. In this way, it often eliminates the need for hard-coded checks of these operations.

    3. Be kind to colleagues: minimise friction during disclosure control

    Releasing code on a regular basis could increase the time our colleagues spend on Statistical Disclosure Control and their overall workload. To minimise this risk, we make reading and checking our code as straightforward as possible. This, of course, starts with the point we made in the last section (no disclosive comments!), but there are several other things we do:

    • A consistent and logical structure for individual scripts and for projects, including splitting analysis workflows into smaller portions rather than working in one enormous script.
    • Increasing readability through consistent formatting and indentation. We found the tidyverse style guide very useful, as well as Colin Gillespie’s guide to RStudio features that can automatically reformat and indent code. We are also experimenting with tools such as the styler package 📦 that can flexibly re-format whole scripts or projects.

    We are still working on finding the right balance between the added value of openness, by frequently releasing and sharing code in active development, and the workload this creates for our output checkers. At the very least, we’d like to partially automate this process in the future, and we’d welcome any suggestions for existing tools that could help us.

    Next steps

    While it has been a great experience so far, we have a long list of things left to figure out. These include:

    • How to best version-control clinical code lists, such as Read or ICD-10 codes to identify a diagnosis of interest, and reference files, such as lookup tables for geographies. Once published, databases such as the University of Manchester’s ClinicalCodes repository are a good option, but we also want to keep track of them internally. In the future, we might wrap these up into data R packages, rather than using CSV files.
    • How to encourage and manage contributions from the wider community. A first step is to add guidelines for external contributors and a code of conduct to each of our repositories.
    • How to find a balance between ensuring that each analysis is self-contained (everything in one place) and avoiding unnecessary duplication of code. Part of the solution will be to move from isolated scripts towards reusable, generalisable elements and to automate some data cleaning and routine analyses (such as the pipeline for Hospital Episode Statistics data we are currently working on).
    • How to capture the computational environment, as well as the code, to move towards better reproducibility.

    We will continue to share our progress. In the meantime, let us know what you think about our GitHub profile and get in touch with your approaches, experiences and the challenges you have faced in sharing code.

    This blog was written jointly with Karen Hodgson (@KarenHodgePodge) and Sarah Deeny (@SarahDeeny). All three of us are part of the Data Analytics team at the Health Foundation.

  • NHSRdatasets meets runcharter

    This post was originally published on johnmackintosh.com but has kindly been re-posted to the NHS-R community blog.


    The NHSRdatasets package made it to CRAN recently, and as it is designed for use by NHS data analysts, and I am an NHS data analyst, let’s take a look at it. Thanks to Chris Mainey and Tom Jemmett for getting this together.

    Load packages and data

    As above, let’s load what we need for this session. The runcharter package is built using data.table, but I’m using dplyr in this main section to show that you don’t need to know data.table to use it.

    library(NHSRdatasets)
    library(runcharter) # remotes::install_github("johnmackintosh/runcharter")

    However, seriously, do take a look at data.table. It’s not as hard to understand as some might have you believe, and a little bit of effort pays off. You can load the runcharter package from GitHub using the remotes package. (I’ve managed to install it on Windows and Ubuntu. Mac OS? No idea, I have no experience of that.)

    data("ae_attendances") # returns the dataset's name; loads the data as a promise
    ae <- ae_attendances   # evaluating the promise brings the data to life
    rm(ae_attendances)     # keep just the shorter name 'ae' in the environment

    That felt a bit glitchy. There has to be a sleeker way to load and assign a built-in dataset, but a cursory Google didn’t turn one up.
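    For what it’s worth, one sleeker option is to assign the lazy-loaded object straight from the package namespace. The sketch below demonstrates the pattern with a base dataset so it is self-contained; the NHSRdatasets line assumes that package is installed.

    ```r
    # One-step assignment, no data()/rm() dance:
    # ae <- NHSRdatasets::ae_attendances

    # The same pattern with a dataset from base R's datasets package:
    m <- datasets::mtcars
    nrow(m)  # 32, and no stray copy is left in the environment
    ```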

    Let’s have a look at the data:

    ## Observations: 12,765
    ## Variables: 6
    ## $ period      <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01, 2...
    ## $ org_code    <fct> RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, RQM...
    ## $ type        <fct> 1, 2, other, 1, 2, other, other, other, 1, other, ...
    ## $ attendances <dbl> 21289, 813, 2850, 30210, 807, 11352, 4381, 19562, ...
    ## $ breaches    <dbl> 2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 1322...
    ## $ admissions  <dbl> 5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, 0,...

    Lots of factors. I’m actually very grateful for this package, as it caused me major issues when I first tried to plot this data using an earlier version of runcharter. I hadn’t considered factors as a possible grouping variable, which was a pretty big miss, as all the facets were out of order. All sorted now.

    There’s way too much data for my tiny laptop screen, so I will filter the data for type 1 departments – the package help gives us a great link to explain what this means.

    type1 <- ae %>% 
      filter(type == 1)

    # plot attendances
    p <- runcharter(type1,
                    med_rows = 13, # median of first 13 points
                    runlength = 9, # find a run of 9 consecutive points
                    direction = "above", # find run above the original median
                    datecol = "period",
                    grpvar = "org_code",
                    yval = "attendances")

    The runcharter function returns both a plot and a data.table/data.frame showing a summary of any runs in the desired direction. (I’m assuming folk reading this have a passing knowledge of run charts, but if not, you can look at the package vignette, which is the cause of most of my commits!)

    Don’t try loading the plot right now, because it is huge and takes ages. If we look at the summary dataframe, we can see 210 rows, a fairly decent portion of which relate to significant increases above the original median value.

    ##      org_code median start_date   end_date  extend_to  run_type
    ##   1:      R0A  21430 2017-10-01 2018-10-01 2019-03-01  baseline
    ##   2:      R1F   3477 2016-04-01 2017-04-01 2017-05-01  baseline
    ##   3:      R1H  28843 2016-04-01 2017-04-01 2019-03-01  baseline
    ##   4:      R1K  11733 2016-04-01 2017-04-01 2019-03-01  baseline
    ##   5:      RA2   5854 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  ---                                                           
    ## 206:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 207:      RLT   6977 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 208:      RQ8   8456 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 209:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 210:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained

    Let’s use skimr to get a sense of this:

    ## Skim summary statistics
    ##  n obs: 210 
    ##  n variables: 6 
    ## -- Variable type:character -----------------------------------------------------------
    ##  variable missing complete   n min max empty n_unique
    ##  run_type       0      210 210   8   9     0        2
    ## -- Variable type:Date ----------------------------------------------------------------
    ##    variable missing complete   n        min        max     median n_unique
    ##    end_date       0      210 210 2017-04-01 2019-03-01 2017-04-01        9
    ##   extend_to       0      210 210 2017-05-01 2019-03-01 2019-03-01        7
    ##  start_date       0      210 210 2016-04-01 2018-07-01 2016-04-01        9
    ## -- Variable type:factor --------------------------------------------------------------
    ##  variable missing complete   n n_unique                     top_counts
    ##  org_code       0      210 210      139 RA4: 3, RDD: 3, RDE: 3, RGN: 3
    ##  ordered
    ##     TRUE
    ## -- Variable type:numeric -------------------------------------------------------------
    ##  variable missing complete   n   mean      sd   p0     p25  p50      p75
    ##    median       0      210 210 9389.8 4317.54 3477 6468.25 8413 11311.25
    ##   p100     hist
    ##  29102 ▆▇▅▁▁▁▁▁

    To keep this manageable, I’m going to filter for areas that have a median of more than 10,000 attendances (based on the first 13 data points).

    high_admits <- p$sustained %>% 
      filter(median > 10000 & run_type == "sustained")
    Then I change the org_code from factor to character, and pull out the unique values. I’m sure there is a slicker way of doing this, but it’s getting late, and I don’t get paid for this…

    I use the result to create a smaller data frame

    high_admits$org_code <- as.character(high_admits$org_code)
    type1_high <- type1 %>% 
      filter(org_code %in% high_admits$org_code)
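    For the record, the factor-to-character-then-unique step can be collapsed into one line of base R, shown here on a small stand-in vector rather than high_admits itself:

    ```r
    # Hypothetical stand-in for high_admits$org_code:
    org_codes <- factor(c("RGN", "RVV", "RGN"))

    # Convert and de-duplicate in one go:
    unique(as.character(org_codes))  # "RGN" "RVV"
    ```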

    And now I can produce a plot that fits on screen. I’ve made the individual scales free along the y-axis, and added titles etc.

    p2 <- runcharter(type1_high,
                     med_rows = 13, # median of first 13 points as before
                     runlength = 9, # find a run of 9 consecutive points
                     direction = "above",
                     datecol = "period",
                     grpvar = "org_code",
                     yval = "attendances", 
                     facet_scales = "free_y",
                     facet_cols = 4,
                     chart_title = "Increased attendances in selected Type 1 AE depts",
                     chart_subtitle = "Data covers 2016/17 to 2018/19",
                     chart_caption = "Source : NHSRDatasets",
                     chart_breaks = "6 months")

    Let’s look at the sustained dataframe:

    ##     org_code median start_date   end_date  extend_to  run_type
    ##  1:      RCB   9121 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  2:      RDD  11249 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  3:      RDE   7234 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  4:      RGN   7912 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  5:      RJ1  12240 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  6:      RJE  14568 2016-04-01 2017-04-01 2018-05-01  baseline
    ##  7:      RJL  11262 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  8:      RQM  16478 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  9:      RRK   9584 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 10:      RTE  11303 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 11:      RTG  11344 2016-04-01 2017-04-01 2018-07-01  baseline
    ## 12:      RTR  10362 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 13:      RVV  12700 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 14:      RW6  22114 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 15:      RWE  12275 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 16:      RWF  11939 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 17:      RWP   9976 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 18:      RXC   9396 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 19:      RXH  12494 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 20:      RXP  10727 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 21:      RYR  11578 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 22:      RCB  10062 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 23:      RDD  12093 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 24:      RDE   7637 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 25:      RGN  11896 2017-05-01 2018-01-01 2018-05-01 sustained
    ## 26:      RJ1  13570 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 27:      RJE  15183 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 28:      RJL  11972 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 29:      RQM  18560 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 30:      RRK  29102 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 31:      RTE  11772 2017-05-01 2018-01-01 2018-05-01 sustained
    ## 32:      RTG  17169 2018-07-01 2019-03-01 2019-03-01 sustained
    ## 33:      RTR  10832 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 34:      RVV  13295 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 35:      RW6  22845 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 36:      RWE  18173 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 37:      RWF  12793 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 38:      RWP  10358 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 39:      RXC  10279 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 40:      RXH  13158 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 41:      RXP  11314 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 42:      RYR  11970 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 43:      RDD  12776 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 44:      RDE  15322 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 45:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 46:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 47:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained
    ##     org_code median start_date   end_date  extend_to  run_type

    And of course, the plot itself


    I haven’t looked into the actual data too much, but there are some interesting little facets here – what’s the story with RDE, RRK and RTG for example? I don’t know which Trusts these codes represent, but they show a marked increase. Of course, generally, all areas show an increase at some point.

    The RGN (top right) and RVV (mid left) facets show the reason why I worked on this package – we can see that there has been more than one run above the median. Performing this analysis in Excel is not much fun after a while.

    There is a lot more I can look at with this package, and we in the NHS-R community are always happy to receive more datasets for inclusion, so please contribute if you can.

  • Don’t Repeat Yourself!

    Writing functions to reduce repetition and improve productivity

    One of the greatest benefits of using R over spreadsheets is that it’s very easy to re-use and repurpose code, for example if we need to produce the same chart over and over again, but for different cuts of the data.

    Let’s imagine that we are trying to create a plot for arrivals to A&E departments using the ae_attendances dataset from the NHSRdatasets package.

    Creating our first plot

    First we want to create a plot for all of England’s A&E departments over the last 3 financial years.

    library(NHSRdatasets) # provides the ae_attendances dataset
    library(dplyr)
    library(ggplot2)

    ae_attendances %>%
      group_by(period) %>%
      # summarise at is a shorthand way of writing something like
      #   summarise(column = function(column))
      # first you specify the columns (one or more) in the vars() function (short
      # for variables), followed by the function that you want to use. You can
      # then add any additional arguments to the function, like below I pass
      # na.rm = TRUE to the sum function.
      summarise_at(vars(attendances), sum, na.rm = TRUE) %>%
      ggplot(aes(period, attendances)) +
      geom_point() +
      geom_line() +
      scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
      labs(x = "Month of Attendance",
           y = "Number of Attendances",
           title = "Attendances to A&E Departments by Month",
           subtitle = "All A&E departments in England")
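    The summarise_at() shorthand described in the comments above can be checked on a tiny toy data frame (the data here is made up purely for illustration):

    ```r
    library(dplyr)

    toy <- data.frame(period      = c("a", "a", "b"),
                      attendances = c(10, NA, 5))

    toy %>%
      group_by(period) %>%
      summarise_at(vars(attendances), sum, na.rm = TRUE)
    # period "a" sums to 10 (the NA is removed), "b" to 5
    ```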

    Creating a second plot

    Now, what if we wanted to run this for just a single trust? We could copy and paste the code, then add in a filter to a specific trust.

    # of course, you would usually more specifically choose which organisation we
    # are interested in! Selecting the first organisation for illustrative purposes.
    # The pull function grabs just the one column from a data frame, we then use
    # head(1) to select just the first row of data, and finally ensure that we
    # convert this column from a factor to a character
    first_org_code <- ae_attendances %>%
      pull(org_code) %>%
      head(1) %>%
      as.character()

    ae_attendances %>%
      filter(org_code == first_org_code) %>%
      group_by(period) %>%
      summarise_at(vars(attendances), sum) %>%
      ggplot(aes(period, attendances)) +
      geom_point() +
      geom_line() +
      scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
      labs(x = "Month of Attendance",
           y = "Number of Attendances",
           title = "Attendances to A&E Departments by Month",
           subtitle = paste("org_code =", first_org_code))

    So, what changed between our first plot and the second? Well, we’ve added a line to filter the data, and changed the subtitle, but that’s it. The rest of the code is repeated.

    Creating yet another copy of the first plot

    Let’s say we want to run this code again and create a plot for another organisation. So again, let’s copy and paste.

    But perhaps at this point we also decide that we want the labels on the y-axis to use comma number formatting, we want to change the dots and lines to bars, and we want to colour the bars in NHS Blue.

    # the scales package has nice functions for neatly formatting chart axes
    library(scales)

    # again, just selecting an organisation for illustrative purposes only.
    # This time, we use tail instead of head to select the final row
    second_org_code <- ae_attendances %>%
      pull(org_code) %>%
      tail(1) %>%
      as.character()

    ae_attendances %>%
      filter(org_code == second_org_code) %>%
      group_by(period) %>%
      summarise_at(vars(attendances), sum) %>%
      ggplot(aes(period, attendances)) +
      geom_col(fill = "#005EB8") +
      scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
      scale_y_continuous(labels = comma) +
      labs(x = "Month of Attendance",
           y = "Number of Attendances",
           title = "Attendances to A&E Departments by Month",
           subtitle = paste("org_code =", second_org_code))

    Now, we want to go back and change the rest of the plots to have the same look and feel. Well, you will have to go back up and change those plots individually, but when there are just 3 plots, so what? It’s easy enough to go back and change them!

    But what if it’s 300 plots? Or, what if those 3 plots are in 3 different places in a very large report? What if those 3 plots are in separate reports? What if it wasn’t just a handful of lines of code we are adding, but lots of lines?

    Creating functions

    This is where we should start to think about extracting the shared logic between the different plots into a function. This is sometimes called “DRY”, for “Don’t Repeat Yourself”. Wherever possible we should aim to eliminate duplication in our code.

    In R it’s pretty simple to create a function. Here’s a really simple example:

    my_first_function <- function(x) {
      y <- 3*x
      y + 1
    }
    This creates a function called my_first_function: you assign functions just like any other variable in R by using the <- assignment operator. You then type the keyword function which is immediately followed by a pair of parentheses. Inside the parentheses you can name “arguments” that the function takes (zero or more), then finally a set of curly brackets, { and }, which contain the code you want to execute (the function’s body).

    The function’s body can contain one or more lines of code. Whatever line of code is executed last is what is returned by the function. In the example above, we first create a new variable called y, but we return the value of y + 1.

    The values that we create inside our function (in this case, y) only exist within the function, and they only exist when the function is called (so subsequent calls of the function don’t see previous values).
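    Both rules can be checked directly with a small throwaway function (scale_up is hypothetical, invented just for this sketch):

    ```r
    scale_up <- function(x) {
      tmp <- 3 * x  # tmp exists only while the function runs
      tmp + 1       # the last expression is the return value
    }

    scale_up(2)     # returns 7
    exists("tmp")   # FALSE: tmp never escaped into the calling environment
    ```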

    We can then simply use our function like so:

    my_first_function(3)

    Which should show the value “10” in the console.

    Converting our plot code to a function

    The first thing we should look to do is see what parts of the code above are identical, which parts are similar but change slightly between calls, and which parts are completely different.

    For example, in our plot above, each example uses the same data summarisation, and the same call to ggplot. We slightly changed how we were displaying our charts (we started off with geom_point and geom_line, but changed to geom_col in the third plot). Let’s go with the chart used in the third version as our base plot.

    The subtitles differ slightly between the 3 plots, but we could extract this to be an argument to the function. So my first attempt at converting this plot to a function might be:

    ae_plot <- function(data, subtitle) {
      data %>%
        group_by(period) %>%
        summarise_at(vars(attendances), sum) %>%
        ggplot(aes(period, attendances)) +
        geom_col(fill = "#005EB8") +
        scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
        scale_y_continuous(labels = comma) +
        labs(x = "Month of Attendance",
             y = "Number of Attendances",
             title = "Attendances to A&E Departments by Month",
             subtitle = subtitle)
    }

    We can now create our first 3 plots as before:

    ae_plot(ae_attendances, "All A&E departments in England")
    # as ae_plot's first argument is the data, we can use the %>% operator to pass in the data like so:
    ae_attendances %>%
       filter(org_code == first_org_code) %>%
       ae_plot(paste("org_code =", first_org_code))
    ae_attendances %>%
       filter(org_code == second_org_code) %>%
       ae_plot(paste("org_code =", second_org_code))

    Now, we’ve managed to remove most of the duplication in our code! If we decide we no longer like the blue bars we can easily change the function, or if we want to switch back to a line chart instead of the bar chart we only have to update the code once; when we re-run our code all of the plots will change.

    Of course, this leads to its own problems: what if we want 3 charts to have blue bars but one to use red? We could either add a colour argument to the function, or we could remove the logic which adds the bars to the chart but keep everything else: then we could just add the bars on at the end (or, create a red function and a blue function; each function would first call the main function before doing its own stuff).
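    The colour-argument option boils down to a default argument on the function. Here is a minimal sketch of the mechanism, using a hypothetical helper rather than repeating the full ggplot code:

    ```r
    # Hypothetical helper: fill defaults to NHS Blue but can be
    # overridden per call, as a colour argument on ae_plot would be.
    pick_fill <- function(fill = "#005EB8") {
      fill  # in ae_plot this value would feed geom_col(fill = fill)
    }

    pick_fill()       # "#005EB8", the default
    pick_fill("red")  # "red", a one-off override
    ```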

    In Summary

    Functions allow us to group together sections of code so they are easy to reuse; they make our code easier to maintain, because we only have to update it in one place; and they reduce errors by limiting the amount of code we have.

    Any time you see yourself copying and pasting code try to remember, Don’t Repeat Yourself!

    Further Reading

    This file was generated using RMarkdown; you can grab the .Rmd file here.

    Hopefully this has been a useful introduction to functions. If you are interested in learning more, then the R4DS book has an excellent chapter on functions.

    Once you have mastered writing functions then you might want to read up on tidyeval: this allows you to write functions like you find in the tidyverse where you can specify the names of columns in dataframes.
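    For a flavour of what tidyeval enables, here is a minimal sketch (the function name `count_by` is our own illustration, not from the post; the `{{ }}` "curly-curly" operator comes from rlang):

    ```r
    library(dplyr)

    # {{ }} forwards a bare column name supplied by the caller into
    # dplyr verbs, just like tidyverse functions themselves do.
    count_by <- function(data, column) {
      data %>%
        group_by({{ column }}) %>%
        summarise(n = n(), .groups = "drop")
    }

    # mtcars has 3 distinct values of cyl, so this returns 3 rows:
    mtcars %>% count_by(cyl)
    ```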

    You may also want to have a go at object-oriented programming, which is covered in the Advanced R book.

  • How NHS-R Community do The Apprentice…

    By Zoe Turner

    One of the tasks on the Apprentice a number of years ago was for the contestants to put on a corporate event, at no small cost to the people attending I might add. It’s a tale often told because one of the contestants was gluten free and no one had accounted for dietary needs amongst the contestants so the poor ‘gluten free lady’, as she was known, was served a fruit salad.

    The reason I’m apparently going off on a tangent so early in a blog is that it struck me that the Apprentice is all about throwing people in at the deep end and seeing how they cope. It’s entertainment, but it clashes with the premise that these are potential apprentices to a ‘British business magnate’ (as Wikipedia calls him). Contrast this with NHS-R and how I came to be attending the Train the Trainer event at the end of 2019 and then helped to run the first of 3 local courses this January, having only just started learning R around 2 years ago.

    Analysts have many expectations made of them. They have to be technical, able to interpret data and communicate themselves clearly to non-analysts. Very rarely though will an analyst be expected to train others. Some may produce or present some training to support or mentor fellow analysts, and even then my experience has always been on the receiving end. Coupled with the fact that I’ve never really had a burning desire to teach, it was a surprise to find myself on a course on how to deliver the NHS-R ‘Introduction to R’ workshop.

    The reason I did it is that my involvement with NHS-R has led to this natural consequence of training others. I started with attending the course myself, then presented at the conference and facilitated an Introduction Course run by NHS-R but hosted by my Trust. I then didn’t hesitate in agreeing to taking on the training.

    NHS-R Community held their first two-day Train the Trainer event in Birmingham organised through AphA (Association of Professional Healthcare Analysts). I was supported to go on this by my manager, Chris Beeley, who is a huge advocate of R and Shiny. Whilst he himself has run several workshops over the years I, notably, have run zero!

    At the TtT (got to get an acronym in there) I had the opportunity to meet lots of motivated people from around the British Isles who were as keen as I was, not only to learn how to teach R but also to talk about data – that happened quite a bit in the breaks. We had an opportunity to present to each other, and that was useful as I learn especially from watching others. Everyone has their own style and it gets perfected over time but I was hugely impressed by how engaging people were and how quickly they could read about a subject that was new to them (we looked at the RStudio presentation slides https://education.rstudio.com/teach/materials/) and then go on to express clearly what they’d just learned.

    I could go on about what I learned at the course, but the proof of its value is in what I did with it. And so on 17th January, Chris and I held a workshop for 15 people from various organisations; some had travelled as far as London and Portsmouth, such is the commitment to learn R. Chris led the workshop and I did the live coding/support role which took the edge off that ‘first time’ feeling.

    This idea of having at least two people in a workshop is a good one, even when the trainer is very experienced. Chris, for example, is very used to running workshops alone, but inevitably people get stuck or things don’t work as they should and so I did the helping while Chris moved between training and coding. It felt, to me at least, like the burden was shared. It helped to ensure that no-one in the group was allowed to fall behind so far that they just gave up.

    Chris and I had gone through the slides beforehand as he’d not gone on the TtT course, and having not written them himself, I wanted to make sure he’d know what was covered. What reassured me was that, as he presented the course, there wasn’t a part of it that I didn’t understand myself and couldn’t cover if I had to take over at that moment. And so the day wasn’t as nerve-racking as I anticipated, and I had fun – which must have been noticeable to the group, as I had an email commenting on how we are clearly a happy team!

    Whilst I haven’t actually run a workshop, I think the process I’ve gone through to get to this point has certainly built up my confidence to do it. I’ve taken every opportunity NHS-R community has offered, from doing the introduction to presenting at the conference, and so this next thing – to run the training myself – hasn’t been so scary after all. I feel like a happy and well-supported apprentice of NHS-R, and the great thing about NHS-R in this is that everyone can be an apprentice too – you just have to get involved.

    Being open-source, all the slides for the trainer course and the introduction course are available on GitHub:

    Train the Trainer course materials can be found at:


    Course materials for the underlying Intro to R course are found at:


    Zoë Turner, Senior Information Analyst, Nottinghamshire Healthcare NHS Foundation Trust.

    @Letxuga007 and curator of @DataScienceNotts (formerly known as @AppliedInfoNotts)

  • Forecasting R

    By Bahman Rostami-Tabar

    Democratising forecasting project

    This initiative, sponsored by the International Institute of Forecasters, provides cutting-edge training in the use of forecasting with R in developing countries. There is no doubt that many people in developing countries cannot afford the high fees to attend forecasting workshops. The project aims to make such training accessible to those people, to provide up-to-date teaching on the principles of forecasting, and to create a network of forecasters who conduct forecasting research with social impact for less developed countries. The training has been delivered in Tunisia, Iraq, Senegal, Uganda, Nigeria, Turkey and Indonesia so far.

    We are expanding the “Democratising Forecasting” initiative to deliver four forecasting workshops per year for the National Health Service (NHS) in the UK, in collaboration with the NHS-R Community. We give our time for free, despite the argument that such organisations could afford to pay for this training.

    Why forecasting for the NHS

    Ensuring life-saving services are delivered to those who need them requires far more than money, infrastructure and scientific progress. Accurate modelling and forecasting systems can assist the critical decisions that drive policy making, funding, research and development in the NHS. Decision makers make decisions every day, with or without forecasts; those decisions are more robust and better informed in the light of what could happen in the future, and that is where forecasting becomes crucial. However, implementing forecasting principles to support the decision-making process requires significant technical expertise. To that end, we aim to organise a two-day workshop teaching participants how to apply the principles of accurate forecasting using real healthcare data.

    We are offering four workshops in 2020, as below:

    • Monday 24th & Tuesday 25th February 2020 – Nottinghamshire – Workshop fully booked
    • Monday 22nd June 2020 – Huddersfield – Online (https://www.eventbrite.co.uk) – workshop is fully booked
    • Thursday 13th & Friday 14th August 2020 – Gloucestershire
    • Wednesday 13th & Thursday 14th October 2020 – West Midlands
    • Thursday 12th & Friday 13th November 2020 – Cardiff

    Workshop instructor

    Dr. Bahman Rostami-Tabar is the main instructor for the workshops; colleagues from other universities in the UK may join him occasionally. Bahman is a Senior Lecturer (Associate Professor) in management science at Cardiff Business School, Cardiff University, UK. He is interested in the use and implications of forecasting in social-good areas such as health and humanitarian operations. He has been involved in various forecasting-related projects with NHS Wales, the Welsh Ambulance Service Trust and the International Committee of the Red Cross. He is currently working on projects that focus on developing innovative forecasting methodologies and their links to decision making in health and humanitarian operations.

    Who should attend

    This workshop is for you if you are:

    • A decision maker who wants to use forecasting tools and techniques in R to empower decision making;
    • A data analyst who wants to gain an in-depth understanding of the forecasting process;
    • A forecaster who wants to learn how to use R for forecasting purposes.

    What participants will learn in the workshop

    Assuming a basic knowledge of statistics, participants will be able to carry out the following tasks when using forecasting to support the decision-making process in the real world:

    • Determine what to forecast according to a forecasting process framework;
    • Prepare and manipulate data using functions from base R and the tidyverse and lubridate packages;
    • Identify systematic patterns using the time series toolbox in ggplot2 and forecasting-related packages such as the forecast package;
    • Produce point forecasts and prediction intervals using functions from forecasting-related packages such as forecast and fable, as well as user-defined functions;
    • Determine the accuracy of forecasting models using statistical accuracy measures for point forecasts and prediction intervals;
    • Visualise, export and report results for interpretation and insight using RMarkdown.


    • Basic knowledge of statistics is assumed, e.g. “I know what a normal distribution is”;
    • Basic knowledge of R is assumed, e.g. “I know what data types and data structures are”, “I know how to use a function”;
    • No knowledge of forecasting is assumed.


    Start: 09:30 a.m.

    End: 04:30 p.m.

    Refreshment breaks:

    • Morning: 11:00 – 11:20
    • Afternoon: 03:00 – 03:20 p.m.

    Lunch: 12:30 p.m. – 01:30 p.m.


    Title: Essentials to do forecasting using R

    Date: Two weeks before the workshop (TBC)

    Day 1

    1.1. Forecasting and decision making: What is forecasting? How is the forecasting task different from other modelling tasks? What is the link between forecasting and decision making? How do we identify what to forecast?

    1.2. Data preparation and manipulation: How do we prepare data for the forecasting task? How do we clean data? How do we manipulate data to extract time series?

    1.3. Time series patterns and decomposition: What in the data could be used for the forecasting task? How do we detect systematic patterns in the data? How do we separate out non-systematic patterns?

    1.4. Forecaster’s toolbox: How do we use time series graphics to identify patterns? What is a forecasting benchmark? Which simple forecasting methods can be used as benchmarks? How do we generate point forecasts and prediction intervals using simple forecasting methods?

    1.5. Forecast accuracy evaluation: How do we know if a forecasting method captures the systematic patterns in the data? How do we judge whether a forecast is accurate? How do we evaluate the accuracy of point forecasts and prediction intervals? Why do we need to distinguish between fitting and forecasting?

    1.6. Exponential smoothing models: What is the exponential smoothing family? Which models are available in this family? What do they capture? How do we generate point forecasts and prediction intervals using exponential smoothing models?

    Day 2

    2.1. ARIMA models: This is another important family of forecasting models. What is the ARIMA framework? Which models are available in this family? What do they capture? How do we generate point forecasts and prediction intervals using ARIMA models?

    2.2. Regression: We also look at causal techniques that consider external variables. What is the difference between regression, exponential smoothing and ARIMA? How do we build a simple regression model?

    2.3. Special events: In addition to using the systematic patterns in a time series, we discuss how to include deterministic future special events, such as holidays and festive days, in models.

    2.4. Forecasting by aggregation: How do we forecast when we have a high-frequency time series, e.g. daily, but need a low-frequency forecast, e.g. monthly? How do we generate forecasts for items with a hierarchical or grouped time series structure?

    2.5. Forecasting by combination: How do we use an ensemble of forecasting approaches? Under which conditions are ensemble forecasts better than individual methods?

    2.6. Forecasting for many time series: What is the best approach to forecasting many time series? How do we classify items for forecasting? How do we forecast them? How do we report the accuracy?


    Workshop Booklet:

    • Materials will be provided for the workshop in RMarkdown.


    Bahman Rostami-Tabar

    Associate Professor in Management Science, Cardiff University, UK


  • The first ever train the trainer class 2019

    The Strategy Unit, AphA (Association of Professional Healthcare Analysts) and NHS-R joined forces to deliver the first ever train the trainer (TTT) class on the 11th–12th December 2019 in Birmingham. We allocated spaces for two AphA members per branch and in the end we had 18 people from across the UK NHS, including clinical and non-clinical staff who worked for commissioners, providers or national NHS bodies.

    As this was the first ever TTT course, we were quite unsure how things would turn out, especially because teaching R is not easy. One key issue is that trainers have to think about the background of novices to R – this can range from no programming experience at all to years of SQL experience. Two other challenges with R are that there are many ways of doing things (e.g. using base R vs tidyverse) and installation can be tricky for some.

    To address these challenges, we designed the course with considerable emphasis on evidence-based teaching practice and plenty of learning by doing and sharing, underpinned by the use of liberating structures techniques to capture the voice of delegates. The aim was to give delegates the skills, confidence and capability to deliver an introduction to R for those in healthcare, based on a field-tested slide deck.

    The class was fabulous – they were engaged and enthusiastic, learned a great deal from each other and the facilitators and made great progress – https://twitter.com/NHSrCommunity/status/1205165380039299076/photo/1. Indeed it was inspiring to see how NHS staff can work effectively together, despite being from different organisations. Our first class have self-organised into an action learning set who will: train others, learn from each other, feedback their learning into the NHS-R Community, and have an annual TTT Class of 2019 reunion!

    Whilst there are areas where we would refine the TTT, we were pleased with the first iteration and feel confident that we have a committed and capable set of trainers who can deliver the Introduction to R for those in healthcare at locations near you.

    If you are interested in hosting an Introduction to R training session or being involved in future iterations of the TTT, please get in touch via nhs.rcommunity@nhs.net and we will see what can be done (no promises).

    This blog was written by Professor Mohammed A Mohammed, Principal Consultant at the Strategy Unit

  • NHS-R Community Conference II

    My journey to work takes me about an hour and a half, and I catch a couple of buses with Wi-Fi which means I can browse Twitter and invariably end up with hundreds of tabs open as I flit between articles and blogs. Most mornings I find it hard to concentrate on reading through entire articles, especially the really long ones, so I leave the tab open on my computer, often for days, before reading them. Given my experience of reading blogs, why would anyone want to read through mine about the NHS-R Community conference?

    If I’d gone to the conference I’d probably skim that paragraph thinking ‘yes, I went, I know how good it was’.

    If I’d not gone to the conference I’d probably skim that paragraph because I might prefer not to know just how great a conference was when I’d missed it!

    Even though the conference was moved to a bigger location to accommodate more people and around 250 people attended, I have still spoken to people who didn’t get a ticket or missed submitting an abstract to speak. People who never made the conference are talking about an event that is only in its 2nd year. What is going on? What is it that has made the event so successful?

    Organising an event of any size takes a lot of work and that is often overlooked. There were the core people who did the real work – the arrangements – and quite frankly, they made it look easy, which itself is an indication of how hard they worked. But there were others who were part of a committee that chipped in with bits they could help with: setting up a specific email; reading through abstracts; suggesting things the organisers might consider, like how to ensure diversity of questioners (https://oxfamblogs.org/fp2p/how-to-stop-men-asking-all-the-questions-in-seminars-its-really-easy/).

    That organising committee was made up from a group who have shown a particular interest in R, and as such I found myself part of that group. Now although I have submitted a few blogs to NHS-R, I only really started using R a couple of years ago. Deep down I’m still a SQL analyst and my contributions to the conference were pretty minimal, but I feel encouraged to make those small contributions (even that last one about who gets to ask the first question in seminars) and each small involvement builds up to a bigger thing. This really is feeling like an equal and inclusive group and that’s where I think this success is coming from.

    It may have been by design or it may be a happy accident, but there is a crucial clue in the name of this group that gives away its success – Community. This conference wasn’t managed top-down. There are some key people, of course, but they are as much a part of this Community as the people who contribute to the blogs, those that stood up on stage and showed their work, and those that will be learning to run the R Introduction training. This is our NHS-R Community.

    If you missed this year’s conference and want to go to the next one, get involved. The more people involved, the less work there is for everyone individually. Plus, given that tickets this year ran out in just 2 hours, you’ll be more likely to secure yourself a ticket.

    Speaking of which, provisional dates for the next conference are the 2nd and 3rd November 2020 (Birmingham). Now aren’t you glad you read this blog!

    Zoë Turner, Senior Information Analyst @AppliedInfoNott @Letxuga007

  • Welcome to Leeds

    R Users Leeds

    Why create a group?

    Trying to learn R is tricky: there are so many ways to do things that I often ended up Googling everything and still not feeling like I was doing anything the right way.

    Then someone sent me a link to the NHS-R Community website where I found lots of interesting and helpful information. Plus there were R groups, but not one in Leeds. I was surprised by this given the density of health analysts and other R users in and around Leeds. There should definitely be a group in Leeds.

    How to create a group?

    So I ran my idea past a colleague to check if I was crazy. They agreed to help anyway and offered to email some contacts at the University of Leeds. I approached the Sheffield-R group and the NHS-R community (who also thought it was a good idea). We also connected with NHS Digital and NHS England and Improvement.

    Through this process we were very lucky to find some clever individuals who had already thought about starting up a group in Leeds. This meant the group already had a website (R Users Leeds), Twitter, email and logo. This gave us a great starting point, with a planning team covering Leeds Teaching Hospitals NHS Trust, NHS Digital, the University of Leeds, Leeds Beckett University and Sky Betting and Gaming. We are very lucky in Leeds to have access to both local and national NHS organisations, academic institutions and digital and technology companies.

    What do we want from a group?

    From the beginning it was important to think about the aims and ethos of the group which for Leeds are:

    • Create an open and inclusive community
    • Meetings should be free
    • Organise meetings within and outside office hours to ensure everyone has the opportunity to participate.
    • Include the NHS, academia and other industries using R

    Where will the group meet?

    Another consideration was how to get venues, especially at low or no cost. After making contacts I was assured that I could get things for free; I wasn’t too sure, but thought that I should give it a go. This is going to be an ongoing task, but I have been surprised by how many organisations want to help. Following a first planning meeting we managed to organise speakers and get a date and venue for the first meeting:

    5th December 2019, NHS Digital, Leeds

    • Reproducible wRiting with RMarkdown. Ed Berry, Senior Data Scientist, Sky Betting & Gaming
    • Efficient WoRkflows getting more done with R. Robin Lovelace, University Academic Fellow, University of Leeds
    • Putting the R into Reproducible Research. Anna Krystalli, Research Software Engineer, University of Sheffield

    R we nearly there yet?

    Our first event is almost upon us, so I suppose we have nearly reached the first destination. With tickets booking up quickly it is clear to see that there is a need for a group in Leeds to bring the analytical community together. Thank you to everyone who has supported us so far; hopefully this is just the beginning of the journey.

    If you would like to speak at R Users Leeds please get in touch.

    You can keep in touch with us on Twitter, GitHub, Gitter and email.

    R Users Leeds

    This blog was written by Louise Hick, Real World Data Analyst, the Leeds Teaching Hospitals NHS Trust
  • A new kid on the NHS-R block

    My journey into data science took a baby step a few years ago when I learned SQL during my global health work overseas. A dearth of good data and data sets is every global health worker’s nightmare.

    Realising R in data science is the next best thing since sliced bread, I decided to give it a go. My initial introduction was aided by my daughter who volunteered to teach me during her summer break from UCLA. That was a terrible idea. We didn’t speak to each other for days. (It was worse than learning to drive from your dad!) My self-esteem took a nosedive.

    I searched through various resources such as YouTube, MOOCs, and edX, which only confused me further. Then I stumbled across the NHS-R Community website and voilà!

    No more learning about irises, motorcars in the USA, or global flight statuses. The NHS-R Community provided relevant teaching materials presented in the simplest possible way. The gapminder data set is a godsend for any global health worker. I enthusiastically attended all the R workshops I could. R was becoming an addiction.

    Last week, I attended NHS-R Community conference in Birmingham. I am pleased to say this conference was worthy of the investment of my time, money and energy. All the ‘biggies’ in R were there to facilitate you through your anxieties (although, sadly, no Hadley Wickham!).

    Meeting absolute novices was a great boost for my previously dented self-esteem. Helping these novices made me realise how it is actually possible to learn more by teaching and by finding ways to explain concepts for others to understand.

    My biggest achievement is converting my husband (a senior consultant in the NHS) to R. He went from rolling his eyes at my NHS-R-venture in early summer to signing up to NHS-R community this week: a long stride in a short time. I have managed to corrupt his hard drive with R. Let’s see how far he is willing to go!

    I have come across quite a few reluctant R-ists like my husband in the NHS. So now the question is: how do we make a success of converting people at various stages of their careers into using a wonderfully useful open-source resource like R?

    Melanie Franklin, author of ‘Roadmap to Agile Change Management’, puts across the core principle of creating small, incremental changes to be implemented as soon as possible so that organisations can realise benefits sooner rather than later and achieve rapid return on investment. This is precisely what the NHS-R Community can do.

    An understanding of behaviours and attitudes is key for change implementation. When a change is sought in an organisation, there is a period of ‘unknown’ amongst the staff. Questions are asked. Do we learn completely new things? Do we unlearn what we had learned before? Will this change make us redundant? This anxiety can be age-related. While younger members are ready to embrace change, senior staff may resist change and become an obstacle. Encouraging people at every level to participate in taking that initial step is imperative, as is encouraging them to express their anxieties. This approach should not just be top-down or bottom-up. Making a mistake is part of the process, and creating a no-blame culture within an organisation is essential, as is building an environment that supports exploration and rewards and celebrates participation in new experiences.

    Health data science is a mélange of competences: not just a group of statisticians and data analysts. A data analyst will no doubt do a great job in analysing your data, but will they ask the right questions in the right context in an environment like the NHS? This space belongs to the health care workforce. As Mohammed rightly pointed out in Birmingham: ‘Facilitate an arranged marriage between the two and hope for a happy union’. Lack of time to practise is one of the biggest challenges many of my fellow learners cited. Unfortunately, there is no simple solution. Just keep trudging along and you will get through this steep learning process.

    Finally, I would like to propose a few suggestions to the NHS-R Community (if they are not already being thought about).

    • Streamlining the training.

    Many groups up and down the country are providing training in R, but these fantastic training opportunities are few and far between. The NHS-R Community is a great body to ensure that the training programmes become more frequent and streamlined.

    • Standardising the training.

    Diverse levels across these training programmes mean varied levels of R competence. The NHS-R Community could set up levels of competence (Levels I to III, for instance) so that people can advance their level in subsequent workshops, which may motivate them to find time to practise. An online competency exam for each level may be another way to build the capacity of the NHS workforce.

    • Accreditation and validation of R skills. This is an eventuality just like any skill in health care.

    Happy R-ing!!

    I am hoping to up my game to assist overseas researchers who don’t have the means and mentors to learn R. There are many synthetic data sets available for them to learn and practise with. I will continue learning, as there is never an end to learning new things.

    Nighat Khan is a global ehealth worker based in Edinburgh

  • NHS Number Validation

    NHS Numbers – the backbone of the NHS

    NHS Numbers are the best. They follow you through life, from birth to death, changing only in certain situations, and they are easy to check for validity from the digits themselves.

    What? You can check to see if a number is a genuine NHS Number without any other information?!

    Now, I’ve worked in the NHS for more years than I care to count and I never realised there was an algorithm to these numbers. I stumbled across this fact through Twitter of all places. Yes, Twitter, the place you’ll find NHS-R enthusiasts tweeting things like this:

    https://twitter.com/sellorm/status/1171858506057682944 which I saw retweeted by @HighlandR (John MacKintosh).

    I’ve only just started doing things on GitHub, and @Sellorm wasn’t very close in guessing there may be millions of packages: in fact there are only 2 for R and, surprisingly, not a single one in SQL.

    I installed the package, had a quick play around and looked at the code on GitHub. The downside of this package, for me, was that you need to feed in your data as vectors and I like using data tables. Plus, I didn’t necessarily want a list of outputs like TRUE, TRUE, FALSE, TRUE; I wanted to see the NHS numbers that aren’t valid. Still, I wouldn’t have got so far, so quickly, without @Sellorm’s code, excellent notes and thoroughly written Readme file, so the moral of the story is: even if a package doesn’t work for your task or problem, you may still be able to use parts of its code.

    The Readme on https://github.com/sellorm/nhsnumber, which you can see when you scroll down past all of the file-directory-looking things, included a link to the Wikipedia page on NHS Numbers and their algorithm check: https://en.wikipedia.org/wiki/NHS_number. Again, I had no idea this existed!

    Wikipedia, like R, is open source, and one of the links was out of date. I’d found a really useful document on NHS Numbers a while back (https://www.closer.ac.uk/wp-content/uploads/CLOSER-NHS-ID-Resource-Report-Apr2018.pdf), so, in the spirit of open source, I updated the Wiki page. Proud moment!

    The algorithm is quite simple and another package on GitHub https://github.com/samellisq/nhsnumbergenerator generates numbers using it but I didn’t spend too long on these packages as I decided to do my own script, nothing fancy, no loops, functions or packages…

    Validity <- df %>%
      mutate(length = nchar(NHSNumber),
             # multiply each of the first nine digits by a weight, from 10
             # down to 2 (F is skipped as a column name because, in R, F is
             # shorthand for FALSE)
             A = as.numeric(substring(NHSNumber, 1, 1)) * 10,
             B = as.numeric(substring(NHSNumber, 2, 2)) * 9,
             C = as.numeric(substring(NHSNumber, 3, 3)) * 8,
             D = as.numeric(substring(NHSNumber, 4, 4)) * 7,
             E = as.numeric(substring(NHSNumber, 5, 5)) * 6,
             G = as.numeric(substring(NHSNumber, 6, 6)) * 5,
             H = as.numeric(substring(NHSNumber, 7, 7)) * 4,
             I = as.numeric(substring(NHSNumber, 8, 8)) * 3,
             J = as.numeric(substring(NHSNumber, 9, 9)) * 2,
             # the tenth digit is the check digit
             End = as.numeric(substring(NHSNumber, 10, 10)),
             Total = A + B + C + D + E + G + H + I + J,
             Remainder = Total %% 11,
             Digit = 11 - Remainder,
             # a result of 11 maps to a check digit of 0; a result of 10
             # means the number is invalid, flagged here with 999
             Summary = case_when(Digit == 10 ~ 999,
                                 Digit == 11 ~ 0,
                                 TRUE ~ Digit),
             # valid only if the calculated digit matches the check digit
             # and the number is 10 digits long
             Valid = case_when(Summary == End & length == 10 ~ TRUE,
                               TRUE ~ FALSE)) %>%
      filter(Valid == FALSE)  # keep only the invalid numbers

    Importing data

Data can be imported into R via Excel but, in my case, I used a SQL connection, which worked better as a couple of our systems hold many hundreds of thousands of NHS Numbers and I wanted to check them all. Excel might have got a bit of indigestion from that. It also meant I wasn’t storing NHS Numbers anywhere in order to run them through R. My systems are secure, but it’s always worrying having such a large amount of sensitive data in one place and outside of an even more secure system like a clinical database.

Data doesn’t always import into R the same way, and for mine I needed to remove NULLs and convert the NHSNumber column, which was a factor, into character and then numeric format:

    df <- data %>% 
      filter(!is.na(NHSNumber)) %>% 
      mutate(NHSNumber = as.numeric(as.character(NHSNumber)))


Factors are a new thing for me but, as I understand it, they put data into groups, but in the background. For example, if you had male, female and other, these would be 3 factor levels; if you joined that to another data set which doesn’t happen to have “other” as a category, the factor would still linger around in the data, appearing in a count as 0, because the table still holds the factor information but has no data in that group.

    To manipulate factor data I’ve seen others change the format to character and so that’s what I did. I’ve done similar things in SQL with integers and dates; sometimes you have to change the data to an ‘intermediate’ format.
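A quick illustration of why that intermediate character step matters (my own made-up values, not from the original post):

```r
f <- factor(c("123", "456"))

# Converting a factor straight to numeric returns the level codes...
as.numeric(f)                # 1 2

# ...while going via character recovers the values themselves
as.numeric(as.character(f))  # 123 456
```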

    Validity code – explanation

    This code uses dplyr from tidyverse to add columns:

    Validity <- df %>% 
      mutate(length = nchar(NHSNumber), 

    An NHS number must be 10 characters and nchar() reminds me of LEN() in SQL which is what I would use to check the length of data.

    One thing I didn’t code, but I guess people may want to check, is that all characters are numeric and no letters have been introduced erroneously. That’s something to consider.
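One way to add that check (my own sketch, not part of the original script) is a regular expression that only matches strings of exactly ten digits:

```r
# TRUE only for strings made up of exactly ten digits
is_ten_digits <- function(x) grepl("^[0-9]{10}$", x)

is_ten_digits("1234567890")  # TRUE
is_ten_digits("12345A7890")  # FALSE: contains a letter
is_ten_digits("123456789")   # FALSE: too short
```

This also covers the length check in one go, since anything shorter or longer than ten characters fails to match.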

         A = as.numeric(substring(NHSNumber,1,1)) *10,
         B = as.numeric(substring(NHSNumber,2,2)) *9,
         C = as.numeric(substring(NHSNumber,3,3)) *8,
         D = as.numeric(substring(NHSNumber,4,4)) *7,
         E = as.numeric(substring(NHSNumber,5,5)) *6,

I didn’t use F as it’s short for FALSE, which changed the meaning and the colour of the letter in the script editor. It can be used as a column name, but I left it out for the aesthetics of the script!

         G = as.numeric(substring(NHSNumber,6,6)) *5,
         H = as.numeric(substring(NHSNumber,7,7)) *4,
         I = as.numeric(substring(NHSNumber,8,8)) *3,
         J = as.numeric(substring(NHSNumber,9,9)) *2,

    There is possibly a much smarter way of writing this, perhaps a loop. Samuel Ellis’ package nhsnumbergenerator creates a table using sequences:

checkdigit_weights = data.frame(digit.position = seq(from = 1, to = 9, by = 1),
                                weight = seq(from = 10, to = 2, by = -1))

Note that he uses = rather than <- for assignment.

    I like this code. It’s very concise, but it just was easier writing out the multiplications for each part of the NHS Number; the first number is multiplied by 10, the second by 9, the third by 8 and so on.
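For comparison, here is one “smarter” version as a sketch of my own (not the post’s original approach): the nine multiplications collapsed into a single weighted sum.

```r
# Multiply the first nine digits by weights 10 down to 2 and add them up
weighted_total <- function(NHSNumber) {
  digits <- as.numeric(strsplit(NHSNumber, "")[[1]][1:9])
  sum(digits * 10:2)
}

weighted_total("5990128088")  # same Total as the lettered columns A to J produce
```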

         End = as.numeric(substring(NHSNumber,10,10)),

    I put this in as the final number in the sequence is the check for the later Digit (also called checksum).

         Total = A+B+C+D+E+G+H+I+J,

    Just adding all the multiplied columns together.

         Remainder = Total %% 11,

    This gives the remainder from a division by 11.

         Digit = 11- Remainder,

    11 take away the remainder/checksum.

         Summary = case_when(Digit == 10 ~ 999,
                             Digit == 11 ~ 0,
                             TRUE ~ Digit),

    Lots of people use if(else) in R but I like case_when because it’s like SQL. The lines run in logical order:

• when the Digit is 10 it’s invalid, so I put in 999 as that’s so high (in the hundreds) it could never match the single-digit checksum,
• when it’s 11, change it to 0, following the methodology,
• else just use the Digit number.

Actually, thinking about it, I probably don’t need the step where 10 becomes 999, as 10 could never match a single-digit number anyway. Perhaps that’s redundant code.

         Valid = case_when(Summary == End & length == 10 ~ TRUE,
                           TRUE ~ FALSE
                           )) %>% 

case_when again, but this time to get the TRUE/FALSE validity. If the number generated is the same as the last digit of the NHS Number AND the length of the NHS Number is 10, then TRUE; else FALSE.

I liked this line of code as it felt a bit strange saying TRUE then FALSE, but it’s logical!

      filter(Valid == FALSE)

    Just bring back what isn’t valid.
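Pulling all the steps above together, the whole check could also be written as one small function; this is my own sketch rather than the script used in the post:

```r
nhs_valid <- function(x) {
  # must be exactly ten digits
  if (!grepl("^[0-9]{10}$", x)) return(FALSE)
  digits <- as.numeric(strsplit(x, "")[[1]])
  # weighted sum of the first nine digits, remainder from 11, then 11 minus that
  check <- 11 - sum(digits[1:9] * 10:2) %% 11
  if (check == 11) check <- 0
  # a check value of 10 can never equal a single digit, so it fails naturally
  check == digits[10]
}

nhs_valid("5990128088")  # an example ten-digit string
```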

    Did it work?

I think it did. If I got anything wrong I’d love to get feedback, but I ran many hundreds of thousands of NHS Numbers through it and found…..

    No invalid numbers

I possibly should have checked beforehand, but I suspect our clinical systems don’t allow any incorrect NHS Numbers to be entered at source. Still, it was fun, and the script could be applied to manually entered data, for example data kept in spreadsheets.

    An interesting blog

As there were no SQL scripts for NHS Number validation on GitHub, I did a quick search on the internet and found this: https://healthanalyst.wordpress.com/2011/08/21/nhs-number-validation/ which is a blog by @HealthAnalystUK from 2011. The reason I’m referring to it in this blog is that HealthAnalystUK not only shared SQL code that looked very similar to the code here, but also R code, written as a function:

    NHSvalidation <- function(NHSnumber){
    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))
    if (
    ((Modulus==J) & (UniformNumberCheck!=1) & (NHSlength==10))|((Modulus==11) & (J==0) & (UniformNumberCheck!=1) & (NHSlength==10)))

    I hadn’t coded the check for repeating numbers:

    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))

    and I couldn’t find any reference to this in the Wiki page or the document from the University of Bristol so I’m unsure if this is a part of the methodology. If it is, then I’ve seen at least 1 NHS Number that would fail this test.
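A repeating-digit check like that could be written in R with a back-referencing regular expression; this is my own sketch of the idea, not HealthAnalystUK’s code:

```r
# TRUE when the string is one digit repeated ten times, e.g. "2222222222"
all_same_digit <- function(x) grepl("^([0-9])\\1{9}$", x, perl = TRUE)

all_same_digit("2222222222")  # TRUE
all_same_digit("5990128088")  # FALSE
```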

    A conclusion

    If anyone has created a package or script for NHS number checks and wants to share please feel free to write a blog. NHS-R Community also has a GitHub repository at https://github.com/nhs-r-community where code like this blog can go (I wrote this in Rmarkdown).

Blogs can be emailed to nhs.rcommunity@nhs.net, where they are checked by a group of enthusiastic volunteers before publishing.

This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • The NHS-R Conference 2019

    We really enjoyed our second ever NHS-R conference in Birmingham, which was attended by about 300 delegates. We tried to ensure that there was something for colleagues who are new to R as well as for those who are more experienced. The conference was inspiring, exciting, friendly and tiring and we obtained some fantastic feedback from attendees about how much they enjoyed the conference.

    “Let me take this opportunity to thank you for organising a brilliant conference. I learnt more in two days than I had in weeks on my own. Your enthusiasm is extremely catchy and inspirational.  Although I am relatively new in R, I am hooked on to it and wonder why I didn’t learn it before. I have already converted one person to R, my husband a consultant Psychiatrist in Edinburgh 🙂 …Once again thank you for your wonderful interactions during two days.”

    Dr Nighat Khan – University of Edinburgh and novice R user.

    “I think that the organisation was excellent, and the quality of the workshops and talks offered was extremely high…. I thoroughly enjoyed catching up with people that I already knew as well as meeting new friends. I have also come away with new skills to hone and new techniques to adopt and share in my work.”

    Caroline Kovacs – PhD Researcher, University of Portsmouth.

    “I just wanted to say I really enjoyed the event and (it was) probably the best conference I’ve attended.”

    Matthew Francis – Principal Public Health Intelligence Analyst and advanced R user.

    “I don’t think I’ve got the words to describe how fantastic the event was, fantastic just isn’t enough. I even presented at this conference after only learning R in the last couple of years. Truly, this was only possible because of the NHS-R community.”

Zoë Turner – Senior Information Analyst at Nottinghamshire Healthcare NHS Foundation Trust

We had some technical glitches on the day and some of the rooms were not ideal, so we want to thank our delegates for being so forgiving, patient and enthusiastic. Likewise, we want to thank our conference organisers, our helpers, our speakers, our funders – The Health Foundation – our sponsors (Mango Solutions, Jumping Rivers, Healthcare Evaluation Data from University Hospitals Birmingham), our partners (AphA, The Strategy Unit, Improvement Academy, University of Bradford, HDRUK) and all the staff at the Edgbaston Cricket Ground for helping make our conference a huge success.

    We have already started thinking about next year’s conference, and if you have any suggestions on how to make the next one better please let us know via twitter (@NHSrCommunity) or email (nhs.rcommunity@nhs.net).

We also made some important announcements about NHS-R Community 2.0. This will be covered in a subsequent blog, but for now there is an announcement of free tickets to the EARL (Enterprise Applications of the R Language) conference 2020, which has been described as “The highlight of the R calendar in Europe” by Andrie de Vries of RStudio.

Following a guest invitation for Mohammed A Mohammed to talk about the NHS-R Community at the EARL conference in Sep 2019, Liz Mathews (Head of Community and Education at Mango Solutions) has generously offered members of the NHS-R Community 6 FREE tickets for EARL 2020, to be held on the 8th – 10th September at The Tower Hotel, St Katharine’s Way, London E1W 1LD. These tickets are worth £900 each and include a day of hands-on training workshops. So how might you win a free ticket?

1. In 200 words maximum, tell us in an email to nhs.rcommunity@nhs.net “Why we should give you the free ticket and how will you ensure this will help the NHS-R Community?”
    2. The subject of the email should state “EARL 2020”.
    3. The last two lines of your email should state that:
      • you confirm that you will arrange your own travel and hotel expenses and that you will get approval from your employer where applicable.
      • In case you win a ticket but are unable to attend you will let us know ASAP so that we can allocate your ticket to someone else.
    4. For recorded talks, photos and highlights from the EARL conference 2019, which provides an excellent overview of what the event entails, please see https://earlconf.com/.
    5. You have until 31 Dec 2019 to submit your entries. Winners will be announced in Jan 2020.
6. A small panel will judge the best entries and put them in a hat.
    7. We will randomly select 6 of the best entries and notify the selected person by email.
    8. If you do not claim your ticket within 3 working days of the notification email, we will exclude you from the process and randomly select someone else for the ticket.
    9. We will repeat the process until all tickets are allocated.
    10. In the event of tickets not being allocated, the NHS-R Team will allocate them as they see fit.

    The NHS-R Community Team

  • NHS-R Community datasets package released

    This post briefly introduces an R package created for the NHS-R Community to help us learn and teach R.

    Firstly, it is now available on CRAN, the major package repository for R, and can be installed like any other package, or directly from GitHub as follows:



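The installation commands were missing from this copy of the post; they would look something like this (the GitHub route assumes you have the remotes package available):

```r
# From CRAN:
install.packages("NHSRdatasets")

# Or the development version, straight from GitHub:
# install.packages("remotes")
remotes::install_github("nhs-r-community/NHSRdatasets")
```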
Several community members have mentioned the difficulties of learning and teaching R using standard teaching datasets. Stock datasets like iris, mtcars, nycflights13 etc. are all useful, but they are out-of-context for most NHS, Public Health and related staff. The purpose of this package is to provide example datasets related to health, or reusing real-world health data. They can be accessed easily and used to communicate examples, learn R in context, or teach R skills in a familiar setting.

    For those of us wanting to contribute to Open Source software, or practise using Git and GitHub, it also provides an opportunity to learn/practise these skills by contributing data.

    What’s in it?

    At present we have two data sets:

Name            Contributor   Summary
LOS_model       Chris Mainey  Simulated hospital data, originally for learning regression modelling
ae_attendances  Tom Jemmett   NHS England’s published A&E attendances, breaches and admission data

    Both datasets have help files that explain the source, data columns, and give examples of use. The package also contains vignettes that show the datasets being put to use. You can find them in the help files, or using:

# List available vignettes in the package:
vignette(package = "NHSRdatasets")

# Load the `ae_attendances` vignette:
vignette("ae_attendances", package = "NHSRdatasets")

    How can you contribute?

    We are aiming to build a repository of different health-related datasets. We are particularly keen to include primary care, community, mental health, learning disabilities, or public health data.

    Please head to the GitHub page, and follow the submission guidelines on the README page. If you would like to add a vignette or alter an existing one, follow the same instructions: fork the repository, commit and push your changes, then make a ‘pull request.’ We will then review your submission before merging it into the package.

    Please ensure that any data you submit is in the public domain, and that it meets all obligations under GDPR and NHS/Public Health information governance rules.

    To see the package code, or contribute, visit https://github.com/nhs-r-community/NHSRdatasets .


The NHSRdatasets package is a free, collaborative datasets package giving context for the NHS-R community when learning or teaching R. The package is available on CRAN. We actively invite you to submit more datasets and help us build this package.

    To read the reference material, and vignettes, you can find them in the package help files or go to the pkgdown website: https://nhs-r-community.github.io/NHSRdatasets/

    Chris Mainey
    Intelligence Analyst

    Health Informatics – University Hospitals Birmingham NHS Foundation Trust

  • How R changed me as an analyst

    I suspect there are many blogs about R and why it’s so great:

    • It’s free!
    • It’s open source!
    • There’s a great community!
    • It’s reproducible!

You can certainly read on Twitter (#rstats) about what R can do for you, but what about what R does to you, particularly as an analyst in the NHS or social care?

    The learning bit…

    Back in April 2018 when NHS-R Community was running its first introduction to R course in Leeds, my only knowledge of it had come from a free online base R course with edX that I hadn’t really understood and never finished. Online learning for me is like being back at school. I get the right answer, feel good, move on and promptly forget it all. After the NHS-R course I dabbled with the dplyr package, tried to run lots of other people’s scripts and generally failed a lot. It was a frustrating time of slow delivery of tasks and bafflement. When things did work, I had no idea why and I often wondered what all the fuss was about particularly as I could do the same things in familiar programmes.

    Hindsight is a wonderful thing and I can now see my frustrations weren’t just one lump of confusion, but could be split down into how I used these ‘familiar programmes’, namely:

    • SQL for data engineering and
    • Excel for visualisations.

Although I still used (and use) SQL to get my data, I was copying it to Excel and then loading it into R; once loaded, I’d realise I needed to group by and count, or remove something I didn’t need, and it seemed too long-winded to go back to SQL, copy to Excel and load it all over again.

    The second frustration of visualising in R came with the inability to replicate the look of the Excel charts in R: getting the same colours, the same font size headers and so on. I’ve yet to resolve that completely but it was here that I realised the worth of R wasn’t in making it look like Excel, but rather that it could do so much more than Excel. I needed to start thinking about what I should be visualising and how to do it.  

    Sharing methodology

Over my time in the NHS I have learned to be cautious – protective even – of data. That has led to a misguided fear of sharing technical knowledge; it was never a conscious thing, it’s just how it is. However, R has a reputation of sharing, which has resulted in an expectation of sharing. And that isn’t just within your own team or organisation; it’s so much wider – it can even be the world.

    As an example of why it’s harder to share Excel methodology, I’ve previously built a benchmarking Excel spreadsheet using MATCH and INDEX so that the bar charts automatically coloured the organisation I worked for and ordered the bars in graphs from the greatest to the smallest. It was one of those tasks that took a lot of effort to automate, looked simple when done but was heavily dependent on the data being in just the right cell or it would break.

    Just updating it with a new year’s data would take great care so the thought of writing out the methodology to share never occurred to me. Writing it out would involve describing the positions of data, what the formulae did and how bits all linked. That’s a laborious task which is not necessary if you don’t plan to share – and as there was no need to share, I didn’t.

    It’s all about the data

The entire process, from SQL to Excel, is about the data, e.g. how it joins and what it’s counting. Getting the data ‘just so’ often requires so many clever solutions to so many problems that, as I now realise, it consumes so much thinking time that there’s often little energy left for considering why I am doing this: is it the best thing to do, and how can I get more meaning from the data?

    If I’d picked up someone else’s script or Excel document on ordering benchmarking data, perhaps the time I spend would be on improving it instead of building it. In a perfect world, I would feed back on what I’d done and share any improvements or corrections.

    But what can R do that SQL and Excel can’t?

    As a very simple example, consider creating a median average.

In Excel it’s reasonably easy to use the formula MEDIAN(), but to make it dynamic (such that it automatically updates if you add more data) the formula becomes much more complicated. Here is a page explaining how it’s done:


Lots of graphics are used to describe how to do it, and you’ll note this is for AVERAGE, which is the mean rather than the median.

    In SQL, creating the MEDIAN can be solved various ways:


There are 204 examples to go through to solve this! I didn’t go through them, as that’s too much needless thinking when R can do this…


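The code example was missing from this copy of the post, but in R it is almost certainly just the built-in median() function; a minimal sketch with made-up numbers:

```r
x <- c(2, 4, 6, 8, 100)  # made-up values

median(x)  # 6, and it stays correct however many values you add to x
```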
    Something this simple removes all that otherwise inevitable necessary thinking to figure out the best way to get the median… and then having to check that it is right. Although that may be easy enough in Excel, I know I make mistakes and will have to repeat the exercise more times than I care to admit, and doing so in Excel will involve so many steps that each require checking. All of this uses up that precious resource of focused thinking. With R doing the job so quickly and reliably I now have time to consider if median is actually the right thing to use, particularly if it’s a case of “we’ve always done that”. Then I can ponder on what is it telling me; is it increasing/decreasing and why is that? Is there any other data that could help explain this change?  

    Like any great piece of analysis or coding, time invested at the beginning pays off. With the benchmarking example of reordered bar charts, spending a couple of days getting it right made something interactive that was engaging for the users. But Excel continues to require thinking and time checking whereas R doesn’t; once it’s coded, that’s it. And that’s where the analyst should be, that space after the work is done. That’s where you can think about the very analysis itself; was it enough? What does it tell you? What else can you look at? Although I’ve had analyst in my job title for a long time, this is what analysis is all about and it’s something, I now realise, that I’ve not been doing because my “thinking time” has been used up elsewhere.

This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • Dygraphs

    Dygraphs for mortality surveillance

I recently presented some of the mortality surveillance charts we use to @RLadiesLondon (a very welcoming group!) and one that got some interest was a chart of Nottinghamshire Healthcare NHS Foundation Trust deaths compared to ONS Provisionally Registered deaths. The chart looks good because it’s interactive, but this type of chart can be confusing because of its two y-axes.

    When I show this report I make it clear that the two axes units are very different and that its purpose is to show that the pattern of the deaths in the wider population matches that of the deaths recorded by the Trust. It’s well known within Public Health that the pattern of deaths is seasonal, with a peak around January in the UK. However, this Public Health knowledge is not necessarily common knowledge in Secondary Care Trusts and it was one of the great benefits of having @IantheBee both create and present this report.

    Getting ONS Provisional Data

    I wrote about getting and formatting the spreadsheets from ONS for the East Midlands Provisionally Registered deaths:

but for the purposes of the mortality surveillance report I’ve used several years’ data; I normally keep the spreadsheets, save the merged data and then load that each time I need to run the R Markdown report.

    For the purposes of this blog I’ve amended the formatting code by adding functions so each year can be transformed and is available to plot:


# Download ONS spreadsheets Function-----------------------------------------------

GetData_function <- function(www, file){
  download.file(www,
                destfile = file,
                method = "wininet", # use "curl" for OS X / Linux, "wininet" for Windows
                mode = "wb") # wb means "write binary"
}




# 2016
GetData_function("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2016/publishedweek522016.xls",
                 "DeathsDownload2016.xls")

library(readxl)

# Import correct sheets ---------------------------------------------------
Deaths_Now <- read_excel("DeathsDownload2019.xls",
                         sheet = 4,
                         skip = 2)
Deaths_2018 <- read_excel("DeathsDownload2018.xls",
                          sheet = 4,
                          skip = 2)
Deaths_2017 <- read_excel("DeathsDownload2017.xls",
                          sheet = 4,
                          skip = 2)
Deaths_2016 <- read_excel("DeathsDownload2016.xls",
                          sheet = 4,
                          skip = 2)

# Look up code to remove excess rows --------------------------------------
LookupList <- c("Week ended",
                "Total deaths, all ages",
                "Total deaths: average of corresponding",
                "E12000004")

# Function for formatting data --------------------------------------------
Transform_function <- function(dataframe){

    #Format data frames
    df <- dataframe %>%
    clean_names %>%
    remove_empty(c("rows","cols")) %>%
    filter(week_number %in% LookupList)

    #Transpose table
    df <- t(df)

    #Whilst this is a matrix make the top row the header
    colnames(df) <- df[1, ]

    #Make this object a data.frame
    df <- as.data.frame(df)

    #Function to find 'not in'
    '%!ni%' <- function(x,y)!('%in%'(x,y))

    #List of things to remove to tidy the data.frame
    remove <- c("E12000004", "East Midlands")

#remove the rows and ensure dates are in the correct format
df <- df %>%
filter(E12000004 %!ni% remove) %>%
mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)),
                                          date_system = "modern"))

df$`Week ended` <- as.Date(df$`Week ended`, format = '%Y-%m-%d')

df <- df %>%
mutate(date = if_else(is.na(`Week ended`), serialdate, `Week ended`))

#Final transformation of data
df %>% select(`Total deaths, all ages`, date) %>%
filter(!is.na(`Total deaths, all ages`)) %>%
mutate(`Total deaths, all ages` = as.numeric(as.character(`Total deaths, all ages`))) #To match other data.frames
}

    # Run functions -----------------------------------------------------------
    DeathsNow <- Transform_function(Deaths_Now)
    Deaths2018 <- Transform_function(Deaths_2018)
    Deaths2017 <- Transform_function(Deaths_2017)
    Deaths2016 <- Transform_function(Deaths_2016)

# Merge data -----------------------------------------------------
Deaths <- bind_rows(DeathsNow,
                    Deaths2018,
                    Deaths2017,
                    Deaths2016) %>%
mutate(date = as.Date(date),
       `Total deaths, all ages` = as.numeric(`Total deaths, all ages`))

This code may give a few warnings saying that NAs have been introduced by coercion, which is because many cells in the spreadsheets have no data in them at all. It’s a good thing they have nothing (effectively NAs), as having 0s could confuse analysis because it wouldn’t be clear whether a 0 is a real 0 or missing data.

To suppress warnings in R Markdown, add warning=FALSE to the chunk header; however, I like to keep the warnings just in case.

    If you want to keep all the data after merging it together use:

    library(openxlsx) # To write to xls if required. 

    #Export complete list to excel for future
    write.xlsx(Deaths, 'ImportProvisionalDeaths.xlsx')

    If you’ve saved the combined file, to call it again in a script use the following code:

library(readxl)

Deaths <- read_excel("ImportProvisionalDeaths.xlsx")

    Dygraph chart

    The following data is randomly generated as an example:


#Fix the randomly generated numbers
set.seed(178)

    Alldeaths <- Deaths %>%
    select(date) %>%
    mutate(n = rnorm(n(), mean=150))

    Merge the two data frames together:

ONSAlldeaths <- Deaths %>%
left_join(Alldeaths, by = "date") %>%
mutate(ds = as.POSIXct(date)) %>%
select(ds, y2 = n, y = `Total deaths, all ages`)

    Dygraphs require dates to be in a time series format and the package xts can convert it:

library(xts)

ONSAlldeaths_series <- xts(ONSAlldeaths, order.by = ONSAlldeaths$ds)

    The date column is no longer needed so can be removed but this needs to be done using base R and not dplyr:

#Remove duplicate date column
ONSAlldeaths_series <- ONSAlldeaths_series[, -1]

    Finally, the xts can be plotted:

library(dygraphs)

dygraph(ONSAlldeaths_series, main = "East Midlands Weekly Deaths/Randomly generated numbers") %>%
dySeries("y", axis = "y", label = "East Midlands") %>%
dySeries("y2", axis = "y2", label = "Random numbers") %>%
dyAxis("x", drawGrid = FALSE) %>%
dyAxis("y", drawGrid = FALSE, label = "East Midlands") %>%
dyAxis("y2", independentTicks = TRUE, drawGrid = FALSE, label = "Random numbers") %>%
dyOptions(stepPlot = TRUE) %>%
dyRangeSelector()

    Note this chart is not interactive in this blog.

When you’ve plotted the chart, hovering the cursor over the points shows information about them, and you can zoom in using the scrolling bar at the bottom of the chart (this comes from the dyRangeSelector() code).

    Other options are detailed here: https://rstudio.github.io/dygraphs/index.html

    ONS Provisional data

One of the things that may stand out in the chart are the big dips around 29–31 December each year; we presume these relate to weeks where registrations may be delayed from GP practices to ONS because of the public holidays around Christmas.

Unfortunately, only provisionally registered deaths are available by week, as confirmed figures are monthly: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/monthlyfiguresondeathsregisteredbyareaofusualresidence


    The case of the mysteriously disappearing interactive graph in R Markdown html output

    I’d rendered (knit or run) the html reports with the interactive graphs and it had all worked so I emailed the report to people as promised and then the emails came back: “Some of the graphs are missing, can you resend?”. Perplexed, I opened the saved file from the server and, yes, indeed some of the charts had disappeared! Where there should be lovely interactive charts were vast swathes of blank screen. What had happened? The code ran fine, looked fine and how do you even google this mystery?

It turns out my default browser, and I suspect this is true throughout most of the NHS because lots of NHS systems depend on it, is Internet Explorer. Whilst I have the latest version, these reports have never opened properly in Explorer.

    The solution: Chrome (or some other browser). I ask people to copy the link from the Explorer web address bar after opening it from the email and simply paste it to Chrome.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust. 

    @AppliedInfoNott and @Letxuga007

  • Exact Matching in R

    Exact matching in R: a case study of the health and care costs associated with living alone, using linked data from the London Borough of Barking & Dagenham

    I’ve been working with a group of analysts in East London who are interested in joined-up health and social care data. They’ve created a powerful, unique dataset that shows how each resident of the London Borough of Barking & Dagenham interacts with NHS and council services. The possibilities are enormous, both in terms of understanding the way that people use services across a whole system, and for more general public health research.

    You can download a sample of the data here:

    Today we’re interested in whether social isolation is related to healthcare costs, and we’re going to use exact matching to explore this issue. Our theory is that people who live alone have higher healthcare costs because they have less support from family members.

    Here’s an extract from the data. ‘Total cost’ means the cost of health and social care in one year. The actual dataset has many more variables!

ID  Age group  Sex     Number of long-term conditions  Lives alone  Total cost
1   50-64      Male    0                               No           £93
2   65-74      Female  1                               Yes          £0
3   50-64      Male    4                               No           £1065
4   85+        Female  5                               Yes          £7210

    For the purposes of this blog, we’ll use a sample of 5000 (out of 114,000) individuals, with some changes to the values to ensure anonymity.

    Descriptive stats

    The first thing we’ll look at is whether people who live alone are different. I find descriptive tables fiddly so often use a function that can be re-used across a number of variables:

    # first, save the data provided above and change your working directory to the place you've saved it with setwd()
    # read data
    library(data.table) # useful for efficient reading and summarising of data
    d <- fread("sample_lbbd_dataset.csv")
    # describe age, sex and number of long-term conditions
describe.by.isolation <- function(variable) {
 a <- table(unlist(d[, variable, with = F]), d$livesalone)
 p <- round(prop.table(a, 2) * 100, 0)
 matrix(paste0(a, ' (', p, ')'), ncol = 2,
        dimnames = list(row.names(a), c('Lives with others', 'Lives alone')))
}
lapply(c('age_grp', 'sex', 'LTC_grp'), describe.by.isolation)
    # describe healthcare costs
    d[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(d$cost[d$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(d$cost[!d$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    People who live alone are older, more likely to be female and have more long-term health problems. Their mean healthcare costs are £2,347 higher. The difference in healthcare costs is visible on the histograms, which we have displayed on a log scale because some values are extremely high. There’s some ‘zero inflation’ in both groups – people with no healthcare costs who do not fit into the otherwise lognormal-ish distribution. So far not too surprising – but are the increased healthcare costs explained by the differences in age, sex and health?
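As a quick illustration of the ‘zero inflation’ point, the share of zero-cost individuals in each group can be computed in one line. This is a sketch using a small invented stand-in for the real dataset:

```r
library(data.table)

# toy stand-in for the real dataset (values invented for illustration)
d_toy <- data.table(livesalone = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
                    cost = c(0, 7210, 250, 93, 0, 1065))

# proportion of individuals with zero cost, by living arrangement
d_toy[, .(prop_zero = mean(cost == 0)), livesalone]
```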


    One approach would be to use regression. We could fit a linear model – this is actually not a great fit for patients with higher healthcare costs, but we won’t go into that here.

    linear_model <- lm(cost ~ livesalone + age_grp + sex + LTC_grp, d)
    plot(linear_model) # diagnostics show that a linear model is not a great fit. You might have to press return to see all the plots before you can continue.

    The results suggest that health and care costs for people who live alone are £892 more than those who do not live alone, on average. Clearly the variables we added to the model are important confounders, as this is much lower than the ‘crude’ difference of £2,347. Two limitations of this approach are that we will have to think quite carefully about the fit of the model to the data, and that we can’t describe how living alone affects the distribution of costs.
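If you do go down the regression route, the adjusted difference and its uncertainty can be pulled straight from the fitted object. A sketch on simulated data (variable names follow the model above; the toy dataset and its built-in £800 effect are invented):

```r
# simulate a toy dataset with a built-in 'lives alone' effect of £800
set.seed(1)
n <- 200
toy <- data.frame(livesalone = rep(c(TRUE, FALSE), each = n / 2),
                  age_grp = sample(c('50-64', '65-74', '75-84', '85+'), n, replace = TRUE),
                  sex = sample(c('Male', 'Female'), n, replace = TRUE),
                  LTC_grp = sample(0:4, n, replace = TRUE))
toy$cost <- 500 + 800 * toy$livesalone + 300 * toy$LTC_grp + rnorm(n, 0, 400)

linear_model <- lm(cost ~ livesalone + age_grp + sex + LTC_grp, toy)
coef(linear_model)['livesaloneTRUE']     # adjusted difference in mean cost
confint(linear_model, 'livesaloneTRUE')  # 95% confidence interval
```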

    Exact matching

    We therefore used exact matching, in which each individual who lives alone is matched to an individual who does not live alone, based on specified variables such as age group and sex. We developed a function that does this by stratifying the data and then matching ‘cases’ and ‘controls’ randomly. Unmatched individuals are deleted from the dataset, leaving a dataset that is balanced in terms of the variables you specify. Let me know here if there’s any other functionality you want and we can try and incorporate it.
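The idea can be sketched in a few lines before looking at the full function: within each stratum of the matching variable, keep as many randomly-chosen controls as there are cases and discard the rest. The toy data below is invented for illustration:

```r
library(data.table)

# toy data: one row per person, a combined matching stratum, and a case flag
toy <- data.table(id = 1:8,
                  stratum = c('F65-74', 'F65-74', 'F65-74', 'M50-64',
                              'M50-64', 'M50-64', 'M50-64', 'F85+'),
                  case = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE))

set.seed(42)
matched <- toy[, {
  n <- min(sum(case), sum(!case))           # matchable pairs in this stratum
  rbind(.SD[case == TRUE][sample(.N, n)],   # randomly sampled cases
        .SD[case == FALSE][sample(.N, n)])  # randomly sampled controls
}, stratum]

matched[, .N, .(stratum, case)]  # balanced within each stratum
```

Note that the ‘F85+’ stratum contributes nothing, because it has a control but no case — exactly the systematic deletion discussed later in the post.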

    Let’s try matching on age group, sex and the grouped number of long-term conditions:

    # inputs are:
    # - data: a data.table containing the variables below
    # - treat: the treatment/exposure variable (a string giving the variable name)
    # - mvar: the matching variable (a string giving the variable name). If you want to match on multiple variables, concatenate them first.
    # - ratio: the ratio of cases to controls (an integer > 0)
    # - seed: a seed fixing the random selection of cases/controls (an integer; default NULL means no seed). Choice of seed is arbitrary.
    # returns data.table of matched observations, with additional variable 'id' for use in paired/grouped analyses
    smatch <- function (data, treat, mvar, ratio = 1, seed = NULL) {
      setnames(data, mvar, '.mvar')
      targ <- data[, .(case = sum(get(treat)), control = sum(!get(treat))), .mvar]
      targ[, cst := floor(pmin(control / ratio, case))]
      targ[, cnt := cst * ratio]
      targ <- targ[cst > 0]
      l2 <- cumsum(targ$cst)
      ids <- mapply(':', c(0, l2[-nrow(targ)]), l2 - 1)
      names(ids) <- targ$.mvar
      case <- NULL
      control <- NULL
      x <- .Random.seed
      on.exit({.Random.seed <- x})
      if (!is.null(seed)) set.seed(seed)
      for(i in targ$.mvar) {
        case[[i]] <- data[get(treat) == T & .mvar == i][sample(.N, targ$cst[targ$.mvar == i])]
        case[[i]][, id := ids[[i]]]
        control[[i]] <- data[get(treat) == F & .mvar == i][sample(.N, targ$cnt[targ$.mvar == i])]
        control[[i]][, id := rep(ids[[i]], each = ratio)]
      }
      rbindlist(c(case, control))
    }
    # create a single variable summarising matching variables
    d$mvar <- do.call('paste0', d[, c('age_grp', 'sex', 'LTC_grp')])
    # create 1:1 matched dataset.
    matched_data <- smatch(d, treat = 'livesalone', mvar = 'mvar', ratio = 1, seed = 74)
    # check balance: same number of individuals in each group
    dcast(matched_data, age_grp + sex + LTC_grp ~ livesalone, value.var = 'id', fun.aggregate = length)

    Now we have a dataset that is balanced in terms of age, sex and the count of long-term conditions. Let’s see how healthcare costs compare:

    matched_data[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
    # histograms of cost
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(matched_data$cost[matched_data$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(matched_data$cost[!matched_data$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    # t-test (in reality you might want a paired test, and to check whether a t-test is appropriate)
    t.test(cost ~ livesalone, matched_data) # notice how wide the confidence intervals are for this reduced dataset

    # proportion with costs over £10000
    matched_data[, .(over10k = sum(cost > 10000) / .N), livesalone]

    The mean difference is £803. When we used the whole dataset, this value was even closer to the coefficient from linear regression. It’s now difficult to see a difference in the histograms, but you can easily create any description of the distribution that you like – e.g. the proportion of patients that have costs over £10,000.

    Who got matched and who didn’t?

    The point of matching was to create a comparison group of people who don’t live alone who were in some ways similar to the group who do. We probably had to delete lots of people who don’t live alone in a systematic way (e.g. men and younger people who do not live alone were more likely to be deleted). We might also have deleted some of the group who do live alone, which could be more problematic if we want to generalise our results to the population. Let’s see who got deleted…

    d[, matched := ifelse(PID %in% matched_data$PID, 'matched', 'unmatched')]
    # just looking at age group for now
    compare_matched <- dcast(d, age_grp ~ livesalone + matched, value.var = 'PID', fun.aggregate = length)
    compare_matched[, 'TRUE_total' := TRUE_matched + TRUE_unmatched]
    compare_matched[, lapply(.SD, function(x) x/sum(x) * 100), .SDcols = 2:6]

    You can see that some of the ‘lives alone’ group were deleted (‘TRUE_unmatched’), and they were all in the older age groups. The difference between everyone who lives alone (‘TRUE_total’) and the matched group (‘TRUE_matched’) is diluted, because the oldest groups are a relatively small part of the data. Nonetheless, I would say this is a fairly important difference. If you are concerned about generalisability to the population, you might want to restrict the analysis to people aged under 85. In the full dataset this was not a problem (as there were lots more potential ‘controls’), but you might encounter similar issues if you match on more detailed variables.

    Final point! We’ve focused on a technical solution to exact matching. We haven’t really thought about which variables we should be matching on. This is a matter of judgement, and needs a bit of thinking before we dive into the matching process. Just like covariates in regression, the matching variables are confounders – they’re alternative explanations for an association between living alone and healthcare costs. Age and sex are clearly important. Health status is more complex. Do you want to think about long-term conditions as something that cause people to live alone, or the product of living alone? If people are sicker because they live alone, matching on the number of long term conditions might falsely reduce the association between living alone and healthcare costs.

    This blog was written by Dan Lewer, NIHR Doctoral Research Fellow / SpR Public Health. Department of Epidemiology & Public Health, UCL

    Twitter: danlewer

  • Count of working days function

    It’s at this time of year I need to renew my season ticket and I usually get one for the year. Out of interest, I wanted to find out how much the ticket costs per day, taking into account that I don’t use it on weekends or during my paid holidays. I initially started working it out in Excel, but got as far as typing the formula =WORKDAYS() before realising it was going to take some working out, and that perhaps I should give it a go in R as a function…

    @ChrisBeeley had recently shown me functions in R and I was surprised how familiar they were: I’d seen them on Stack Overflow (and usually skimmed over them), and they are similar to functions in SQL, which I’ve used (though not written), where you feed in parameters. When I write code I normally work out how each part works and build it up, but writing a function means running the whole thing and then checking the result; the objects created inside the function never materialise, so they are never available to check. Not cluttering the global environment with objects is one of the benefits of using a function, along with not repeating scripts which then ALL need updating if something changes.

    Bus ticket function

    This is the final function which, if you run it, you’ll see just creates a function.

    library(lubridate)
    library(timeDate)

    #Week starts on Sunday (1)
    DailyBusFare_function <- function(StartDate, EmployHoliday, Cost, wfh){
      startDate <- dmy(StartDate)
      endDate <- as.Date(startDate) %m+% months(12)
    #Now build a sequence between the dates:
      myDates <- seq(from = startDate, to = endDate, by = "days")
      working_days <- sum(wday(myDates) > 1 & wday(myDates) < 7) -
        length(holidayLONDON(year = lubridate::year(startDate))) - EmployHoliday - wfh
      per_day <- Cost/working_days
      per_day
    }

    Running the function, you feed in the parameters, which don’t create objects of their own:


    [1] 2.707965

    Going through each line:

    To make sure each part within the function works, it’s best to write it in another script or move the bit between the curly brackets {}.

    Firstly, the startDate is self explanatory but within the function I’ve set the endDate to be dependent upon the startDate and be automatically 1 year later.

    Originally when I was trying to find the “year after” a date I found some documentation about lubridate’s function dyear:

    #Next couple of lines needed to run the endDate line!
    startDate <- dmy("11/07/2019")
    endDate <- startDate + dyears(1)

    but this gives an exact year after a given date and doesn’t take into account leap years. I only remember this because 2020 will be a leap year so the end date I got was a day out!

    Instead, @ChrisBeeley suggested the following:

    endDate <- as.Date(startDate) %m+% months(12)
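The difference is easy to check interactively (a sketch assuming lubridate is loaded): dyears(1) adds a fixed duration, while %m+% months(12) moves the calendar date and rolls impossible dates back to the end of the month.

```r
library(lubridate)

startDate <- dmy("11/07/2019")
startDate + dyears(1)              # a fixed duration: not the same calendar date a year on
startDate %m+% months(12)          # "2020-07-11": the same calendar date next year
dmy("29/02/2020") %m+% months(12)  # "2021-02-28": impossible 29 February rolled back safely
```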

    Next, the code builds a sequence of days. I got this idea of building up the days from the blogs on getting days between two dates but it has also come in use when plotting over time in things like SPCs when some of the time periods are not in the dataset but would make sense appearing as 0 count.


    #To run so that the sequencing works
    #using as.Date() returns incorrect date formats 0011-07-20 so use dmy from
    #lubridate to transform the date
      startDate <- dmy("11/07/2019")
      endDate <- as.Date(startDate) %m+% months(12)
    #Now build a sequence between the dates:
      myDates <- seq(from = startDate, to = endDate, by = "days")
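The zero-count idea mentioned above can be sketched like this: merge sparse event counts onto the complete date sequence so days with no events appear as 0 rather than being absent (dates and counts invented for illustration):

```r
# sparse counts: only the days on which something happened
events <- data.frame(date = as.Date(c("2019-07-11", "2019-07-14")), n = c(3, 1))

# the complete sequence of days we want to plot over
all_days <- data.frame(date = seq(as.Date("2019-07-11"), as.Date("2019-07-15"), by = "days"))

# left-join onto the full sequence, then replace the missing counts with 0
filled <- merge(all_days, events, all.x = TRUE)
filled$n[is.na(filled$n)] <- 0
filled  # five rows, with 0 on the days that had no events
```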

    All of these lines return values, so trying to open them from the Global Environment won’t do anything. It is possible to view the first parts of the values, but also to type:

    str(myDates)
    #compactly displays the structure of an object, including the format (Date in this case)

    Date[1:367], format: "2019-07-11" "2019-07-12" "2019-07-13" "2019-07-14" "2019-07-15" ...

    summary(myDates)
    #gives a summary of the structure

            Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
    "2019-07-11" "2019-10-10" "2020-01-10" "2020-01-10" "2020-04-10" "2020-07-11"

    To find out what a function does type ?str or ?summary in the console. The help file will then appear in the bottom right Help screen.

    Next I worked out working_days. I got the idea from a blog which said to use length and which:

    working_days <- length(which((wday(myDates)>1&wday(myDates)<7)))

    Note that the value appears as 262L, which is an integer count of a logical vector. Typing ?logical into the Console gives this in the Help:

    Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

    I was familiar with length, which essentially counts the elements of factors or vectors (type ?length into the Console for information), but which wasn’t something I knew about. @ChrisBeeley explained what which does with the following example:

    #Generate a list of random logical values
    a <- sample(c(TRUE, FALSE), 10, replace = TRUE)
    #Look at the list
    a

    #using which against the list gives the positions in the list where the logic = TRUE
    which(a)

    [1] 4 5 8

    #counts how many logical arguments are in the list (should be 10)
    length(a)

    [1] 10

    #counts the number of TRUE logical arguments
    length(which(a))

    [1] 3

    Then @ChrisBeeley suggested just using sum instead of length(which()), as sum counts the TRUE values of a logical vector directly:

    sum(a)

    [1] 3

    It seems this has been discussed on Stack Overflow before: https://stackoverflow.com/questions/2190756/how-to-count-true-values-in-a-logical-vector

    It’s worth noting that sum will return NA if the vector contains NAs, so the example on Stack Overflow suggests adding:

    sum(a, na.rm = TRUE)

    [1] 3

    This won’t affect the objects created in this blog as there were no NAs in them but it’s just something that could cause a problem if used in a different context.

    working_days <- sum(wday(myDates)>1&wday(myDates)<7)
    #Just to check adding na.rm = TRUE gives the same result
      working_days <- sum(wday(myDates)>1&wday(myDates)<7, na.rm = TRUE)

    I then wanted to take into account bank/public holidays as I wouldn’t use the ticket on those days so I used the function holidayLONDON from the package timeDate:

    length(holidayLONDON(year = lubridate::year(startDate)))

    [1] 8

    I used lubridate::year because the package timeDate also uses a parameter called year, so the code would otherwise read year = year(startDate), which is confusing to me, let alone to the function!

    Again, I counted the vectors using length. This is a crude way of getting bank/public holidays as it is looking at a calendar year and not a period (July to July in this case).
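One possible refinement (a sketch, assuming the timeDate and lubridate packages are available): pull the holidays for both calendar years the ticket spans, then count only those that fall inside the start/end window, rather than taking a whole calendar year.

```r
library(lubridate)
library(timeDate)

startDate <- dmy("11/07/2019")
endDate <- as.Date(startDate) %m+% months(12)

# holidays for each calendar year the ticket touches, converted to Date
hols <- as.Date(c(holidayLONDON(year = year(startDate)),
                  holidayLONDON(year = year(endDate))))

# keep only the holidays inside the actual ticket period
sum(hols >= startDate & hols < endDate)
```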

    I did look at a package called bizdays but, whilst it seemed good for building a calendar, I couldn’t work out how to make it work, so I stuck with the timeDate package. As I get more confident in R, it might be something I look into the actual code for, because all packages are open source and available to view through CRAN or GitHub.

    Back to the function…

    I then added “- EmployHoliday” so I could reduce the days by my paid holidays and “- wfh” to take into account days I’ve worked from home and therefore not travelled into work.

    The next part of the code takes the entered Cost and divides it by working_days, printing the output to the screen:

    per_day <- Cost/working_days
    per_day

    And so the answer to the cost per day is printed in the Console:

    [1] 2.707965

    A conclusion… of sorts

    Whilst this isn’t really related to the NHS it’s been useful to go through the process of producing a function to solve a problem and then to explain it, line by line, for the benefit of others.

    I’d recommend doing this to further your knowledge of R at whatever level you are and particularly if you are just learning or consider yourself a novice as sometimes blogs don’t always detail the reasons why things were done (or why they were not done because it all went wrong!).

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @AppliedInfoNott and @Letxuga007

    The R Markdown file is available here: