• NHS-R Member Profile – Robin Hinks

    R is a powerful tool for manipulating health and care data and a lot can be learned from sharing our experiences of using R with others. We bring to you an NHS-R profile from one of our Community members, to share their insider knowledge of using R…

    Robin Hinks
    Research and Policy Officer
    Chartered Society of Physiotherapy

    How did you first meet R?
While working as a civil service analyst, I was encouraged to undertake self-directed learning in R to conduct statistical and geographic analysis.
    What sort of things do you use R for and what do you love about R?
Through previous roles – where I have done quality assurance and validation of other research teams’ work – I know the value of well-documented analytical processes, and the dangers of poor record keeping! I love how R allows you to keep all your analysis, research notes and – through R Markdown – reporting in one place.
    What do you hate about R?
I have a qualitative research background and using R has been my first real exposure to code development. While I found the move from, say, SPSS’ ‘point and click’ environment easy enough, I have found it difficult to get my head round the wider world of code development that surrounds R: learning about pulls, commits, splits and the like has been challenging!
    What are your top tips for using R?
Start by coding up some tasks you already have a process for elsewhere – e.g. automating some data transformations you’ve previously done in a programme like SPSS or Excel. Working out how to translate a task into R’s environment is a lot easier than starting from a blank slate.
    Can you please name a project where you have used R? Briefly describe what this involves.
Health data and statistics are reported against a range of geographies that do not easily match up with the political geographies our members might seek to influence – e.g. parliamentary constituencies or local authorities. I’ve used R to develop lookup tables between different geographic areas and, using the leaflet package, to visually map different geographic areas, developing simple choropleth and point maps for internal insight work.
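For anyone curious about what that kind of leaflet map involves, here is a minimal sketch with invented locations (not Robin’s actual code or data); a choropleth would use addPolygons() with boundary data instead of addCircleMarkers().

library(leaflet)

# Illustrative sketch only: a simple point map with made-up coordinates
sites <- data.frame(
  name = c("Site A", "Site B"),
  lat  = c(52.95, 53.38),
  lng  = c(-1.15, -1.47)
)

leaflet(sites) %>%
  addTiles() %>%                                         # background map tiles
  addCircleMarkers(lng = ~lng, lat = ~lat, label = ~name) # one marker per site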

  • SPC Charting in R

By Christopher Reading
Divisional Information Specialist, Worcestershire Acute Hospitals NHS Trust


    For some time the Information Department at Worcestershire Acute NHS Trust has been making use of statistical process control (SPC) charts for internal reporting purposes. This has formed part of our drive toward better decision-making (as recommended in NHSI’s Making Data Count https://improvement.nhs.uk/resources/making-data-count/).

In doing so we have made extensive use of NHSI’s Excel-based SPC tool and have also sought to implement this methodology outside of the confines of MS Excel (i.e. within our SQL/SSRS-based reporting suite).

    As the Department’s unofficial ‘R Champion’, I have driven efforts to increase my team’s knowledge and usage of R over the last six months. My experience with NHSI’s resources suggested that R offered a route to more efficient, consistent and quickly reproducible SPC charting. I set about developing a charting function within R which would replicate NHSI’s logic and methodology[1].

    I developed and tested a custom function in R which requires two primary arguments: a series of data points, and a set of accompanying date values. The function then creates a data frame containing the data series, its mean and moving average values, and upper and lower control limits. The series is then tested against NHSI’s methodology, and special cause variations are highlighted and colour coded. This formatting is done according to a secondary function argument which identifies whether an increase or decrease in the series indicates system improvement. This data frame is then visualised using ggplot, which displays the SPC and any additional information such as a performance trajectory or national target.
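The function itself is not reproduced here, but as a rough indication of its shape, the sketch below (illustrative only, not the author’s actual implementation) takes values and dates, derives a centre line and control limits from the moving range, flags points beyond three sigma and plots the result with ggplot. The real function also takes an argument for the direction of improvement and applies the full set of NHSI rules.

library(dplyr)
library(ggplot2)

# Sketch only: a cut-down SPC-style chart function
spc_sketch <- function(values, dates) {
  df <- tibble(date = dates, value = values) %>%
    mutate(centre  = mean(value),
           mr      = abs(value - lag(value)),        # moving range
           sigma   = mean(mr, na.rm = TRUE) / 1.128, # sigma estimated from the moving range
           ucl     = centre + 3 * sigma,
           lcl     = centre - 3 * sigma,
           special = value > ucl | value < lcl)      # just one of the SPC rules

  ggplot(df, aes(date, value)) +
    geom_line() +
    geom_point(aes(colour = special), size = 2) +
    geom_hline(aes(yintercept = centre), linetype = "dashed") +
    geom_hline(aes(yintercept = ucl)) +
    geom_hline(aes(yintercept = lcl)) +
    theme_minimal() +
    theme(legend.position = "none")
}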

    I then tested the function and compared against our existing SPC reporting. A few logical ‘gremlins’ in the function were identified and subsequently removed, and once I was happy with the function it was integrated into a growing departmental R package (currently only internally available) for use in R Markdown reporting and our expanding R Shiny dashboard repertoire.

    My next step was to use Shiny to create an SPC Wizard app, to enable colleagues without R knowledge to test and utilise the SPC function. The app allows users to supply CSV files containing multiple data series, and generate SPC charts with little or no effort. These can then be exported as image files for Trust reporting. The app allows users to make formatting changes to the chart such as customising main and axis titles, customising the frequency of axis labels and size of point and line geoms (chart objects) for lengthy data series. It also allows users to specify multiple data series at a time to create ‘small multiple’ SPC charts for simultaneous analysis.

    The project provided an excellent challenge in developing my Shiny skills, and provided an opportunity to utilise the visually impressive and professional appearance of the ShinyDashboard package. Development of this Shiny app also led to a challenging project of setting up a Linux based Shiny server, to allow hosting of the app for colleagues to use.

    A key advantage of this function-based approach is that the SPC methodology is now available for use by all analysts within the Department, and can be implemented with a minimum of coding. One of the primary difficulties with SQL based SPC logic encountered by our team was the length of code required to produce the chart data, and therefore the increased risk of error when recycling this code for different reports. The simplicity and self-contained nature of the SPC function avoids this.

    Having successfully tested and embedded the SPC function within an ad-hoc SPC wizard, I have continued to develop a Shiny Performance Dashboard for Cancer KPIs. This rapidly produces SPC charting for 2-Week-Wait Referral and 62-Day Cancer Treatment metrics from live data pulled from our SQL warehouse. I hope this will be the first of many dashboards to take advantage of an easily available and consistent SPC methodology, allowing our Department to create reports and dashboards which are better able to communicate the nature of changing time series to Trust decision-makers, and to track and evaluate the impact of operational management decisions.

    Despite the (at times steep!) learning curve involved, from creating the initial function and replicating NHSI’s SPC logic, to setting up the Shiny server and deploying apps for use, this project has been an excellent way to develop my R skills and to demonstrate the value in embedding use of R within our organisation, and making it part of our toolkit for ‘business as usual’ analysis.

    I hope that next steps for this project will be sharing our methodology with other NHS organisations, to allow further input and development of the methodology and reporting applications. Recently there have been discussions around a collaboration with other NHS Trusts and the Strategy Unit, regarding the possibility of developing an SPC package and shiny app to be available to all NHS organisations. If you would like to learn more or take part in the discussions, please join us on the NHS-R community slack channel (nhsrcommunity.slack.com) and let us know your thoughts on an SPC package, and what you might want to see as part of it!


[1] For those not familiar with the Making Data Count resources, the SPC tool is based around a moving average measurement of sigma and significant variations in data based on this value. These include the identification of any data points above or below three sigma; sequences of consecutive data points above/below the mean; runs of consecutively increasing/decreasing data points; and two out of three data points greater than (or less than) two sigma.
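As an illustration of how one of these rules might be checked in R (a sketch only, not NHSI’s or the author’s code), a run of consecutive points on the same side of the mean can be picked out with rle():

# Sketch only: flag runs of `run_length` or more consecutive points on the
# same side of the mean
runs_rule <- function(values, run_length = 7) {
  above <- values > mean(values)  # which side of the mean each point sits
  runs  <- rle(above)             # lengths of consecutive same-side runs
  rep(runs$lengths >= run_length, runs$lengths)
}

runs_rule(c(1, 2, 1, 8, 9, 8, 9, 8, 9, 8, 2))
# FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE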

  • Don’t Repeat Yourself!

    Writing functions to reduce repetition and improve productivity

    One of the greatest benefits of using R over spreadsheets is that it’s very easy to re-use and repurpose code, for example if we need to produce the same chart over and over again, but for different cuts of the data.

    Let’s imagine that we are trying to create a plot for arrivals to A&E departments using the ae_attendances dataset from the NHSRdatasets package.

    Creating our first plot

    First we want to create a plot for all of England’s A&E departments over the last 3 financial years.

    library(tidyverse)
    library(NHSRdatasets)
    
    ae_attendances %>%
      group_by(period) %>%
      # summarise at is a shorthand way of writing something like
      #   summarise(column = function(column))
      # first you specify the columns (one or more) in the vars() function (short
      # for variables), followed by the function that you want to use. You can
      # then add any additional arguments to the function, like below I pass
      # na.rm = TRUE to the sum function.
      summarise_at(vars(attendances), sum, na.rm = TRUE) %>%
      ggplot(aes(period, attendances)) +
      geom_point() +
      geom_line() +
      scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
      labs(x = "Month of Attendance",
           y = "Number of Attendances",
           title = "Attendances to A&E Departments by Month",
           subtitle = "All A&E departments in England")

    Creating a second plot

    Now, what if we wanted to run this for just a single trust? We could copy and paste the code, then add in a filter to a specific trust.

    # of course, you would usually more specifically choose which organisation we
    # are interested in! Selecting the first organisation for illustrative purposes.
    # The pull function grabs just the one column from a data frame, we then use
    # head(1) to select just the first row of data, and finally ensure that we
    # convert this column from a factor to a character
    first_org_code <- ae_attendances %>%
      pull(org_code) %>%
      head(1) %>%
      as.character()
    
    ae_attendances %>%
      filter(org_code == first_org_code) %>%
      group_by(period) %>%
      summarise_at(vars(attendances), sum) %>%
      ggplot(aes(period, attendances)) +
      geom_point() +
      geom_line() +
      scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
      labs(x = "Month of Attendance",
           y = "Number of Attendances",
           title = "Attendances to A&E Departments by Month",
           subtitle = paste("org_code =", first_org_code))

    So, what changed between our first plot and the second? Well, we’ve added a line to filter the data, and changed the subtitle, but that’s it. The rest of the code is repeated.

    Creating yet another copy of the first plot

    Let’s say we want to run this code again and create a plot for another organisation. So again, let’s copy and paste.

But perhaps at this point we also decide that we want the labels on the y-axis to use comma number formatting, we want to change the dots and lines to bars, and we want to colour the bars in NHS Blue.

    # the scales package has nice functions for neatly formatting chart axes
    library(scales)
    
    # again, just selecting an organisation for illustrative purposes only.
    # This time, we use tail instead of head to select the final row
    second_org_code <- ae_attendances %>%
      pull(org_code) %>%
      tail(1) %>%
      as.character()
    
    ae_attendances %>%
      filter(org_code == second_org_code) %>%
      group_by(period) %>%
      summarise_at(vars(attendances), sum) %>%
      ggplot(aes(period, attendances)) +
      geom_col(fill = "#005EB8") +
      scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
      scale_y_continuous(labels = comma) +
      labs(x = "Month of Attendance",
           y = "Number of Attendances",
           title = "Attendances to A&E Departments by Month",
           subtitle = paste("org_code =", second_org_code))

Now, we want to go back and change the rest of the plots to have the same look and feel. Well, you will have to go back up and change those plots individually. When there are just 3 plots, so what? It’s easy enough to go back and change those!

But what if it’s 300 plots? Or, what if those 3 plots are in 3 different places in a very large report? What if those 3 plots are in separate reports? What if it wasn’t just a handful of lines of code we are adding but lots of lines?

    Creating functions

This is where we should start to think about extracting the shared logic between the different plots into a function. This is sometimes called “DRY” for “Don’t Repeat Yourself”. Where possible we should aim to eliminate duplication in our code.

    In R it’s pretty simple to create a function. Here’s a really simple example:

    my_first_function <- function(x) {
      y <- 3*x
      y + 1
    }

    This creates a function called my_first_function: you assign functions just like any other variable in R by using the <- assignment operator. You then type the keyword function which is immediately followed by a pair of parentheses. Inside the parentheses you can name “arguments” that the function takes (zero or more), then finally a set of curly brackets, { and }, which contain the code you want to execute (the function’s body).

The function’s body can contain one or more lines of code. Whatever line of code is executed last is what is returned by the function. In the example above, we first create a new variable called y, but we return the value of y+1.

    The values that we create inside our function (in this case, y) only exist within the function, and they only exist when the function is called (so subsequent calls of the function don’t see previous values).

    We can then simply use our function like so:

    my_first_function(3)

    Which should show the value “10” in the console.

    Converting our plot code to a function

    The first thing we should look to do is see what parts of the code above are identical, which parts are similar but change slightly between calls, and which parts are completely different.

    For example, in our plot above, each example uses the same data summarisation, and the same call to ggplot. We slightly changed how we were displaying our charts (we started off with geom_point and geom_line, but changed to geom_col in the third plot). Let’s go with the chart used in the third version as our base plot.

The subtitles differ slightly between the 3 plots, but we could extract this to be an argument to the function. So my first attempt at converting this plot to a function might be:

    ae_plot <- function(data, subtitle) {
      data %>%
        group_by(period) %>%
        summarise_at(vars(attendances), sum) %>%
        ggplot(aes(period, attendances)) +
        geom_col(fill = "#005EB8") +
        scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
        scale_y_continuous(labels = comma) +
        labs(x = "Month of Attendance",
             y = "Number of Attendances",
             title = "Attendances to A&E Departments by Month",
             subtitle = subtitle)
    }

    We can now create our first 3 plots as before:

    ae_plot(ae_attendances, "All A&E departments in England")
    # as ae_plot's first argument is the data, we can use the %>% operator to pass in the data like so:
    ae_attendances %>%
       filter(org_code == first_org_code) %>%
       ae_plot(paste("org_code =", first_org_code))
    ae_attendances %>%
       filter(org_code == second_org_code) %>%
       ae_plot(paste("org_code =", second_org_code))

Now, we’ve managed to remove most of the duplication in our code! If we decide we no longer like the blue bars we can easily change the function, or if we want to switch back to a points-and-lines chart instead of the bar chart we only have to update the code once; when we re-run our code all of the plots will change.

Of course, this leads to its own problems: what if we want 3 charts to be blue but one to be red? We could either add a colour argument to the function (as sketched below), or we could remove the logic which adds the geoms to the chart but keep everything else: then we could just add the coloured layer on at the end (or create a red function and a blue function; each would first call the main function before doing its own styling).
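For example, one possible way to add that colour argument (keeping NHS Blue as the default) would be:

ae_plot <- function(data, subtitle, fill = "#005EB8") {
  data %>%
    group_by(period) %>%
    summarise_at(vars(attendances), sum) %>%
    ggplot(aes(period, attendances)) +
    geom_col(fill = fill) +
    scale_x_date(date_breaks = "6 months", date_labels = "%b-%y") +
    scale_y_continuous(labels = comma) +
    labs(x = "Month of Attendance",
         y = "Number of Attendances",
         title = "Attendances to A&E Departments by Month",
         subtitle = subtitle)
}

# blue by default, red on request
ae_plot(ae_attendances, "All A&E departments in England")
ae_plot(ae_attendances, "All A&E departments in England", fill = "red")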

    In Summary

    Functions allow us to group together sections of code that are easy to reuse, they make our code easier to maintain, because we only have to update code in one place, and they reduce errors by limiting the amount of code we have.

    Any time you see yourself copying and pasting code try to remember, Don’t Repeat Yourself!

    Further Reading

This file was generated using R Markdown; you can grab the .Rmd file here.

Hopefully this has been a useful introduction to functions. If you are interested in learning more, the R4DS book has an excellent chapter on functions.

    Once you have mastered writing functions then you might want to read up on tidyeval: this allows you to write functions like you find in the tidyverse where you can specify the names of columns in dataframes.
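As a flavour of what tidyeval makes possible, here is a small sketch (the function name is invented for illustration) using the {{ }} “embrace” operator so that callers can pass bare column names, just as they do with dplyr verbs:

# Requires a recent dplyr/rlang for the {{ }} operator
summarise_by_period <- function(data, value_col) {
  data %>%
    group_by(period) %>%
    summarise(total = sum({{ value_col }}, na.rm = TRUE))
}

# pass the column name unquoted, just as you would in dplyr itself
ae_attendances %>% summarise_by_period(attendances)
ae_attendances %>% summarise_by_period(breaches)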

    You may also want to have a go at object orientated programming, which is covered in the Advanced R book.

  • How NHS-R Community do The Apprentice…

    By Zoe Turner

One of the tasks on The Apprentice a number of years ago was for the contestants to put on a corporate event, at no small cost to the people attending, I might add. It’s a tale often told because one of the contestants was gluten free and no one had accounted for dietary needs amongst the contestants, so the poor ‘gluten free lady’, as she was known, was served a fruit salad.

The reason I’m apparently going off on a tangent so early in a blog is that it struck me that The Apprentice is all about throwing people in at the deep end and seeing how they cope. It’s entertainment, but it clashes with the premise that these are potential apprentices to a ‘British business magnate’ (as Wikipedia calls him). Contrast this with NHS-R and how I came to be attending the Train the Trainer event at the end of 2019 and then helped to run the first of 3 local courses this January, having only just started learning R around 2 years ago.

    Analysts have many expectations made of them. They have to be technical, able to interpret data and communicate themselves clearly to non-analysts. Very rarely though will an analyst be expected to train others. Some may produce or present some training to support or mentor fellow analysts, and even then my experience has always been on the receiving end. Coupled with the fact that I’ve never really had a burning desire to teach, it was a surprise to find myself on a course on how to deliver the NHS-R ‘Introduction to R’ workshop.

The reason I did it is that my involvement with NHS-R has led to this natural consequence of training others. I started with attending the course myself, then presented at the conference and facilitated an Introduction Course run by NHS-R but hosted by my Trust. I then didn’t hesitate in agreeing to take on the training.

    NHS-R Community held their first two-day Train the Trainer event in Birmingham organised through AphA (Association of Professional Healthcare Analysts). I was supported to go on this by my manager, Chris Beeley, who is a huge advocate of R and Shiny. Whilst he himself has run several workshops over the years I, notably, have run zero!

    At the TtT (got to get an acronym in there) I had the opportunity to meet lots of motivated people from around the British Isles who were as keen as I was, not only to learn how to teach R but also to talk about data – that happened quite a bit in the breaks. We had an opportunity to present to each other, and that was useful as I learn especially from watching others. Everyone has their own style and it gets perfected over time but I was hugely impressed by how engaging people were and how quickly they could read about a subject that was new to them (we looked at the RStudio presentation slides https://education.rstudio.com/teach/materials/) and then go on to express clearly what they’d just learned.

    I could go on about what I learned at the course, but the proof of its value is in what I did with it. And so on 17th January, Chris and I held a workshop for 15 people from various organisations; some had travelled as far as London and Portsmouth, such is the commitment to learn R. Chris led the workshop and I did the live coding/support role which took the edge off that ‘first time’ feeling.

    This idea of having at least two people in a workshop is a good one, even when the trainer is very experienced. Chris, for example, is very used to running workshops alone, but inevitably people get stuck or things don’t work as they should and so I did the helping while Chris moved between training and coding. It felt, to me at least, like the burden was shared. It helped to ensure that no-one in the group was allowed to fall behind so far that they just gave up.

    Chris and I had gone through the slides beforehand as he’d not gone on the TtT course, and having not written them himself, I wanted to make sure he’d know what was covered. What reassured me was that, as he presented the course, there wasn’t a part of it that I didn’t understand myself and couldn’t cover if I had to take over at that moment. And so the day wasn’t as nerve-racking as I anticipated, and I had fun – which must have been noticeable to the group, as I had an email commenting on how we are clearly a happy team!

    Whilst I haven’t actually run a workshop, I think the process I’ve gone through to get to this point has certainly built up my confidence to do it. I’ve taken every opportunity NHS-R community has offered, from doing the introduction to presenting at the conference, and so this next thing – to run the training myself – hasn’t been so scary after all. I feel like a happy and well-supported apprentice of NHS-R, and the great thing about NHS-R in this is that everyone can be an apprentice too – you just have to get involved.

As they are open source, all the slides for the trainer course and the introduction course are available on GitHub:

    Train the Trainer course materials can be found at:

    https://github.com/nhs-r-community/train_the_trainer

    Course materials for the underlying Intro to R course are found at:

    https://github.com/nhs-r-community/intro_r

    Zoë Turner, Senior Information Analyst, Nottinghamshire Healthcare NHS Foundation Trust.

    @Letxuga007 and curator of @DataScienceNotts (formerly known as @AppliedInfoNotts)

  • Forecasting R

By Bahman Rostami-Tabar

    Democratising forecasting project

This initiative, sponsored by the International Institute of Forecasters, provides cutting-edge training in the use of forecasting with R in developing countries. There is no doubt that many people in developing countries cannot afford the high fees to attend forecasting workshops. The project aims to make such training accessible to them, provide up-to-date training on the principles of forecasting, and create a network of forecasters to conduct research on forecasting with social impact for less developed countries. The training has been delivered in Tunisia, Iraq, Senegal, Uganda, Nigeria, Turkey and Indonesia so far.

We are expanding the “Democratising Forecasting” initiative to deliver four forecasting workshops per year for the National Health Service (NHS) in the UK, in collaboration with the NHS-R Community. We give our time for free, despite the debate about whether such organisations can afford to pay for this kind of training.

Why forecasting for the NHS

Ensuring life-saving services are delivered to those who need them requires far more than money, infrastructure and scientific progress. Accurate modelling and forecasting systems can assist critical decisions that drive policy making, funding, research and development in the NHS. Decision makers are making decisions every day, with or without forecasts, but those decisions are more robust and well-informed in the light of what could happen in the future, and that is where forecasting becomes crucial. However, implementing forecasting principles to support the decision-making process requires significant technical expertise. To that end, we aim to organise a two-day workshop on forecasting to teach participants how to apply principles of accurate forecasting using real data in healthcare.

We are offering the following workshops in 2020:

    • Monday 24th & Tuesday 25th February 2020 – Nottinghamshire – Workshop fully booked
• Monday 22nd June 2020 – Huddersfield – Online (https://www.eventbrite.co.uk) – workshop is fully booked
    • Thursday 13th & Friday 14th August 2020 – Gloucestershire
    • Wednesday 13th & Thursday 14th October 2020 – West Midlands
    • Thursday 12th & Friday 13th November 2020 – Cardiff

    Workshop instructor

Dr. Bahman Rostami-Tabar is the main instructor for the workshops. Colleagues from other universities in the UK might join him occasionally. Bahman is a Senior Lecturer (Associate Professor) in management science at Cardiff Business School, Cardiff University, UK. He is interested in the use and implications of forecasting in areas of social good such as health and humanitarian operations. He has been involved in various forecasting-related projects with NHS Wales, the Welsh Ambulance Service Trust and the International Committee of the Red Cross. He is currently working on projects that focus on developing innovative forecasting methodologies and their links to decision making in health and humanitarian operations.

    Who should attend

This workshop is for you if you are:

• a decision maker who wants to use forecasting tools and techniques in R to empower decision making;
• a data analyst who wants to gain an in-depth understanding of the forecasting process;
• a forecaster who wants to learn how to use R for forecasting purposes.

    What participants will learn in the workshop

Assuming basic knowledge of statistics, participants will be able to do the following tasks when using forecasting to support a decision-making process in the real world:

• Determine what to forecast according to a forecasting process framework;
• Prepare and manipulate data using functions in base R and the tidyverse and lubridate packages;
• Identify systematic patterns using the time series toolbox in ggplot2 and forecasting-related packages such as forecast;
• Produce point forecasts and prediction intervals using functions in forecasting-related packages such as forecast and fable, as well as user-defined functions;
• Determine the accuracy of forecasting models using statistical accuracy measures for point forecasts and prediction intervals;
• Visualise, export and report results for interpretation and insights using R Markdown.

Requirements:

• Basic knowledge of statistics is assumed, e.g. “I know what a normal distribution is”;
• Basic knowledge of R is assumed, e.g. “I know what data types and data structures are”, “I know how to use a function”;
• No knowledge of forecasting is assumed.

    Program

Start: 9:30 a.m.

End: 4:30 p.m.

Refreshment breaks:

• Morning: 11:00 – 11:20 a.m.
• Afternoon: 3:00 – 3:20 p.m.

Lunch: 12:30 – 1:30 p.m.

    Webinar:

    Title: Essentials to do forecasting using R

    Date: Two weeks before the workshop (TBC)

    Day 1

1.1. Forecasting and decision making: What is forecasting? How is the forecasting task different from other modelling tasks? What is the link between forecasting and decision making? How do we identify what to forecast?

1.2. Data preparation and manipulation: How do we prepare data for the forecasting task? How do we clean data? How do we manipulate data to extract time series?

1.3. Time series patterns and decomposition: What in the data could be used for the forecasting task? How do we detect systematic patterns in the data? How do we separate out non-systematic patterns?

1.4. Forecaster’s toolbox: How do we use time series graphics to identify patterns? What is a forecasting benchmark? What simple forecasting methods could be used as benchmarks? How do we generate point forecasts and prediction intervals using simple forecasting methods?

1.5. Forecast accuracy evaluation: How do we know if a forecasting method captures the systematic patterns available in the data? How do we judge whether a forecast is accurate or not? How do we evaluate the accuracy of point forecasts and prediction intervals? Why do we need to distinguish between fitting and forecasting?

1.6. Exponential smoothing models: What is the exponential smoothing family? What models are available in this family? What is captured by this family? How do we generate point forecasts and prediction intervals using exponential smoothing models?
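As a small taste of what this looks like in R (a sketch using a built-in dataset, not the workshop materials), the forecast package will fit an exponential smoothing model and produce point forecasts and prediction intervals in a few lines:

library(forecast)

fit <- ets(AirPassengers)    # automatically selects an exponential smoothing (ETS) model
fc  <- forecast(fit, h = 12) # 12-step-ahead point forecasts and prediction intervals
autoplot(fc)                 # plot the forecasts with their intervals
accuracy(fit)                # in-sample accuracy measures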

    Day 2

2.1. ARIMA models: This is another important family of forecasting models. What is the ARIMA framework? What models are available in this family? What is captured by this family? How do we generate point forecasts and prediction intervals using ARIMA models?

2.2. Regression: We also look at causal techniques that consider external variables. What is the difference between regression, exponential smoothing and ARIMA? How do we build a simple regression model?

2.3. Special events: In addition to using systematic patterns in time series, we discuss how to include deterministic future special events, such as holidays and festive days, in models.

2.4. Forecasting by aggregation: How do we forecast in a situation where we have a high-frequency time series (e.g. daily) but need a low-frequency forecast (e.g. monthly)? How do we generate forecasts for items with a hierarchical or grouped time series structure?

2.5. Forecasting by combination: How do we use an ensemble of forecasting approaches? Under which conditions are ensemble forecasts better than individual methods?

2.6. Forecasting for many time series: What is the best approach to forecasting many time series? How do we classify items for forecasting? How do we forecast them? How do we report the accuracy?

    Reference:

    Workshop Booklet:

• Materials will be provided for the workshop in R Markdown.

    Books:

    Bahman Rostami-Tabar

    Associate Professor in Management Science, Cardiff University, UK

    www.bahmanrt.com

  • NHS-R Community Conference II

    My journey to work takes me about an hour and a half, and I catch a couple of buses with Wi-Fi which means I can browse Twitter and invariably end up with hundreds of tabs open as I flit between articles and blogs. Most mornings I find it hard to concentrate on reading through entire articles, especially the really long ones, so I leave the tab open on my computer, often for days, before reading them. Given my experience of reading blogs, why would anyone want to read through mine about the NHS-R Community conference?

    If I’d gone to the conference I’d probably skim that paragraph thinking ‘yes, I went, I know how good it was’.

    If I’d not gone to the conference I’d probably skim that paragraph because I might prefer not to know just how great a conference was when I’d missed it!

    Even though the conference was moved to a bigger location to accommodate more people and around 250 people attended, I have still spoken to people who didn’t get a ticket or missed submitting an abstract to speak. People who never made the conference are talking about an event that is only in its 2nd year. What is going on? What is it that has made the event so successful?

    Organising an event of any size takes a lot of work and that is often overlooked. There were the core people who did the real work – the arrangements – and quite frankly, they made it look easy, which itself is an indication of how hard they worked. But there were others who were part of a committee that chipped in with bits they could help with: setting up a specific email; reading through abstracts; suggesting things the organisers might consider, like how to ensure diversity of questioners (https://oxfamblogs.org/fp2p/how-to-stop-men-asking-all-the-questions-in-seminars-its-really-easy/).

    That organising committee was made up from a group who have shown a particular interest in R, and as such I found myself part of that group. Now although I have submitted a few blogs to NHS-R, I only really started using R a couple of years ago. Deep down I’m still a SQL analyst and my contributions to the conference were pretty minimal, but I feel encouraged to make those small contributions (even that last one about who gets to ask the first question in seminars) and each small involvement builds up to a bigger thing. This really is feeling like an equal and inclusive group and that’s where I think this success is coming from.

It may have been by design or it may be a happy accident, but there is a crucial clue in the name of this group that gives away its success – Community. This conference wasn’t managed top-down. There are some key people, of course, but they are as much a part of this Community as the people who contribute to the blogs, those that stood up on stage and showed their work, and those that will be learning to run the R Introduction training. This is our NHS-R Community.

    If you missed this year’s conference and want to go to the next one, get involved. The more people involved, the less work there is for everyone individually. Plus, given that tickets this year ran out in just 2 hours, you’ll be more likely to secure yourself a ticket.

    Speaking of which, provisional dates for the next conference are the 2nd and 3rd November 2020 (Birmingham). Now aren’t you glad you read this blog!

    Zoë Turner, Senior Information Analyst @AppliedInfoNott @Letxuga007

  • NHS Number Validation

    NHS Numbers – the backbone of the NHS

NHS Numbers are the best. They are numbers that follow you through life, from birth to death, changing only in certain situations, and they are easy to check for validity from just the numbers themselves.

    What? You can check to see if a number is a genuine NHS Number without any other information?!

    Now, I’ve worked in the NHS for more years than I care to count and I never realised there was an algorithm to these numbers. I stumbled across this fact through Twitter of all places. Yes, Twitter, the place you’ll find NHS-R enthusiasts tweeting things like this:

    https://twitter.com/sellorm/status/1171858506057682944 which I saw retweeted by @HighlandR (John MacKintosh).

I’ve only just started doing things on GitHub, but @Sellorm wasn’t very close in saying there may be millions of packages: in fact there are only 2 for R and, surprisingly, not a single one in SQL.

I installed the package, had a quick play around and looked at the code on GitHub. The downside of this package, for me, was that you need to feed in your data using vectors, and I like using data tables. Plus, I didn’t necessarily want a list of outputs like TRUE, TRUE, FALSE, TRUE; I wanted to see the NHS Numbers that aren’t valid. Still, I wouldn’t have got so far, so quickly, without @Sellorm’s code, excellent notes and thoroughly written Readme file, so the moral of the story is: even if a package doesn’t work for your task or problem, you may still be able to use parts of its code.

    The Readme on https://github.com/sellorm/nhsnumber, which you can see when you scroll down past all of the file directory looking things, included a link to the wiki page on NHS Numbers and its algorithm check https://en.Wikipedia.org/wiki/NHS_number. Again, I had no idea this existed!

Wikipedia, like R, is open source, and one of the links was out of date. I’d found a really useful document on NHS Numbers a while back https://www.closer.ac.uk/wp-content/uploads/CLOSER-NHS-ID-Resource-Report-Apr2018.pdf, so, in the spirit of open source, I updated the Wikipedia page. Proud moment!

    The algorithm is quite simple and another package on GitHub https://github.com/samellisq/nhsnumbergenerator generates numbers using it but I didn’t spend too long on these packages as I decided to do my own script, nothing fancy, no loops, functions or packages…

    library(tidyverse)
    
    Validity <- df %>% 
      mutate(length = nchar(NHSNumber),
             A = as.numeric(substring(NHSNumber,1,1)) *10,
             B = as.numeric(substring(NHSNumber,2,2)) *9,
             C = as.numeric(substring(NHSNumber,3,3)) *8,
             D = as.numeric(substring(NHSNumber,4,4)) *7,
             E = as.numeric(substring(NHSNumber,5,5)) *6,
             G = as.numeric(substring(NHSNumber,6,6)) *5,
             H = as.numeric(substring(NHSNumber,7,7)) *4,
             I = as.numeric(substring(NHSNumber,8,8)) *3,
             J = as.numeric(substring(NHSNumber,9,9)) *2,
             End = as.numeric(substring(NHSNumber,10,10)),
             Total = A+B+C+D+E+G+H+I+J,
             Remainder = Total %% 11,
             Digit = 11- Remainder,
             Summary = case_when(Digit == 10 ~ 999,
                                 Digit == 11 ~ 0,
                                 TRUE ~ Digit),
             Valid = case_when(Summary == End & length == 10 ~ TRUE,
                               TRUE ~ FALSE
                               )) %>% 
      filter(Valid == FALSE)
    

    Importing data

Data is imported into R via Excel or, in my case, via a SQL connection, which worked better as a couple of our systems hold many hundreds of thousands of NHS Numbers and I wanted to check them all. Excel might have got a bit of indigestion from that. Also, it meant I wasn’t storing NHS Numbers anywhere to then run through R. My systems are secure, but it’s always worrying having such a large amount of sensitive data in one place and outside an even more secure system like a clinical database.

    Data doesn’t always import into R the same way and for mine I needed to remove NULLs and make the NHSNumber column, which was a factor, into a character and then numeric format:

    df <- data %>% 
      filter(!is.na(NHSNumber)) %>% 
      mutate(NHSNumber = as.numeric(as.character(NHSNumber)))
    

    Factors

Factors are a new thing for me but, as I understand it, they put data into groups, but in the background. For example, if you had male, female and other, these would be 3 factors, and if you joined that to another data set which doesn’t happen to have “other” as a category, the factor would still linger around in the data, appearing in a count as 0, as the table still has the factor information but no data in that group.

    To manipulate factor data I’ve seen others change the format to character and so that’s what I did. I’ve done similar things in SQL with integers and dates; sometimes you have to change the data to an ‘intermediate’ format.

    Validity code – explanation

    This code uses dplyr from tidyverse to add columns:

    Validity <- df %>% 
      mutate(length = nchar(NHSNumber), 
    

    An NHS number must be 10 characters and nchar() reminds me of LEN() in SQL which is what I would use to check the length of data.

    One thing I didn’t code, but I guess people may want to check, is that all characters are numeric and no letters have been introduced erroneously. That’s something to consider.
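One possible way to add that check (an assumption, not part of the script above) is a regular expression applied to the raw character column before it is converted to numeric:

# Sketch only: flag anything that is not exactly ten digits
data %>%
  mutate(NHSNumber = as.character(NHSNumber)) %>%
  filter(!grepl("^[0-9]{10}$", NHSNumber))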

         A = as.numeric(substring(NHSNumber,1,1)) *10,
         B = as.numeric(substring(NHSNumber,2,2)) *9,
         C = as.numeric(substring(NHSNumber,3,3)) *8,
         D = as.numeric(substring(NHSNumber,4,4)) *7,
         E = as.numeric(substring(NHSNumber,5,5)) *6,
    

    I didn’t use F as it’s short for FALSE which changed the meaning and colour of the letter. It can be used as a column name but I missed it out for aesthetics of the script!

         G = as.numeric(substring(NHSNumber,6,6)) *5,
         H = as.numeric(substring(NHSNumber,7,7)) *4,
         I = as.numeric(substring(NHSNumber,8,8)) *3,
         J = as.numeric(substring(NHSNumber,9,9)) *2,
    

    There is possibly a much smarter way of writing this, perhaps a loop. Samuel Ellis’ package nhsnumbergenerator creates a table using sequences:

    checkdigit_weights = data.frame(digit.position=seq(from=1, to = 9, by =1),
                                    weight=seq(from=10, to = 2, by =-1)
    )
    

    note he uses = rather than <-

I like this code. It’s very concise, but it was just easier writing out the multiplications for each part of the NHS Number: the first number is multiplied by 10, the second by 9, the third by 8 and so on.
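For what it’s worth, one less repetitive alternative (a sketch only, not the script used in this post) is to multiply digits 1 to 9 by the weights 10 down to 2 in a single step:

weights <- 10:2

# Weighted total for a single NHS Number supplied as a character string
weighted_total <- function(nhs_number) {
  digits <- as.numeric(strsplit(substr(nhs_number, 1, 9), "")[[1]])
  sum(digits * weights)
}

weighted_total("9434765919")  # an example (dummy) NHS Number; returns 299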

         End = as.numeric(substring(NHSNumber,10,10)),
    

    I put this in as the final number in the sequence is the check for the later Digit (also called checksum).

         Total = A+B+C+D+E+G+H+I+J,
    

    Just adding all the multiplied columns together.

         Remainder = Total %% 11,
    

    This gives the remainder from a division by 11.

         Digit = 11- Remainder,
    

    11 take away the remainder/checksum.

         Summary = case_when(Digit == 10 ~ 999,
                             Digit == 11 ~ 0,
                             TRUE ~ Digit),
    

    Lots of people use if(else) in R but I like case_when because it’s like SQL. The lines run in logical order:

    • when the digit = 10 then it’s invalid so I put in 999 as that’s so high (in the hundreds) it shouldn’t match the Digit/Checksum (which is a unit),
    • if it’s 11 then change that to 0 following the methodology,
    • else just use the digit number.

    Actually, thinking about it I probably don’t need the 10 becomes 999 as 10 could never match a single unit number. Perhaps that’s redundant code.

         Valid = case_when(Summary == End & length == 10 ~ TRUE,
                           TRUE ~ FALSE
                           )) %>% 
    

    Case_when again but this time to get the TRUE/FALSE validity. If the number generated is the same as the last digit of the NHS Number AND the length of the NHS Number is 10 then TRUE, else FALSE.

I liked this line of code as it felt a bit strange saying TRUE then FALSE, but it’s logical!

      filter(Valid == FALSE)
    

    Just bring back what isn’t valid.

    Did it work?

I think it did. If I got anything wrong I’d love to get feedback, but I ran many hundreds of thousands of NHS Numbers through it and found…..

    No invalid numbers

I possibly should have checked beforehand, but I suspect our clinical systems don’t allow any incorrect NHS Numbers to be entered at source. Still, it was fun and could be applied to manually entered data kept in spreadsheets, for example.

    An interesting blog

As there were no SQL scripts on GitHub for NHS Number validation, I did a quick search on the internet and found this: https://healthanalyst.wordpress.com/2011/08/21/nhs-number-validation/ which is a blog by @HealthAnalystUK from 2011. The reason I’m referring to it here is that HealthAnalystUK not only shared SQL code that looked very similar to the code above, but also R code, written as a function:

NHSvalidation <- function(NHSnumber){

  NHSlength <- length(NHSnumber)

  A <- as.numeric(substr(NHSnumber,1,1))
  B <- as.numeric(substr(NHSnumber,2,2))
  C <- as.numeric(substr(NHSnumber,3,3))
  D <- as.numeric(substr(NHSnumber,4,4))
  E <- as.numeric(substr(NHSnumber,5,5))
  F <- as.numeric(substr(NHSnumber,6,6))
  G <- as.numeric(substr(NHSnumber,7,7))
  H <- as.numeric(substr(NHSnumber,8,8))
  I <- as.numeric(substr(NHSnumber,9,9))
  J <- as.numeric(substr(NHSnumber,10,10))

  if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))
  {UniformNumberCheck <- 1}
  else
  {UniformNumberCheck <- 0}

  Modulus <- ((A*10)+(B*9)+(C*8)+(D*7)+(E*6)+(F*5)+(G*4)+(H*3)+(I*2))
  Modulus <- (11 - (Modulus%%11))

  if (((Modulus==J) & (UniformNumberCheck!=1) & (NHSlength==10))|((Modulus==11) & (J==0) & (UniformNumberCheck!=1) & (NHSlength==10)))
  {ReturnValue <- 1}
  else
  {ReturnValue <- 0}

  return(ReturnValue)
}
    

    I hadn’t coded the check for repeating numbers:

    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))

    and I couldn’t find any reference to this in the Wiki page or the document from the University of Bristol so I’m unsure if this is a part of the methodology. If it is, then I’ve seen at least 1 NHS Number that would fail this test.

    A conclusion

    If anyone has created a package or script for NHS number checks and wants to share please feel free to write a blog. NHS-R Community also has a GitHub repository at https://github.com/nhs-r-community where code like this blog can go (I wrote this in Rmarkdown).

    Blogs can be emailed to nhs.rcommunity@nhs.net and get checked by a group of enthusiast volunteers for publishing.

This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • How R changed me as an analyst

    I suspect there are many blogs about R and why it’s so great:

    • It’s free!
    • It’s open source!
    • There’s a great community!
    • It’s reproducible!

You can certainly read on Twitter (#rstats) about what R can do for you, but what about what R does to you, particularly as an analyst in the NHS or social care?

    The learning bit…

    Back in April 2018 when NHS-R Community was running its first introduction to R course in Leeds, my only knowledge of it had come from a free online base R course with edX that I hadn’t really understood and never finished. Online learning for me is like being back at school. I get the right answer, feel good, move on and promptly forget it all. After the NHS-R course I dabbled with the dplyr package, tried to run lots of other people’s scripts and generally failed a lot. It was a frustrating time of slow delivery of tasks and bafflement. When things did work, I had no idea why and I often wondered what all the fuss was about particularly as I could do the same things in familiar programmes.

    Hindsight is a wonderful thing and I can now see my frustrations weren’t just one lump of confusion, but could be split down into how I used these ‘familiar programmes’, namely:

    • SQL for data engineering and
    • Excel for visualisations.

Although I still used (and use) SQL to get my data, I was copying it to Excel and then loading it into R; once loaded, I’d then realise I needed to group by and count or remove something I didn’t need, and it seemed too long-winded to go back to SQL, copy to Excel and load it all over again.

    The second frustration of visualising in R came with the inability to replicate the look of the Excel charts in R: getting the same colours, the same font size headers and so on. I’ve yet to resolve that completely but it was here that I realised the worth of R wasn’t in making it look like Excel, but rather that it could do so much more than Excel. I needed to start thinking about what I should be visualising and how to do it.  

    Sharing methodology

    Over my time in the NHS I have learned to be cautious – protective even – of data. But that has led to the misguided fear of sharing technical knowledge which was never a conscious thing, just that’s how it is. However, R has a reputation of sharing which has resulted in an expectation of sharing. And that isn’t just within your own team or organisation, it’s so much wider – it can even be the world.

    As an example of why it’s harder to share Excel methodology, I’ve previously built a benchmarking Excel spreadsheet using MATCH and INDEX so that the bar charts automatically coloured the organisation I worked for and ordered the bars in graphs from the greatest to the smallest. It was one of those tasks that took a lot of effort to automate, looked simple when done but was heavily dependent on the data being in just the right cell or it would break.

    Just updating it with a new year’s data would take great care so the thought of writing out the methodology to share never occurred to me. Writing it out would involve describing the positions of data, what the formulae did and how bits all linked. That’s a laborious task which is not necessary if you don’t plan to share – and as there was no need to share, I didn’t.

    It’s all about the data

    The entire process, from SQL to Excel, is about the data, e.g. how it joins and what it’s counting. To get the data ‘just so’, it often requires so many clever solutions to so many problems that, as I now realise, it consumes so much thinking time that there’s often little energy left for considering why I am doing this – is it the best thing to do and how can I get more meaning from the data?

    If I’d picked up someone else’s script or Excel document on ordering benchmarking data, perhaps the time I spend would be on improving it instead of building it. In a perfect world, I would feed back on what I’d done and share any improvements or corrections.

    But what can R do that SQL and Excel can’t?

    As a very simple example, consider creating a median average.

    In Excel it’s reasonably easy to use the formula MEDIAN() but to make it dynamic (such that it automatically updates if you add more data), the formula becomes much more complicated. Here is a page explaining how it’s done:

    https://www.extendoffice.com/documents/excel/2611-excel-average-dynamic-range.html

Lots of graphics are used to describe how to do it, and you’ll note this is for AVERAGE, which is the mean rather than the median.

    In SQL, creating the MEDIAN can be solved various ways:

    https://stackoverflow.com/questions/1342898/function-to-calculate-median-in-sql-server

There are 204 examples to go through to solve this! I didn’t go through them all as that’s too much needless thinking when R can do this…

    median(table$column)

Something this simple removes all that otherwise necessary thinking to figure out the best way to get the median… and then having to check that it is right. Although that may be easy enough in Excel, I know I make mistakes and will have to repeat the exercise more times than I care to admit, and doing so in Excel will involve so many steps that each require checking. All of this uses up that precious resource of focused thinking. With R doing the job so quickly and reliably I now have time to consider whether the median is actually the right thing to use, particularly if it’s a case of “we’ve always done that”. Then I can ponder what it is telling me; is it increasing/decreasing and why is that? Is there any other data that could help explain this change?

    Like any great piece of analysis or coding, time invested at the beginning pays off. With the benchmarking example of reordered bar charts, spending a couple of days getting it right made something interactive that was engaging for the users. But Excel continues to require thinking and time checking whereas R doesn’t; once it’s coded, that’s it. And that’s where the analyst should be, that space after the work is done. That’s where you can think about the very analysis itself; was it enough? What does it tell you? What else can you look at? Although I’ve had analyst in my job title for a long time, this is what analysis is all about and it’s something, I now realise, that I’ve not been doing because my “thinking time” has been used up elsewhere.

This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • Dygraphs

    Dygraphs for mortality surveillance

    I recently presented some of the mortality surveillance charts we use to @RLadiesLondon (a very welcoming group!) and one that got some interest was a chart of Nottinghamshire Healthcare NHS Foundation Trust deaths compared to ONS Provisionally Registered deaths. The chart looks good because it’s interactive but this type of chart can be confusing because of the 2 y axes.

    When I show this report I make it clear that the two axes units are very different and that its purpose is to show that the pattern of the deaths in the wider population matches that of the deaths recorded by the Trust. It’s well known within Public Health that the pattern of deaths is seasonal, with a peak around January in the UK. However, this Public Health knowledge is not necessarily common knowledge in Secondary Care Trusts and it was one of the great benefits of having @IantheBee both create and present this report.

    Getting ONS Provisional Data

    I wrote about getting and formatting the spreadsheets from ONS for the East Midlands Provisionally Registered deaths:

but for the purposes of the mortality surveillance report I’ve used several years’ data, and I normally keep the spreadsheets, save the merged data and then load that each time I need to run the R Markdown report.

    For the purposes of this blog I’ve amended the formatting code by adding functions so each year can be transformed and is available to plot:

library(janitor)
library(readxl)
library(tidyverse)

# Download ONS spreadsheets function ----------------------------------------

GetData_function <- function(www, file){
  download.file(www,
                destfile = file,
                method = "wininet", # use "curl" for OS X / Linux, "wininet" for Windows
                mode = "wb")        # wb means "write binary"
}

# 2019
GetData_function("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2019/publishedweek282019.xls",
                 "DeathsDownload2019.xls")

# 2018
GetData_function("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2018/publishedweek522018withupdatedrespiratoryrow.xls",
                 "DeathsDownload2018.xls")

# 2017
GetData_function("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2017/publishedweek522017.xls",
                 "DeathsDownload2017.xls")

# 2016
GetData_function("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2016/publishedweek522016.xls",
                 "DeathsDownload2016.xls")

# Import correct sheets -----------------------------------------------------

Deaths_Now  <- read_excel("DeathsDownload2019.xls",
                          sheet = 4,
                          skip = 2)
Deaths_2018 <- read_excel("DeathsDownload2018.xls",
                          sheet = 4,
                          skip = 2)
Deaths_2017 <- read_excel("DeathsDownload2017.xls",
                          sheet = 4,
                          skip = 2)
Deaths_2016 <- read_excel("DeathsDownload2016.xls",
                          sheet = 4,
                          skip = 2)

# Look up code to remove excess rows ----------------------------------------

LookupList <- c("Week ended",
                "Total deaths, all ages",
                "Total deaths: average of corresponding",
                "E12000004")

# Function for formatting data ----------------------------------------------

Transform_function <- function(dataframe){

  # Format data frames
  df <- dataframe %>%
    clean_names %>%
    remove_empty(c("rows", "cols")) %>%
    filter(week_number %in% LookupList)

  # Transpose table
  df <- t(df)

  # Whilst this is a matrix make the top row the header
  colnames(df) <- df[1, ]

  # Make this object a data.frame
  df <- as.data.frame(df)

  # Function to find 'not in'
  '%!ni%' <- function(x, y)!('%in%'(x, y))

  # List of things to remove to tidy the data.frame
  remove <- c("E12000004", "East Midlands")

  # Remove the rows and ensure dates are in the correct format
  df <- df %>%
    filter(E12000004 %!ni% remove) %>%
    mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))

  df$`Week ended` <- as.Date(df$`Week ended`, format = '%Y-%m-%d')

  df <- df %>%
    mutate(date = if_else(is.na(`Week ended`), serialdate, `Week ended`))

  # Final transformation of data
  df %>%
    select(`Total deaths, all ages`, date) %>%
    filter(!is.na(`Total deaths, all ages`)) %>%
    mutate(`Total deaths, all ages` = as.numeric(as.character(`Total deaths, all ages`))) # To match other data.frames
}

    # Run functions -----------------------------------------------------------
    DeathsNow <- Transform_function(Deaths_Now)
    Deaths2018 <- Transform_function(Deaths_2018)
    Deaths2017 <- Transform_function(Deaths_2017)
    Deaths2016 <- Transform_function(Deaths_2016)

    # Merge data -----------------------------------------------------
    Deaths <- bind_rows(DeathsNow,
    Deaths2018,
    Deaths2017,
    Deaths2016) %>%
    mutate(date = as.Date(date),
    `Total deaths, all ages` = as.numeric(`Total deaths, all ages`))

    This code may give a few warnings saying that NAs have been introduced by coercion, which is because many cells in the spreadsheets have no data in them at all. It’s a good thing they are blank (effectively NA): if they held 0s it would not be clear whether a 0 was a real zero or missing data, which could confuse the analysis.

    To suppress warnings in R Markdown add warning=FALSE to the chunk header; however, I like to keep the warnings just in case.
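
    For example, a chunk with its warnings switched off would look like this (the chunk contents are just one of the lines from above):

    ```{r, warning=FALSE}
    Deaths2016 <- Transform_function(Deaths_2016)
    ```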

    If you want to keep all the data after merging it together use:

    library(openxlsx) # To write to xls if required. 

    #Export complete list to excel for future
    write.xlsx(Deaths, 'ImportProvisionalDeaths.xlsx')

    If you’ve saved the combined file, to call it again in a script use the following code:

    library(readxl)

    Deaths <- read_excel("ImportProvisionalDeaths.xlsx")

    Dygraph chart

    The following data is randomly generated as an example:

    library(tidyverse) 
    library(dygraphs)
    library(xts)

    #Fix the randomly generated numbers
    set.seed(178)

    Alldeaths <- Deaths %>%
    select(date) %>%
    mutate(n = rnorm(n(), mean=150))

    Merge the two data frames together:

    ONSAlldeaths <- Deaths %>%    
    left_join(Alldeaths, by = "date") %>%
    mutate(ds = as.POSIXct(date)) %>%
    select(ds, y2 = n, y = `Total deaths, all ages`) %>%
    arrange(ds)

    Dygraphs require dates to be in a time series format and the package xts can convert it:

    ONSAlldeaths_series <- xts(ONSAlldeaths, order.by = ONSAlldeaths$ds) 

    The date column is no longer needed so can be removed but this needs to be done using base R and not dplyr:

    #Remove duplicate date column
    ONSAlldeaths_series <- ONSAlldeaths_series[, -1]

    Finally, the xts can be plotted:

    dygraph(ONSAlldeaths_series, main = "East Midlands Weekly Deaths/Randomly generated numbers") %>%   
    dySeries("y", axis = "y", label = "East Midlands") %>%
    dySeries("y2", axis = "y2", label = "Random numbers") %>%
    dyAxis("x", drawGrid = FALSE) %>%
    dyAxis("y", drawGrid = FALSE, label = "East Midlands") %>%
    dyAxis("y2", independentTicks = TRUE, drawGrid = FALSE, label = "Random numbers") %>%
    dyOptions(stepPlot = TRUE) %>%
    dyRangeSelector()

    Note this chart is not interactive in this blog.

    When you’ve plotted the chart, hovering the cursor over the points shows information about those points, and you can zoom in using the scrolling bar at the bottom of the chart (this comes from the dyRangeSelector() code).

    Other options are detailed here: https://rstudio.github.io/dygraphs/index.html

    ONS Provisional data

    One of the things that may stand out in the chart are the big dips around 29-31 December each year, and we presume these relate to the weeks where registrations may be delayed from GP practices to ONS because of the public holidays around Christmas.

    Unfortunately, only provisionally recorded deaths are available by week, as confirmed deaths are published monthly: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/monthlyfiguresondeathsregisteredbyareaofusualresidence



    The case of the mysteriously disappearing interactive graph in R Markdown html output

    I’d rendered (knit or run) the html reports with the interactive graphs and it had all worked so I emailed the report to people as promised and then the emails came back: “Some of the graphs are missing, can you resend?”. Perplexed, I opened the saved file from the server and, yes, indeed some of the charts had disappeared! Where there should be lovely interactive charts were vast swathes of blank screen. What had happened? The code ran fine, looked fine and how do you even google this mystery?

    It turns out my default browser is Internet Explorer, and I suspect that is true throughout most of the NHS because lots of NHS systems depend on it. Whilst I have the latest version, these reports have never opened properly in Internet Explorer.

    The solution: Chrome (or some other browser). I ask people to copy the link from the Internet Explorer address bar after opening the report from the email and simply paste it into Chrome.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust. 

    @AppliedInfoNott and @Letxuga007

  • Exact Matching in R

    Exact matching in R: a case study of the health and care costs associated with living alone, using linked data from the London Borough of Barking & Dagenham

    I’ve been working with a group of analysts in East London who are interested in joined-up health and social care data. They’ve created a powerful, unique dataset that shows how each resident of the London Borough of Barking & Dagenham interacts with NHS and council services. The possibilities are enormous, both in terms of understanding the way that people use services across a whole system, and for more general public health research.

    You can download a sample of the data here:

    Today we’re interested in whether social isolation is related to healthcare costs, and we’re going to use exact matching to explore this issue. Our theory is that people who live alone have higher healthcare costs because they have less support from family members.

    Here’s an extract from the data. ‘Total cost’ means the cost of health and social care in one year. The actual dataset has many more variables!

    ID   Age group   Sex      Number of long-term conditions   Lives alone   Total cost
    1    50-64       Male     0                                No            £93
    2    65-74       Female   1                                Yes           £0
    3    50-64       Male     4                                No            £1065
    4    85+         Female   5                                Yes           £7210

    For the purposes of this blog, we’ll use a sample of 5000 (out of 114,000) individuals, with some changes to the values to ensure anonymity.

    Descriptive stats

    The first thing we’ll look at is whether people who live alone are different. I find descriptive tables fiddly so often use a function that can be re-used across a number of variables:

    # first, save the data provided above and change your working directory to the place you've saved it with setwd()
    # read data
    library(data.table) # useful for efficient reading and summarising of data
    d <- fread("sample_lbbd_dataset.csv")
     
    # describe age, sex and number of long-term conditions
    describe.by.isolation <- function(variable) {
    a <- table(unlist(d[, variable, with = F]), d$livesalone)
     p <- round(prop.table(a, 2) * 100, 0)
     matrix(paste0(a, ' (', p, ')'), ncol = 2, dimnames = list(row.names(a), c('Live with others', 'Lives alone')))
    }
    lapply(c('age_grp', 'sex', 'LTC_grp'), describe.by.isolation)
     
    # describe healthcare costs
    d[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
     
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(d$cost[d$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(d$cost[!d$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    People who live alone are older, more likely to be female and have more long-term health problems. Their mean healthcare costs are £2,347 higher. The difference in healthcare costs is visible on the histograms, which we have displayed on a log scale because some values are extremely high. There’s some ‘zero inflation’ in both groups – people with no healthcare costs who do not fit into the otherwise lognormal-ish distribution. So far not too surprising – but are the increased healthcare costs explained by the differences in age, sex and health?

    Regression

    One approach would be to use regression. We could fit a linear model – this is actually not a great fit for patients with higher healthcare costs, but we won’t go into that here.

    linear_model <- lm(cost ~ livesalone + age_grp + sex + LTC_grp, d)
    plot(linear_model) # diagnostics show that a linear model is not a great fit. You might have to press return to see all the plots before you can continue.
    summary(linear_model)

    The results suggest that health and care costs for people who live alone are £892 more than those who do not live alone, on average. Clearly the variables we added to the model are important confounders, as this is much lower than the ‘crude’ difference of £2,347. Two limitations of this approach are that we will have to think quite carefully about the fit of the model to the data, and that we can’t describe how living alone affects the distribution of costs.

    Exact matching

    We therefore used exact matching, in which each individual who lives alone is matched to an individual who does not live alone, based on specified variables such as age group and sex. We developed a function that does this by stratifying the data and then matching ‘cases’ and ‘controls’ randomly. Unmatched individuals are deleted from the dataset, leaving a dataset that is balanced in terms of the variables you specify. Let me know here if there’s any other functionality you want and we can try and incorporate it.

    Let’s try matching on age group, sex and the grouped number of long-term conditions:

    # data = dataset containing:
    # - treatment/exposure variable 'treat' (a string specifying the variable name).
    # - matching variable 'mvar' (a string specifying the variable name). If you want to match on multiple variables, concatenate them first.
    # other inputs are:
    # - ratio of cases:controls (an integer > 0)
    # - seed for fixing random selection of cases/controls (an integer; default NULL means no seed). Choice of seed is arbitrary.
    # returns data.table of matched observations, with additional variable 'id' for use in paired/grouped analyses
     
    smatch <- function (data, treat, mvar, ratio = 1, seed = NULL) {
     setnames(data, mvar, '.mvar')
     targ <- data[, .(case = sum(get(treat)), control = sum(!get(treat))), .mvar]
     targ[, cst := floor(pmin(control / ratio, case))]
     targ[, cnt := cst * ratio]
     targ <- targ[cst > 0]
     l2 <- cumsum(targ$cst)
     ids <- mapply(':', c(0, l2[-nrow(targ)]), l2-1)
     names(ids) <- targ$.mvar
     case <- NULL
     control <- NULL
     x <- .Random.seed
     set.seed(seed)
     on.exit({.Random.seed <- x})
     for(i in targ$.mvar) {
       case[[i]] <- data[get(treat) == T & .mvar == i][sample(.N, targ$cst[targ$.mvar == i])]
       case[[i]][, id := ids[[i]]]
       control[[i]] <- data[get(treat) == F & .mvar == i][sample(.N, targ$cnt[targ$.mvar == i])]
       control[[i]][, id := rep(ids[[i]], each = ratio)]
      }
     rbindlist(c(case, control))
    }
     
    # create a single variable summarising matching variables
    d$mvar <- do.call('paste0', d[, c('age_grp', 'sex', 'LTC_grp')])
     
    # create 1:1 matched dataset.
    matched_data <- smatch(d, treat = 'livesalone', mvar = 'mvar', ratio = 1, seed = 74)
     
    # check balance: same number of individuals in each group
    dcast(matched_data, age_grp + sex + LTC_grp ~ livesalone, value.var = 'id', fun.aggregate = length)

    Now we have a dataset that is balanced in terms of age, sex and the count of long-term conditions. Let’s see how healthcare costs compare:

    matched_data[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
     
    # histograms of cost
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(matched_data$cost[matched_data$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(matched_data$cost[!matched_data$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    # t-test (in reality you might want a paired test, and to check whether a t-test is appropriate)
    t.test(cost ~ livesalone, matched_data) # notice how wide the confidence intervals are for this reduced dataset

    # proportion with costs over £10000
    matched_data[, .(over10k = sum(cost > 10000) / .N), livesalone]

    The mean difference is £803. When we used the whole dataset, this value was even closer to the coefficient from linear regression. It’s now difficult to see a difference in the histograms, but you can easily create any description of the distribution that you like – e.g. the proportion of patients that have costs over £10,000.

    Who got matched and who didn’t?

    The point of matching was to create a comparison group of people who don’t live alone who were in some ways similar to the group who do. We probably had to delete lots of people who don’t live alone in a systematic way (e.g. men and younger people who do not live alone were more likely to be deleted). We might also have deleted some of the group who do live alone, which could be more problematic if we want to generalise our results to the population. Let’s see who got deleted…

     d[, matched := ifelse(PID %in% matched_data$PID, 'matched', 'unmatched')]
     
    # just looking at age group for now
    compare_matched <- dcast(d, age_grp ~ livesalone + matched, value.var = 'PID')
    compare_matched[, 'TRUE_total' := TRUE_matched + TRUE_unmatched]
    compare_matched[, lapply(.SD, function(x) x/sum(x) * 100), .SDcols = 2:6]

    You can see that some of the ‘lives alone’ group got deleted (‘TRUE_unmatched’), and they were all in the older age groups. The difference between everyone who lives alone (‘TRUE_total’) and the matched group (‘TRUE_matched’) is diluted, because the oldest groups are a relatively small part of the data. Nonetheless, I would say this is a fairly important difference. If you are concerned about generalisability to the population, you might want to restrict the analysis to people aged under 85. In the full dataset this was not a problem (as there were lots more potential ‘controls’), but you might encounter similar issues if you match on more detailed variables.

    Final point! We’ve focused on a technical solution to exact matching. We haven’t really thought about which variables we should be matching on. This is a matter of judgement, and needs a bit of thinking before we dive into the matching process. Just like covariates in regression, the matching variables are confounders – they’re alternative explanations for an association between living alone and healthcare costs. Age and sex are clearly important. Health status is more complex. Do you want to think about long-term conditions as something that cause people to live alone, or the product of living alone? If people are sicker because they live alone, matching on the number of long term conditions might falsely reduce the association between living alone and healthcare costs.

    This blog was written by Dan Lewer, NIHR Doctoral Research Fellow / SpR Public Health. Department of Epidemiology & Public Health, UCL

    Twitter: danlewer

  • Count of working days function

    It’s at this time of year I need to renew my season ticket and I usually get one for the year. Out of interest, I wanted to find out how much the ticket cost per day, taking into account I don’t use it on weekends or my paid holidays. I started my workings out initially in Excel but got as far as typing the formula =WORKDAYS() before I realised it was going to take some working out and perhaps I should give it a go in R as a function…

    @ChrisBeeley had recently shown me functions in R and I was surprised how familiar they were, as I’ve seen them on Stack Overflow (usually skimmed over them) and they are similar to functions in SQL, which I’ve used (but not written), where you feed in parameters. When I write code I try to work out how each part works and build it up, but writing a function requires running the whole thing and then checking the result: the objects created inside the function never materialise in the environment, so they are never available to check. Not having objects building up in the Environment pane is one of the benefits of using a function, along with not repeating scripts which would then ALL need updating if something changes.
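
    As a tiny illustration of that point (the function and object names here are made up), objects created inside a function do not appear in the Global Environment:

    add_vat <- function(net) {
      gross <- net * 1.2  # 'gross' exists only while the function is running
      gross
    }
    add_vat(100)     # returns 120
    exists("gross")  # FALSE - nothing was left behind in the environment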

    Bus ticket function

    This is the final function; if you run it, you’ll see it just creates the function object.

    library(lubridate) # for dmy(), wday(), %m+% and months()
    library(timeDate)  # for holidayLONDON()

    #Week starts on Sunday (1)
    DailyBusFare_function <- function(StartDate, EmployHoliday, Cost, wfh){
      startDate <- dmy(StartDate)
      endDate <- as.Date(startDate) %m+% months(12)

      #Now build a sequence between the dates:
      myDates <- seq(from = startDate, to = endDate, by = "days")

      working_days <- sum(wday(myDates) > 1 & wday(myDates) < 7) -
        length(holidayLONDON(year = lubridate::year(startDate))) - EmployHoliday - wfh

      per_day <- Cost/working_days

      print(per_day)
    }

    Running the function, you feed in the parameters, which don’t create their own objects:

     DailyBusFare_function("11/07/2019",27,612,1) 

    [1] 2.707965

    Going through each line:

    To make sure each part within the function works, it’s best to write it in another script or move the bit between the curly brackets {}.

    Firstly, the startDate is self-explanatory, but within the function I’ve set the endDate to depend on the startDate and to be automatically 1 year later.

    Originally, when I was trying to find the “year after” a date, I found some documentation about lubridate’s function dyears:

    #Next couple of lines needed to run the endDate line!
    library(lubridate)
    startDate <- dmy("11/07/2019")
    endDate <- startDate + dyears(1)

    but this gives an exact year after a given date and doesn’t take into account leap years. I only remember this because 2020 will be a leap year so the end date I got was a day out!

    Instead, @ChrisBeeley suggested the following:

    endDate <- as.Date(startDate) %m+% months(12)

    Next, the code builds a sequence of days. I got the idea of building up the days from blogs on getting the days between two dates, but it has also come in useful when plotting over time, for example in SPCs, where some time periods are not in the dataset but should still appear as a 0 count.

    library(lubridate)

    #To run so that the sequencing works
    #using as.Date() returns incorrect date formats 0011-07-20 so use dmy from
    #lubridate to transform the date
      startDate <- dmy("11/07/2019")
      endDate <- as.Date(startDate) %m+% months(12)
    #Now build a sequence between the dates:
      myDates <- seq(from = startDate, to = endDate, by = "days")

    All of these return values, so trying to open them from the Global Environment won’t do anything. It is possible to view the first parts of the values there, but you can also type:

    #compactly displays the structure of object, including the format (date in this case)
    str(myDates)

    Date[1:367], format: “2019-07-11” “2019-07-12” “2019-07-13” “2019-07-14” “2019-07-15” …

    #gives a summary of the structure
    summary(myDates)
            Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
    "2019-07-11" "2019-10-10" "2020-01-10" "2020-01-10" "2020-04-10" "2020-07-11"

    To find out what a function does type ?str or ?summary in the console. The help file will then appear in the bottom right Help screen.

    Next I worked out working_days. I got the idea from a blog which said to use length and which:

    working_days <- length(which((wday(myDates)>1&wday(myDates)<7)))

    Note that the value appears as 262L which is a count of a logical vector. Typing ?logical into the Console gives this in the Help:

    Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

    I was familiar with "length", which essentially does a count of factors or vectors (type ?length into the Console for information), but "which" wasn’t something I knew about. @ChrisBeeley explained what which does with the following example:

    #Generate a list of random logical values
    a <- sample(c(TRUE, FALSE), 10, replace = TRUE)
    #Look at list
    a

    [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE

    #using which against the list gives the positions in the list where the value is TRUE
    which(a)

    [1] 4 5 8

    #counts how many logical arguments are in the list (should be 10)
    length(a)

    [1] 10

    #counts the number of TRUE logical arguments
    length(which(a))

    [1] 3

    Then @ChrisBeeley suggested just using “sum” instead of length(which()) which counts a logical vector:

    sum(a)

    [1] 3

    It seems this has been discussed on Stack Overflow before: https://stackoverflow.com/questions/2190756/how-to-count-true-values-in-a-logical-vector

    It’s worth noting that sum will return NA if the vector contains any NAs, so the example on Stack Overflow suggests adding:

    sum(a, na.rm = TRUE)

    [1] 3

    This won’t affect the objects created in this blog as there were no NAs in them but it’s just something that could cause a problem if used in a different context.

    working_days <- sum(wday(myDates)>1&wday(myDates)<7)
    #Just to check adding na.rm = TRUE gives the same result
      working_days <- sum(wday(myDates)>1&wday(myDates)<7, na.rm = TRUE)

    I then wanted to take into account bank/public holidays as I wouldn’t use the ticket on those days so I used the function holidayLONDON from the package timeDate:

    length(holidayLONDON(year = lubridate::year(startDate)))

    [1] 8

    I used lubridate::year because the package timeDate uses a parameter called year, so the code would otherwise read year = year(startDate), which is confusing to me, let alone the function!

    Again, I counted the vectors using length. This is a crude way of getting bank/public holidays as it is looking at a calendar year and not a period (July to July in this case).
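
    As a rough sketch (not part of the original function), you could instead pull the holidays for both calendar years and keep only those that fall inside the ticket year:

    library(lubridate)
    library(timeDate)

    startDate <- dmy("11/07/2019")
    endDate <- as.Date(startDate) %m+% months(12)

    # Bank/public holidays from both calendar years, restricted to the ticket period
    hols <- c(as.Date(holidayLONDON(year = lubridate::year(startDate))),
              as.Date(holidayLONDON(year = lubridate::year(endDate))))
    sum(hols >= startDate & hols <= endDate)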

    I did look at a package called bizdays but, whilst it seemed good for building a calendar, I couldn’t work out how to make it work, so I stuck with the timeDate package. As I get more confident in R it might be something I look into the actual code for, because all packages are open source and available to view through CRAN or GitHub.

    Back to the function…

    I then added “- EmployHoliday” so I could reduce the days by my paid holidays and “- wfh” to take into account days I’ve worked from home and therefore not travelled into work.

    The next part of the code takes the entered Cost and divides by the Working_days, printing the output to the screen:

    per_day <- Cost/working_days

    print(per_day)

    And so the answer to the cost per day is printed in the Console:

    DailyBusFare_function("11/07/2019",27,612,1)

    [1] 2.707965

    A conclusion… of sorts

    Whilst this isn’t really related to the NHS it’s been useful to go through the process of producing a function to solve a problem and then to explain it, line by line, for the benefit of others.

    I’d recommend doing this to further your knowledge of R at whatever level you are, and particularly if you are just learning or consider yourself a novice, as blogs don’t always detail the reasons why things were done (or why they were not done because it all went wrong!).

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @AppliedInfoNott and @Letxuga007

    The R Markdown file is available here:
    https://github.com/nhs-r-community/NHSR_blogs/tree/master/Count_of_working_days_ZTurner_201907

  • SQL Server Database connections in R

    Getting data into R

    One of the things I found most difficult when learning R was getting data from our SQL Servers into R to analyse. It is easy to load csv files, or pull example datasets from packages, but a little more knowledge is required to connect to external databases. I think this is a common problem for my NHS colleagues when learning R, and probably for others too. This post is a brief introduction to the two main ways to pull data into R: using RODBC, and using dplyr's 'DBI-compliant' database connections.
    I’ll be talking about connections with Microsoft SQL Server (over a local network), but this can also extend to other types of database by using different drivers, or other network set-ups using the right authentication.

    Where is the data stored?

    If you are using SQL Server to store your data, it is written into files on the database server and accessed using SQL scripts in your chosen interface (I use SQL Server Management Studio). Whilst the data are stored on disk on the SQL Server, R stores data in memory (RAM) on your machine. This has the advantage of quick access, but you can easily run out of memory with larger datasets, or processes that build larger matrices (like some types of statistical model). If memory is an issue, you will probably get the error message: Error: cannot allocate vector of size .... If you hit this situation, I’d recommend trying the data.table package as an alternative to using data.frames. It is much faster, and has a lower memory footprint. Here’s a great blog post about it from Megan Stodel: https://www.meganstodel.com/posts/data-table/

    Two common methods

    There are two common methods of connection, both of which use Open Database Connectivity (ODBC) drivers:

    1. The RODBC package.
    2. The DBI system, using dplyr, dbplyr and odbc.

    Both of these create a connection using a ‘connection string’ with the details of the server, credentials etc. This is used to create a connection object, from which we can pull data into R or manipulate it in the database.

    A note on security

    To prevent publishing our server or database names in this post, I’ve used an alias that goes and looks them up internally. There are a few options for doing this in your own code, but I’ve added them to my ‘.Renviron’ file, as SERVER and DATABASE variables. My code looks them up each time using the Sys.getenv function that you’ll see in the examples below.

    This has meant using the paste function to concatenate the variables together in the RODBC example. You won’t have to do this in your own work if you replace the server and database names directly.
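
    As a minimal sketch (the values shown are placeholders, not our real server or database names), the .Renviron entries and the lookup might look like this:

    # In your .Renviron file (plain text, one variable per line):
    # SERVER=my_server_name
    # DATABASE=my_database_name

    # After restarting R, the values are available in any session:
    Sys.getenv("SERVER")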

    1. RODBC

    This is the simpler of the two interfaces, and uses slightly older code. It can be used to connect to anything that uses Open Database Connectivity (ODBC). I’ll define a connection string to a database server, a database, and a table called ‘MyTable’ that has some dummy data in it. If you haven’t got any of the packages used in this post, you can install them with: install.packages("RODBC") for example.

    library("RODBC") 
    
    #Connection string
    RODBC_connection <- odbcDriverConnect(paste('driver={SQL Server};server=',
                                                Sys.getenv("SERVER"),
                                                ';database=',
                                                Sys.getenv("DATABASE"),
                                                ';trusted_connection=true', sep = ""))

    # e.g. with a server called "Cliff" and a database called "Richard" your string would be:
    # driver={SQL Server};server=Cliff;database=Richard;trusted_connection=true

    dt1 <- sqlFetch(channel = RODBC_connection, sqtable = "MyTable")

    # Load data from SQL query
    dt2 <- sqlQuery(channel = RODBC_connection, query = "select TOP 100 * from MyTable")

    Quite straightforward to use! In the example above, I specified trusted_connection=true. In a windows environment, this passes your windows credentials to the server. Since we use these for access permissions on our SQL Servers, we can use this method with no issues. You can, instead, specify a username (uid) and a password (pwd): see the help files for more details, using: ?odbcDriverConnect.
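
    For example, a connection string using SQL authentication instead of a trusted connection might look something like this (SQL_USER and SQL_PWD are my own made-up environment variable names; store credentials securely rather than typing them into scripts):

    RODBC_connection2 <- odbcDriverConnect(paste0('driver={SQL Server};server=',
                                                  Sys.getenv("SERVER"),
                                                  ';database=', Sys.getenv("DATABASE"),
                                                  ';uid=', Sys.getenv("SQL_USER"),
                                                  ';pwd=', Sys.getenv("SQL_PWD")))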

    You can also use RODBC to write back to database tables, choosing whether or not to append your results using the append and safer arguments. Not appending means you will overwrite the table:

    sqlSave(channel = RODBC_connection, dat = dt2, tablename = "Mytable_version2", append = FALSE, safer = FALSE) 

    There are lots of other functions included with RODBC to allow you to see structures etc. The package vignette is a very helpful place to go for this, along with the help files.
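
    For example, two that are handy for exploring what is available over a connection:

    sqlTables(RODBC_connection)                       # list the tables visible over this connection
    sqlColumns(RODBC_connection, sqtable = "MyTable") # column names and types for one table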

    Remember to disconnect at the end of your session:

    odbcClose(RODBC_connection) 

    If you do this a lot, you might find Gary Hutson’s recent post, showing how to wrap some of this into a function, a useful addition. Check it out here: http://hutsons-hacks.info/rodbc-helper-function.

    2. DBI and dplyr

    The RODBC interface was simple and quick, and you may not need to consider another approach, but I prefer to use the tidyverse functions linked to dplyr. These functions have matured over the last couple of years, and have a few major advantages:

    • Work with tidyverse functions, including dplyr verbs and the pipe %>%
    • Faster than RODBC to import data
    • Can be used to work with data in the database, without importing it into R.

    The connection string is slightly different, and we require a few more packages to make this work. You need to make sure you have the following installed:

    • dplyr – to make the tbl and use it, we’ll work with dplyr syntax.
    • DBI – a common Database Interface engine for use in S and R (see: https://cran.r-project.org/web/packages/DBI/vignettes/DBI-1.html)
    • dbplyr – this add-on package allows translation from dplyr to SQL.
    • odbc – provides the odbc drivers, but you could use the functions below with other drivers instead.

    library(dplyr) 
    library(dbplyr)
    library(odbc)
    library(DBI)

    DBI_Connection <- dbConnect(odbc(),
    driver = "SQL Server",
    server=Sys.getenv("SERVER"),
    database=Sys.getenv("DATABASE")
    )

    Now we can define a table as if it was part of our R workspace, using the connection object and the name of the table in the database. We can then interact with it directly using dplyr. glimpse is a useful function that shows you the column names, data types and top few entries:

    MyTable<-tbl(DBI_Connection, "MyTable") 

    glimpse(MyTable)
    ## Observations: ??
    ## Variables: 7
    ## Database: Microsoft SQL Server
    ## $ id <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
    ## $ Org <chr> "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "...
    ## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201...
    ## $ month <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
    ## $ Category_1 <dbl> 35395, 21126, 9248, 4049, 5323, 16867, 9916, 12404,...
    ## $ Cateogry_2 <dbl> 39293, 24860, 11031, 5812, 6876, 18475, 12976, 1657...
    ## $ events <int> 1986, 429, 460, 301, 289, 1172, 446, 756, 663, 874,...

    MyTable %>%
    filter(year ==2015) %>%
    group_by(month) %>%
    summarise(AvgEvents = mean(events),
    MaxEvents = max(events),
    N = n()) %>%
    arrange(month)
    ## # Source: lazy query [?? x 4]
    ## # Database: Microsoft SQL Server
    ## # Ordered by: month
    ##   month AvgEvents MaxEvents     N
    ##   <int>     <int>     <int> <int>
    ## 1     4       624      1986    25
    ## 2     5       658      1948    25
    ## 3     6       671      2068    25
    ## 4     7       669      1879    25
    ## 5     8       630      1981    25
    ## 6     9       649      2011    25

    dplyr can then be used to do fairly complex things in just a few lines. The example below is not very well thought-out, but it takes data from the database into a summary plot in just a few lines. I’m filtering the data for 2015 and passing it directly into ggplot2. I’ve set a few options for a box plot, but it’s quite minimal, and our data has remained in the database and not been imported to R.

    library(ggplot2) 

    MyTable %>%
    filter(year ==2015) %>%
    ggplot(aes(y=events, x=factor(month), group=factor(month))) +
    geom_boxplot(fill = "dodgerblue2", alpha = 0.6) +
    labs(title = "Monthly Distribution of Events", x="Month", y="Events")

    You can, of course, write an SQL query directly using the dbSendQuery function. This executes the query on the server-side only, so if you want the results to be returned back to R, you need to use dbFetch as well. You might need this approach if you are doing fancy queries, or things that are specific to a database environment that don’t yet have translators in dplyr.

    SomeRecords <- dbFetch(dbSendQuery(DBI_Connection, "Select TOP 100 * from MyTable")) 

    #or

    SomeRecords <- dbSendQuery(DBI_Connection, "Select TOP 100 * from MyTable") %>%
    dbFetch()

    You may not need to write a custom query for everyday use, but you are still likely to need to pull the data from the server into memory in R sometimes. For me, this is often to build models on it, as that isn’t supported in-database. You can use the collect function for this. For example, using part of the query from earlier as an example:

    MyTable_local<- MyTable %>%   
    filter(year ==2015) %>%
    group_by(month) %>%
    summarise(AvgEvents = mean(events),
    MaxEvents = max(events),
    N = n()) %>%
    arrange(month) %>%
    collect()

    print(MyTable_local)
    ## # A tibble: 6 x 4 
    ## month AvgEvents MaxEvents N
    ## <int> <int> <int> <int>
    ## 1 4 624 1986 25
    ## 2 5 658 1948 25
    ## 3 6 671 2068 25
    ## 4 7 669 1879 25
    ## 5 8 630 1981 25
    ## 6 9 649 2011 25

    You can also write back to a database using the dbWriteTable function. The following code snippet writes a new table to my current connection, called ‘NewDatabaseTable’ using the local data.frame called MyTable_local (that we created in the last section). The append option indicates whether to add to an existing table or not, and overwrite is just what it sounds like:

    dbWriteTable(DBI_Connection,"NewDatabaseTable", MyTable_local, overwrite=TRUE)
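
    As with RODBC, it is good practice to close the connection when you have finished with it:

    dbDisconnect(DBI_Connection)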

    Summary

    Database connections are common in most analysis and data science settings. R can easily connect to relational databases, reading and writing with either RODBC or dplyr/DBI packages. I prefer the DBI route as it is faster, plugs into dplyr allowing ‘piping’ between functions, and allows us to work with data whilst it is still on the database server. This post just scratches the surface, but there are many more options for executing procedures, using different schema etc. This post has been targeted at working with Microsoft SQL Server, but these processes will work well with other databases by switching to the right drivers.

    This blog was written by Chris Mainey, Intelligence Analyst from the Health Informatics team at University Hospitals Birmingham NHS Foundation Trust. The blog was originally posted here: https://mainard.co.uk/post/database-connections-in-r/

  • Introduction to Funnel Plots

    Funnel plots

    Funnel plots are a common tool for comparing organisations or units using proportions or standardised rates. A common use of them is for monitoring mortality at hospitals. This is an introductory post on the subject, which gives a little information about them and how they are constructed. It is deliberately light on theory, focussing on use; some of the theory is referenced for interested readers.

    This post also uses a funnel plot function, for indirectly standardised ratios, that I built as part of my PhD work. The function is based on ggplot2 (Wickham 2009), and is available at https://github.com/chrismainey/CMFunnels, although it’s a work in progress.

    This post was originally written for my own blog (https://mainard.co.uk/post/introduction-to-funnel-plots/), built using RMarkdown, with blogdown, using a Hugo site. Code is available at https://github.com/chrismainey/Mainard/tree/master/content/post

    There are different kinds of funnel plot, but this post focusses on the type used to compare standardised mortality and other similarly constructed indicators.

    Why do we use them?

    Rationale

    How do you go about comparing organisations? We could simply look at indicator data and rank them, but that could be unfair if the conditions are different at each organisation. E.g. every hospital differs in size, the services it offers, and the patients it sees. We might expect a hospital seeing a higher proportion of elderly patients to have a higher mortality rate. Is it fair to compare it to an organisation serving a younger population who may be ‘healthier’ in general? Naively comparing organisations by ranking in league tables has been shown to be a bad idea (Goldstein and Spiegelhalter 1996; Lilford et al. 2004).

    This scenario is not a million miles away from the techniques used in meta-analysis of clinical trials, where we may have trials of different sizes, with different estimates of effect, and differing variances. Some of the techniques applied to meta-analysis have been adapted for healthcare monitoring, including funnel plots and methods to adjust for overdispersion (Spiegelhalter 2005a, 2005b; Spiegelhalter et al. 2012).

    Construction

    If we want to compare a standardised ratio or similar indicator, we can make a plot with the indicator on the Y-axis, and a measure of the unit size on the X-axis. This is commonly the sum of the predicted values for standardised ratios (e.g. the predicted number of cases), or the number of patients/discharges etc. Our centre line, the average value, can be surrounded by ‘control limits,’ a concept from Statistical Process Control. These limits are statistical boundaries to separate natural (‘common-cause’) variation and systematic differences (‘special-cause variation’) (Mohammed et al. 2001). This is commonly at organisational level, but could be at any aggregation.

    The reason these limits resemble a funnel is due to the effects of size. The expected variation is larger when we are looking at fewer cases. For example, imagine an experiment where we toss an unbiased coin to see the expected value. If we toss that coin twice and both are ‘heads,’ our data is telling us that all coin tosses end up as ‘heads.’ This is not true, and we are making an assumption that we know would be different if we repeated it more times. The margin of error around this is high. So if we performed the same experiment 10, 100 or 1000 times, we would expect it to become 50:50, heads/tails, and the margin of error is proportionally smaller. This is also true of indicators based on counts, like funnel plots. We expect less variation between points as organisations get larger.
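
    To see that shrinking variation numerically, here is a quick simulation (purely illustrative, not from the original analysis) of the spread of observed proportions of heads for different numbers of tosses:

    set.seed(42)
    spread <- sapply(c(10, 100, 1000), function(n)
      sd(replicate(500, mean(sample(c(0, 1), n, replace = TRUE)))))
    round(spread, 3)  # the spread of observed proportions narrows as the number of tosses grows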

    Example:

    library(ggplot2)
    library(tidyr)
    
    # Make up some data, as if it was from a regression model with observed and predicted (expected) events.
    dt <- data.frame(observed = c( 15,40,72,28,50, 66, 75),
                     expected = c( 13,32,75,33,54, 60, 72),
                     unit = factor(c("A","B","c","D","E", "F", "G"))
    )
    
    # Add a ratio (SR) of observed to expected, our indicator 
    dt$SR <- dt$observed / dt$expected
    
    # Scatter plot in ggplot
    a<-ggplot(dt, aes(x=expected, y= SR))+
      geom_point()
    
    a
    # Now add a central line; in a ratio like this, 1 is the average/expected value.
    a<- a+geom_hline(aes(yintercept=1))
    a
    # Add a 95% Poisson limit, by using the Poisson quantile function (qpois) to get the limit value for each 'expected'.
    lkup<-data.frame(id=seq(1, max(dt$expected), 1))
    lkup$Upper<-(qpois(0.975,lambda = lkup$id) - 0.025) / lkup$id
    lkup$lower<-(qpois(0.025,lambda = lkup$id) - 0.975) / lkup$id
    
    lkup<-gather(lkup, key, value,-id)
    
    a+ geom_line(aes(x=id, y=value, col=key), data=lkup)

    You’ll probably notice the ‘jagged’ lines in the plot above. This is because the Poisson distribution is only defined on integers, and most common implementations of Poisson functions make some sort of rounding/guess between points. They are generally poorly defined on low values, but there are other options that I’ll discuss in another future post.

    Expanding limits

    The methods described above have been developed into a basic R package to draw these plots using ggplot2. It also allows users to specify whether they want ‘overdispersed’ limits. I will write another post about overdispersion in the coming weeks, but essentially, we have more variation than we would expect from theory alone. To account for this, we can estimate how much greater the variance in our data is, and expand the funnel limits by this amount.

    Part of this process involves ‘Winsorisation’ of the distribution (Spiegelhalter 2005b; Spiegelhalter et al. 2012), where we set the outermost values to a defined percentile to reduce the effects of outliers. This is commonly set to 10% at each end of the distribution, but there is a variant method, used in the NHS’s Summary Hospital-level Mortality Indicator (SHMI), where the distribution is truncated instead of Winsorised (Clinical Indicators Team 2018).
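
    As a generic illustration of Winsorisation (not the exact z-score procedure used in the SHMI or Spiegelhalter methods), pulling the extreme values in to the 10th and 90th centiles looks like this:

    winsorise <- function(x, probs = c(0.1, 0.9)) {
      q <- quantile(x, probs)
      pmin(pmax(x, q[1]), q[2])  # values beyond the chosen centiles are set to those centiles
    }

    set.seed(1)
    z <- rnorm(20)
    winsorise(z)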

    I originally wrote this package to present plots for my PhD thesis, focussed on predicting NRLS incident reporting ratios after risk-adjustment. The overdispersion was particularly high in this case, and differences between the two methods were noticeable, with the SHMI/truncation method appearing better suited.

    Application

    Here we will apply this to some data, picking up the medpar dataset discussed by Hilbe and available in the COUNT package (Hilbe 2014). It is a set of data points from hospitals in Arizona, in 1991, based on US Medicare data. We’ll use the ‘length of stay’ field, los, and model it from the other predictors in the data.

    Installation

    devtools::install_github("https://github.com/chrismainey/CMFunnels")

    Basic model build

    We will first load the data and build a simple predictive model, using a Poisson GLM, with a few of the predictors from the dataset. This post is not focussed on modelling techniques, but a Poisson Generalised Linear Model (GLM) is more appropriate for count data than linear regression. The key message, though, is that Poisson models make no adjustment for the variance within the data and are likely to be overdispersed. A more sophisticated approach might use something like a negative binomial or multilevel model (discussed in a later post).

    A little reformatting is required before modelling:

    library(CMFunnels)
    library(COUNT)
    library(ggplot2)
    
    data(medpar)
    medpar$provnum<-factor(medpar$provnum)
    medpar$los<-as.numeric(medpar$los)
    
    mod<- glm(los ~ hmo + died + age80 + factor(type), family="poisson", data=medpar)
    summary(mod)
    ## 
    ## Call:
    ## glm(formula = los ~ hmo + died + age80 + factor(type), family = "poisson", 
    ##     data = medpar)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -5.7309  -1.9554  -0.5529   0.9717  14.5487  
    ## 
    ## Coefficients:
    ##               Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)    2.26875    0.01246 182.011  < 2e-16 ***
    ## hmo           -0.07637    0.02393  -3.192  0.00142 ** 
    ## died          -0.24574    0.01826 -13.458  < 2e-16 ***
    ## age80         -0.02141    0.02050  -1.045  0.29617    
    ## factor(type)2  0.24921    0.02099  11.871  < 2e-16 ***
    ## factor(type)3  0.74869    0.02627  28.496  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 8901.1  on 1494  degrees of freedom
    ## Residual deviance: 7977.7  on 1489  degrees of freedom
    ## AIC: 13705
    ## 
    ## Number of Fisher Scoring iterations: 5


    Now we have a regression that we can use to get a predicted los that we will compare to observed los:

    medpar$prds<- predict(mod, type="response")

    Build plot

    Now we can build a funnel plot object with standard Poisson limits, and outliers labelled. The function returns a list of the plotted data, the plotted control limit range, and the ggplot object, hence object[3] to call it.

    my_plot<-funnel_plot(predictions=medpar$prds,observed=medpar$los, group = medpar$provnum, 
                title = 'Length of Stay Funnel plot for `medpar` data', 
                Poisson_limits = TRUE, OD_Tau2 = FALSE,label_outliers = TRUE)
    
    my_plot[3]
    ## [[1]]

    Overdispersion

    That looks like too many outliers! There is more variation in our data than we would expect, and this is referred to as overdispersion. So let’s check for it:
    The following ratio should be 1 if our data conform to the Poisson distribution assumption (conditional mean = variance). If it is greater than 1, we have overdispersion:

    sum(mod$weights * mod$residuals^2)/mod$df.residual

    ## [1] 6.240519

    This suggests the variance is 6.24 times the conditional mean, so the data are definitely overdispersed. This is a huge topic, but applying overdispersed limits using either the SHMI or Spiegelhalter methods adjusts for this by inflating the limits:

    my_plot2<-funnel_plot(predictions=medpar$prds,observed=medpar$los, group = medpar$provnum, 
                title = 'Length of Stay Funnel plot for `medpar` data', 
                Poisson_limits = FALSE, OD_Tau2 = TRUE, method = "SHMI",label_outliers = TRUE)
    
    my_plot2[3]
    ## [[1]]

    Given these adjustments, we now only have nine organisations showing special-cause variation. To interpret this plot properly, we would first investigate these outlying organisations before making any changes to the system/indicator. We should check for possible data quality issues, such as errors, missing model predictors, environmental factors (e.g. one organisation changing computer systems and data standards etc. during the monitoring period), but once these are examined we might suspect issues with care at the hospitals in question. They can then be investigated by local audit and casenote review.

    These methods can be used for any similar indicators, e.g. standardised mortality ratios, readmissions etc.

    Summary

    Funnel plots are useful ways to visualise indicators such as mortality, readmission and length of stay data at hospitals, presenting both the indicator value and a measure of the size/variance of each organisation. They allow limits to be drawn between what we might expect by chance and what we might consider to be a signal for investigation. Organisations outside the funnel limits should be examined, first for data quality issues and then for issues with process and clinical care. Overdispersion means that these limits are often too strict, but they can be inflated to adjust for this.

    If you’d like to use my outline ggplot function, or contribute, please pull or fork it on github: https://github.com/chrismainey/CMFunnels

    References

    Clinical Indicators Team, NHS Digital. 2018. “Summary Hospital-Level Mortality Indicator (SHMI) – Indicator Specification.” NHS Digital.

    Goldstein, Harvey, and David J. Spiegelhalter. 1996. “League Tables and Their Limitations: Statistical Issues in Comparisons of Institutional Performance.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 159 (3): 385–409. https://doi.org/10/chf9kj.

    Hilbe, Joseph M. 2014. Modeling Count Data. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139236065.

    Lilford, R., M. A. Mohammed, D. Spiegelhalter, and R. Thomson. 2004. “Use and Misuse of Process and Outcome Data in Managing Performance of Acute Medical Care: Avoiding Institutional Stigma.” Lancet 363 (9415): 1147–54. https://doi.org/10.1016/s0140-6736(04)15901-1.

    Mohammed, Mohammed A, KK Cheng, Andrew Rouse, and Tom Marshall. 2001. “Bristol, Shipman, and Clinical Governance: Shewhart’s Forgotten Lessons.” The Lancet 357 (9254): 463–67. https://doi.org/10/cqjskf.

    Spiegelhalter, David J. 2005a. “Funnel Plots for Comparing Institutional Performance.” Stat Med 24 (8): 1185–1202. https://doi.org/10.1002/sim.1970.

    Spiegelhalter, David J. 2005b. “Handling over-Dispersion of Performance Indicators.” Quality and Safety in Health Care 14 (5): 347–51. https://doi.org/10.1136/qshc.2005.013755.

    Spiegelhalter, David J., Christopher Sherlaw-Johnson, Martin Bardsley, Ian Blunt, Christopher Wood, and Olivia Grigg. 2012. “Statistical Methods for Healthcare Regulation: Rating, Screening and Surveillance.” Journal of the Royal Statistical Society: Series A (Statistics in Society) 175 (1): 1–47. https://doi.org/10.1111/j.1467-985X.2011.01010.x.

    Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag.

    This blog was written by Chris Mainey, Intelligence Analyst from the Health Informatics team at University Hospitals Birmingham NHS Foundation Trust.

  • rforhealthcare.org – A free online resource for all your R related healthcare needs

    Down in the South West of England we at the PenCHORD team of the NIHR CLAHRC South West Peninsula have been busy developing some online treats for all of you interested in using the statistical programming language R. The www.rforhealthcare.org website is a free resource full of R guides and tutorials designed to get you started with R and RStudio. But we do not stop at simply introducing you to the R language, oh no, we have begun to produce tutorials on more complex topics such as linear regression, performing t-tests, statistical process control and principal component analysis.

    The rationale behind developing this site has been to create an easily accessible resource for those working in and around the healthcare sector to support you in using R on a day-to-day basis as an alternative to Excel and SQL. The training topics have been selected so they are relevant to the types of tasks carried out by healthcare analysts while explaining things in such a way that you do not need a PhD in computer science to understand what is going on!

    The website has been designed to be simple and user friendly. If you are completely new to R there is an introductory presentation that you can download on the homepage. This describes what R and RStudio are, how to install them and some basic information about how to go about using R.

    Once you have R and RStudio installed then it is time to start learning about R functionality. There are currently three main sections to the tutorials: Basic functionality, Plotting and Statistics. The ‘Basic functionality’ section covers everything from introducing data types and data structures to transforming and manipulating data through to writing your own functions. The plotting section introduces you to creating all of the basic types of graphs that you need from histograms to box plots. The statistics section then introduces some useful statistical techniques such as t-tests and linear regression.

    A good way to get started using R in the healthcare environment is to take a simple but repetitive task that you currently do in Excel or SQL and translate it into a script that can be used time and time again. This might be something such as data cleaning and transformation or producing a series of graphs and tables. The list of tutorials will keep growing as we find out which topics you would like to learn to implement in R. So please do get in touch through the website or twitter (@PenCLAHRC) and let us know if there is a tutorial you would like us to write for you.

    Over the last 18 months we have also created a large amount of training materials on a variety of topics for use on our innovative Health Service Modelling Associates programme (www.health-modelling.org/). One example is the www.pythonhealthcare.org website which is a free resource for using the programming language python for healthcare related applications. There are topics to help you go from a complete beginner to advanced level programmer in no time! Over the coming months we will be making all of our training materials available for free online in the areas of: introducing simulation modelling in healthcare, problem structuring for simulation modelling, system dynamics modelling, discrete event simulation, network analysis and machine learning, so watch this space!

    This blog was written by Dr Sean Manzi, Research Fellow at NIHR CLAHRC South West Peninsula.

  • Format ONS spreadsheet

    Background

    A Public Health consultant colleague Ian Bowns (@IantheBee) created a report to monitor mortality within the Trust and he used the ONS weekly provisional data for the East Midlands to compare the pattern and trends of deaths over time. This involves downloading a file from:

    https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales

    which is updated weekly. Once a month I manually add numbers from this to another spreadsheet, which is imported into R for the overall analysis.

    Downloaded file formats

    You may be familiar with ONS and other NHS data spreadsheets format and if you are not, here are some of the issues:

    • data is presented in wide form, not long form (so columns are rows and vice versa)
    • the sheets are formatted for look rather than function with numerous blank rows and blank columns
    • there are multiple sheets with information about methodology usually coming first. This means a default upload to programmes like R is not possible as they pick up the first sheet/tab
    • the file name changes each week and includes the date which means any code to pick up a file needs to be changed accordingly for each load
    • being Excel, when this is imported into R, there can be problems with the date formats. These can get lost to the Excel Serial number and
    • they include a lot of information and often only need a fraction of it

    Given these difficulties there is great temptation, as happened with this, to just copy and paste what you need. This isn’t ideal for the following reasons:

    • it increases the likelihood of data entry input error
    • it takes time and
    • it is just so very very tedious

    The solution is, always, to automate. Tools like Power Pivot in Excel or SSIS could work, but as the final report is in R it makes sense to tackle this formatting in R, and this is the result…

    Import file

    For this you can either save the file manually or use the following code within R. Save it to the same place where the code is running and you should see the files in the bottom right window under the ‘Files’ tab. The best way to do this is to use a project and open the script within that project.

    download.file("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2019/publishedweek082019.xls",
                  destfile = "DeathsDownload.xls",
                  method = "wininet", #use "curl" for OS X / Linux, "wininet" for Windows
                  mode = "wb") #wb means "write binary"

    Note that this file’s name and URL change each week, so the code needs changing each time it is run.

    Once the file is saved, use readxl to import it, which means the file doesn’t need its format changing from the original .xls.

    When I upload this file I get warnings which are related, I think, to the Excel serial numbers appearing where dates are expected.

    • sheet = : refers to the sheet I want. I think this has to be numeric and doesn’t use the tab’s title.
    • skip = : is the number of top rows to ignore.

    library(readxl)

    DeathsImport <- read_excel("DeathsDownload.xls",
                               sheet = 4,
                               skip = 2)
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in C5 / R5C3: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in F5 / R5C6: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in G5 / R5C7: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in H5 / R5C8: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in I5 / R5C9: got a date
    ## New names:
    ## * `` -> `..2`

    Formatting the data

The next code creates a vector that is used later in a similar way to SQL’s IN, but without having to type the list out inside the filter itself. For example:

    • SQL : WHERE city IN (‘Paris’,‘London’,‘Hull’)
• R : filter(week_number %in% LookupList)

    These lines of code are base R code and so don’t rely on any packages.

    LookupList <- c("Week ended",
                "Total deaths, all ages",
                "Total deaths: average of corresponding",
                "E12000004"
    )

The next bit uses the dplyr package, which is loaded as part of the tidyverse, as well as the janitor package. Not all packages are compatible with the tidyverse, but many are, as it is often the go-to set of packages for data manipulation.

As an aside, the %>% is called a pipe and the RStudio shortcut is Ctrl + Shift + M. It’s worth learning, as you’ll be typing a lot more if you write out each pipe by hand.

    Janitor commands

• clean_names: removes spaces in column headers and replaces them with _
• remove_empty: gets rid of empty rows and columns – this dataset has a lot of those!

    Dplyr command

    • filter: is looking just for the rows with the words from the list ‘LookupList’. These will become the column names later.
library(tidyverse)
library(janitor)

DeathsImport2 <- DeathsImport %>% 
  clean_names() %>% 
  remove_empty(c("rows","cols")) %>% 
  filter(week_number %in% LookupList) 

There are great commands called gather and spread which can be used to move wide-form data to long and vice versa, but here I noticed that I just needed to turn the data on its side, so I used t(). This also turns the data frame into a matrix, which you can see in the ‘Environment’ window in the top right of RStudio: there is no blue circle with an arrow next to t_DeathsImport.

    t_DeathsImport <- t(DeathsImport2)

    Being a matrix is useful as the next line of code makes the first row into column headers and this only works on a matrix.

    colnames(t_DeathsImport) <- t_DeathsImport[1, ]

    Dplyr gives an error on matrices:

    Code:

t_DeathsImport %>% mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))

    Result:

Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "c('matrix', 'character')"

As later code will need dplyr, turn the matrix into a data frame using some base R code:

    t_DeathsImport <- as.data.frame(t_DeathsImport)

    Previous dplyr code filtered on an %in% bit of code and it’s natural to want a %not in% but it doesn’t exist! However, cleverer minds have worked out a function:

    '%!ni%' <- function(x,y)!('%in%'(x,y))
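To see what the new operator does (a quick illustrative example, not from the original post):

c("Paris", "London", "Hull") %!ni% c("London")
## [1]  TRUE FALSE  TRUE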

The text between the quote marks can be anything, but I like ‘%!ni%’ as it’s reminiscent of Monty Python.

Because of the moving around of rows to columns, the data frame now has an unnecessary row of column names, as well as a row with just ‘East Midlands’ in one of the columns, so the following ‘remove’ list is a workaround to get rid of these two lines.

    remove <- c("E12000004", "East Midlands")

The next code uses the above list, followed by a mutate which uses the janitor command excel_numeric_to_date. This does what it says but, as often happens, the data first needs to be changed to character and then to numeric. The date_system = "modern" argument isn’t needed for this data, but I took it from the internet and it worked, so I left it in.

A warning will appear about NAs (nulls) being introduced by coercion.

    t_DeathsImport <- t_DeathsImport %>% 
      filter(E12000004 %!ni% remove) %>% 
      mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))
    ## Warning in excel_numeric_to_date(as.numeric(as.character(`Week ended`)), :
    ## NAs introduced by coercion

    Now to deal with this mixing of real dates with Excel serial numbers.

    Firstly, the following code uses base R to confirm real dates are real dates which conveniently wipes the serial numbers and makes them NAs.

    t_DeathsImport$`Week ended` <- as.Date(t_DeathsImport$`Week ended`, format = '%Y-%m-%d')

    This results in two columns:

    • Week ended which starts off with NAs then becomes real dates and
    • serialdate which starts off with real dates and then NAs.

    The human eye and brain can see that these two follow on from each other and just, somehow, need to be squished together and the code to do it is as follows:

    t_DeathsImport <- t_DeathsImport %>% 
      mutate(date = if_else(is.na(`Week ended`),serialdate,`Week ended`))

To translate the mutate: this creates a new column called date which, if Week ended is null, takes the serialdate, and otherwise takes the Week ended value.

    Interestingly if ‘ifelse’ without the underscore is used it converts the dates to integers and these are not the same as the Excel serial numbers so use ‘if_else’!
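To illustrate the difference (a quick example, not part of the original post):

x <- as.Date("2019-02-22")
ifelse(TRUE, x, as.Date(NA))          # returns a bare number (days since 1970-01-01), not a Date
dplyr::if_else(TRUE, x, as.Date(NA))  # returns "2019-02-22" and keeps the Date class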

    And that’s it.

    Or is it?

You might want to spit the data frame back out into Excel, and that’s where a different package called openxlsx can help. As with many things in R, “other packages are available”.

library(openxlsx)
write.xlsx(DeathsImport, 'ImportProvisionalDeaths.xlsx')

If you haven’t used a project (which is really the best way to work) this will probably save to some obscure location on your C: drive, which you can see in the bottom-left ‘Console’ pane just under the ‘Console’ and ‘Terminal’ tabs. Using a project means you set the pathway, so the file saves in the same place and will also appear in the bottom-right panel under ‘Files’.

    Feedback

I’m pretty early on in my journey with R and many of my colleagues haven’t started yet, so I’m throwing this out there for everyone to see, newbies and old hands alike. If you spot anything, can explain anything further, need more explanation or can offer any alternatives to what I’ve done, please feel free to comment.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

  • Using R to track NHS winter pressures

    Every Thursday during winter, roughly from December to March, NHS Digital releases a week’s worth of hospital performance data, known as the Daily Situation Reports. This data often receives media attention because cold weather and seasonal illnesses can lead to higher demand for hospital care, meaning that services might be under increased pressure. At the Health Foundation, one of our aims is to provide new insight into the quality of health care through in-depth analysis of policy and data. So, to understand more about the current demand for hospital care and the challenges the NHS is facing, we keep a close eye on the latest seasonal trends.

    Keeping on top of NHS winter indicators has the potential to keep us analysts busy. The raw data is published in a fairly complex spreadsheet, requires a decent amount of cleaning and needs to be reanalysed after every release. In addition to monitoring national figures, this winter our team at the Health Foundation also wanted to see if there might be any variation between different areas of England. Sustainability and transformation partnerships (STPs) are areas where health and care leaders develop shared proposals for local services. Therefore, we enriched the raw data with information about where hospitals are located, and which STP they belong to. But with a similar analytical approach, more fine-grained local structures (such as Clinical Commissioning Groups) could be used.

For a more efficient and reproducible way of tracking NHS winter indicators this year, we moved our whole analysis pipeline to R. We then used the clean data for visualisations in R and other applications, like Tableau. This blog takes you through our analysis workflow and describes how we got through some tricky steps. The complete R script is also available on GitHub if you want to give it a go yourself. You can also read a blog on the Health Foundation’s website to find out why we looked at local areas this year and what we found.

    Why write this blog on data analysis?

    Analysts at many other local and national organisations are interested in NHS winter performance data. In order for this blog to be a good resource for them, we plan to:

    • share our own analytical approach and R code
    • illustrate how and why we made certain analytical decisions and
    • discuss what we learned along the way, both about the data and R.

    We hope that this blog will inspire others to do the same, and to share their code too. Here, we won’t try to interpret any regional differences in winter performance. We know that there are several factors involved, so we will leave this up to others with more local knowledge.

    What does this blog cover?

      1. R packages we used
      2. Data download from NHS Digital
      3. Data import from R-unfriendly spreadsheets: how we tackled multi-line headers with merged cells
      4. Data cleaning: how we defined and dealt with missing data
      5. Feature engineering: how we added organisational information on sustainability and transformation partnerships (STPs)
      6. Aggregation: monthly averages within local STPs
      7. Visualisation: how we mapped STP-level data in R
      8. What have we learned?


    1. R packages we used

The tidyverse collection of R packages is a useful set of tools for data analysis and visualisation that are designed to work together. It contains the ggplot2 package, which we use for visualisations. (Need help getting started with R and the tidyverse? Try the R for Data Science website).

    We use a function from the readxl package to import Excel files and the lubridate package, which makes working with dates a lot easier. Both are part of the tidyverse, but not loaded automatically along with the tidyverse package.

    The geojsonio package is one option that helps us import geographic data structures, such as STP boundaries. The broom and maptools packages are then needed to prepare these for visualisation with ggplot.

    https://gist.github.com/fiona-grimm/8b45642fc4ca8c92bf9294dbcd6be08c

    Note: if you would like to run this code in your own R session and you haven’t installed these packages already, you will need to do so (once) using the install.packages() function.
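As a reference, the packages described above can be loaded as follows (a minimal sketch of the setup; the gist linked above contains the exact code we used):

library(tidyverse)  # includes ggplot2 and dplyr
library(readxl)     # importing Excel files
library(lubridate)  # working with dates
library(geojsonio)  # importing GeoJSON boundary files
library(broom)      # tidy() for spatial objects
library(maptools)   # needed by broom when tidying spatial objects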

    2.  Data download from NHS Digital

    After setting up the tools, we downloaded Winter Daily SitRep 2018–19 Data from the NHS Digital website. Rather than opting for the spreadsheets that contain one week of data each, we went for the more convenient time series file, which contains all data from this winter up to the latest release. One drawback is that the name of this file changes weekly as new data is added (so, should you try to use this code at a later date, you will probably have to adapt the file name).

    https://gist.github.com/fiona-grimm/101ef292cd56a92cc8222bb8a0fa69ae


    3. Data import from R-unfriendly spreadsheets

    How we tackled multi-line headers with merged cells

    Once we opened the file in Excel, we saw a number of sheets containing different indicators. With a few exceptions, most have a similar table structure. As we were interested in patient flow through hospitals, we focused on ‘General and acute beds’ and ‘Beds occupied by long-stay patients’ for now.

    What the sheets with these indicators had in common was that there was metadata in the first 13 lines followed by a two-line header. Several columns containing variables (second header line) were grouped within dates (first header line) and the cells around the dates were merged. There were also some empty columns and rows, which we addressed later on.

    Example of the Excel layout of Winter SitRep data to be imported into R. Source: NHS Digital.

    Unfortunately, this was not yet a tidy table, as the dates that served as column headers were values, not variable names. All this made importing the file into R slightly more challenging. We fixed this by creating a custom import_sitrep() function that would:

    1. read and store the data and the two lines of the header separately,
    2. combine the two lines of the header,
    3. add the combined header to the data and tidy up column names by removing special characters,
    4. and finally convert the table into a tidy table ready for the next step.

    But wait, there was one more tricky bit. What happened when we tried to read in the headers one by one?

    https://gist.github.com/fiona-grimm/756b7b827895f58579c810c252e6c384

    *Sigh*… they ended up not having the same length. The first header line (containing the dates) was 8 elements shorter. Looking at its left and right side (see above) gave us a hint as to why this might have happened:

    • In the Excel sheet, the first few cells in this line were empty and when the line was read in, they were converted to NA. The read_xlsx() function then discarded these empty columns (probably) because they were at the beginning.
    • There were also some merged cells. During import they were separated and, if empty, converted to NA. Empty columns at the end of the header line also seem to be discarded by read_xlsx().

    So, we needed to find a way to preserve the length of the first header line in our import_sitrep() function. This is how we solved it:

    https://gist.github.com/fiona-grimm/f49eecefbe57faddfc0b2abefa9c543e
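For readers who prefer not to follow the gist, here is a minimal sketch of one way to read and combine a two-line header with merged date cells. It is not our exact function, and the sheet argument, the skip value and the use of make.names() are assumptions for illustration:

library(readxl)

read_sitrep_header <- function(path, sheet, skip = 13) {
  # Read both header rows together, as text, so they keep the same length
  hdr <- read_xlsx(path, sheet = sheet, skip = skip, n_max = 2,
                   col_names = FALSE, col_types = "text")
  dates <- as.character(unlist(hdr[1, ]))  # first header line: dates, with NAs from merged cells
  vars  <- as.character(unlist(hdr[2, ]))  # second header line: variable names
  # Carry the last seen date forward across the merged/blank cells
  for (i in seq_along(dates)) {
    if (is.na(dates[i]) && i > 1) dates[i] <- dates[i - 1]
  }
  # Combine the two lines and tidy them into valid column names
  make.names(paste(dates, vars, sep = "_"))
}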

    Now that we had our import function, we were ready to read and combine the sheets containing the winter indicators ‘General and acute beds’ and ‘Beds occupied by long-stay patients’.

    https://gist.github.com/fiona-grimm/b5e444590367aec6ed968a8663138973

    The data was now tidy, as each variable formed a column and each observation formed a row. Note that the observations for England in the table were followed by values for individual hospital trusts.

    4.  Data cleaning

    Trust exclusions

    We excluded the three children’s hospitals when calculating aggregate measures, such as the average bed occupancy within STPs. Our reasoning was that their patient profiles would be different from other acute trusts and this might skew the averages. Nevertheless, we kept track of them at a trust level.

    This applies to Birmingham Women’s and Children’s NHS Foundation Trust (code RQ3), Alder Hey Children’s NHS Foundation Trust (RBS) and Sheffield Children’s NHS Foundation Trust (RCU).

    https://gist.github.com/fiona-grimm?page=2

    How we approached defining ‘missingness’: when is a zero not a zero?

    Data collection and validation errors do happen, so finding suspicious data points is an important step during data cleaning.

    While this is easy if a value is missing (or NA), it’s much harder to decide whether a zero truly represents ‘zero events’ or a missing value (in fact, it could even be both within the same data set). To distinguish between the two, at the Health Foundation we came up with the following criteria:

• How likely is a ‘zero event’ for an indicator? For example, when counting beds in a large hospital the likelihood of having zero beds open seems small, but when counting long-stay patients having none seems possible.
• How consistent is the zero value, in that trust, over time? Or in plain English: does the value jump from a higher number to zero (and back), or is it hovering somewhere close to zero?

    The next two sections describe how we found and dealt with these missing values.

    Finding longer periods of missing data

    If any hospital trust had missing values, in any indicator, on 4 or more consecutive days during the reporting period, it was excluded from the analysis. We were only looking for these periods in variables where we would not expect any zeros (the list is shown as cols_to_check).

    Why this particular cut-off? We wanted to aggregate the data and calculating weekly averages did not seem justified if we were missing more than half of a week for any trust.

    Here’s how we summed up how many consecutive days were zero or NA within each trust/variable combination:

    https://gist.github.com/fiona-grimm/9d4ca6567b605118544dd24b709a177f

    https://gist.github.com/fiona-grimm/d436454c392ea9bf78ae3ee9083310d4
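The gists above contain the full code; the core idea can be sketched like this (the data frame name winter_data and the columns trust_code, variable, date and value are assumptions about the tidy layout):

library(dplyr)

consecutive_gaps <- winter_data %>%
  arrange(trust_code, variable, date) %>%
  group_by(trust_code, variable) %>%
  summarise(max_gap = {
    gap  <- is.na(value) | value == 0   # TRUE where a value is missing or zero
    runs <- rle(gap)                    # lengths of consecutive TRUE/FALSE stretches
    if (any(runs$values)) max(runs$lengths[runs$values]) else 0L
  }) %>%
  ungroup()

# Trust/variable combinations with 4 or more consecutive missing or zero days
filter(consecutive_gaps, max_gap >= 4)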

    When we filtered for 4 or more consecutive days, we found that:

    • The trust with the code RTD reported zero long-stay patients (of any length of stay) for the whole reporting period to date, which seemed unrealistic for a general and acute hospital.
    • Trust RQW had a gap of 7–8 days, that coincided for the indicators shown (we checked this separately in the raw data).
    • Trust RAX reported zero long-stay patients (of any length of stay) for 6 days during January, but reported a high number before and after.

    Based on this, all variables from the trusts RTD, RQW and RAX were excluded from the analysis of this year’s (2018/19) winter data. This left us with 128 out of 134 trusts.

    It’s worth noting that with this data-driven approach different trusts might be excluded each year and the number of excluded trusts could change over the winter period as new ‘gaps’ appear. Keep this in mind when making comparisons, both throughout the winter period and between years.

    https://gist.github.com/fiona-grimm/ed0e2c42d2fe8cbbb72b3070f99444f0

    Dealing with shorter data gaps

    Next, we checked how many missing or zero values were left:

    https://gist.github.com/fiona-grimm/e89a26740095479cea7b0dac783048d7

    Most of the remaining gaps (42 out of 54) consisted of only a single day and they were mostly found in variables relating to long-stay patients. To judge whether these looked like real ‘zero events’ or were more likely to be reporting errors, we had a closer look at the data:

    https://gist.github.com/fiona-grimm/8e37c125cb239b2ce00281acedbeaf9c

    Before cleaning: plots showing the subset of trusts reporting ‘0’ in any (non-derived) variable.

Based on the data before and after the zeroes, these were unlikely to be true values. It would have been possible to impute these gaps in some way, for example by taking the mean of the day before and the day after. Instead, we took the approach that required the fewest assumptions and simply replaced the gaps with NA:

    https://gist.github.com/fiona-grimm/70f002e5b2d169bf629fc8463a059946

    After cleaning: plots showing the same trusts after replacing zeros with NA.

    Validating derived variables

    Some variables present in the data set were derived from others: for example, total.beds.occd should be the sum of core.beds.open and escalation.beds.open.

    We could have discarded derived variables straightaway and then computed them ourselves, in order to be completely sure about how they have been derived and what they mean. Since we were curious about their quality, we first checked if total.beds.open and occupancy.rate had been calculated as expected so we could decide whether or not to replace them (spoiler: yes, we did).

    https://gist.github.com/fiona-grimm/d7be2ac9b3f83596f48733a65f525063

    Similarly, if it had been the focus of our analysis, we would also have rederived national aggregates for England.

    5.  Feature engineering

    Adding organisational information on sustainability and transformation partnerships (STPs)

    As we wanted to compare indicators between local areas, we decided to calculate averages of winter indicators over hospital trusts within each STP. To do this, we needed to add variables to the raw data that indicated which hospital trust belonged to which STP. Unfortunately, we were not aware that this information was available in a format convenient for data analysis.

    So, we manually created and validated a lookup table to map hospital trusts to STPs, using information related to the 2017/18 formation of 44 STPs from NHS England. While some STPs have since changed (for example, three STPs in the north of England merged in August 2018), this was the latest and most comprehensive information available, as far as we are aware.

    The allocation of most hospital trusts to STPs was straightforward using this list, but there were a few instances where we had to choose:

    • If a trust did not appear in any STP list, it was matched according to the location of its main acute site. This was the case for four trusts in Northumberland, Tyne and Wear and North Durham STP.
    • If a trust was mentioned in more than one STP plan, it was allocated according to the location of its main acute site. This applied to both Chesterfield Royal Hospital NHS Foundation Trust and Epsom And St Helier University Hospitals NHS Trust.

    We think this is a reasonable approach when analysing winter indicators, which mainly come from acute trusts, but we would be keen to hear your feedback.

    Once we had this lookup table, we imported it into R and merged it with the winter indicators:

    https://gist.github.com/fiona-grimm/b7fd87ba3c8ecbc1236c48c6ef8b19a0

    The lookup table is also available on Github. Please note that STPs change and develop over time, so if you would like to use it, it’s worth checking that the information is up to date.

    Making read-outs comparable between trusts: from raw counts to rates

As hospital trusts come in different sizes and serve different numbers of patients, raw patient counts are not suitable for comparisons between trusts or regions. Percentages or fractions, such as bed occupancy rates, are more comparable.

Therefore, we derived the fraction of occupied beds that were occupied by long-stay patients of over 7, 14 or 21 days:

    https://gist.github.com/fiona-grimm/e984a41eb48058ef4ca5254412f8aa7e

    6.  Aggregation: monthly averages by STP

    In some cases, it might be useful to visualise daily changes, but weekly or even monthly aggregates have the advantage of being less noisy, free of weekday-weekend variation and can potentially be more useful to monitor longer-term trends.

    First, we created a new column that contained the corresponding month. The month was then used as grouping variable, along with the trust or STP code, to calculate monthly averages of bed occupancy and long-stay patient rates.

    For weekly averages, an alternative would have been to create a new column containing the date of the respective beginning of the week using the cut() function (also shown below).

    We know it’s also good practice to keep track of the number of valid observations (as in, not NA) that we average over within each group used. In this instance, for trust-level aggregates, this represented the number of days within a week. For STP-level aggregates, it corresponded to the number of trusts within the STP.

    https://gist.github.com/fiona-grimm/fed7f70f891f9a017dd7bb097e6f2173
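As a rough sketch of this aggregation step (the gist contains the actual code; the data frame and the stp_code column name are assumptions):

library(dplyr)
library(lubridate)

monthly_stp <- winter_data %>%
  mutate(month = floor_date(date, unit = "month")) %>%
  # for weekly averages one could instead use: week = as.Date(cut(date, breaks = "week"))
  group_by(stp_code, month) %>%
  summarise(mean_occupancy = mean(occupancy.rate, na.rm = TRUE),
            n_valid        = sum(!is.na(occupancy.rate))) %>%  # number of valid observations
  ungroup()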

    At this point, we also saved the tidy and aggregated data as a CSV file for visualisation in other programs, such as Tableau.

    7.  Visualisation: how we mapped STP-level data in R

    There are endless ways to creatively visualise aspects of this data (the R Graph Gallery is a great place to get inspired). We wanted to plot a map of STP boundaries and colour-shade them according to the average bed occupancy in each winter month.

    STP boundaries are available as a GeoJSON file from the Office for National Statistics (ONS). We downloaded and imported the digital vector file and then created a plot to get a first idea of what was in the file:

    https://gist.github.com/fiona-grimm/db0b883148b7f95bc3546cf2721e83df

    Boundaries of Sustainability and Transformation Partnerships in England (2017). Data source: ONS.

    Before spatial objects can be used with ggplot, they have to be converted to a data frame. This can be done using the tidy() function from the broom package. To add our data to the map, we then merged the resulting data frame with the winter data.

    The cut() function provided a convenient way to divide the variable bed.occupancy into meaningful intervals and to add labels that could be displayed on the map. Converting variables into factors and ordering the factor levels using factor() ensured that the intervals and months were in the right order. We were then ready to plot and save the map:

    https://gist.github.com/fiona-grimm/f4c7437b18d73c73c4e7ad16bbae479e
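A bare-bones version of the mapping step might look like the sketch below, using the monthly_stp data frame from the aggregation sketch above (the file name, the region field name and the join columns are assumptions; the gist above has the full code, including the cut() intervals and month labels):

library(geojsonio)
library(broom)
library(dplyr)
library(ggplot2)

# Import the STP boundaries as a spatial object
stp_shapes <- geojson_read("stp_boundaries_2017.geojson", what = "sp")

# Convert to a data frame for ggplot, attach the winter data and pick one month to map
stp_map <- tidy(stp_shapes, region = "stp17cd") %>%
  left_join(monthly_stp, by = c("id" = "stp_code")) %>%
  filter(month == as.Date("2019-01-01"))

# Basic choropleth of average bed occupancy
ggplot(stp_map) +
  geom_polygon(aes(x = long, y = lat, group = group, fill = mean_occupancy),
               colour = "white") +
  coord_fixed() +
  theme_void()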

    How could we have improved this map further? One option might have been to interactively display STP averages when hovering over the map using the plotly package, for example.

    8.  What have we learned?

    Lots! No, seriously… Using this workflow, rather than copy and paste, has saved us a lot of time this winter. But beyond that, creating (then revisiting and improving) this workflow turned out to be a great way to work together and to make our analysis more transparent, more reproducible and easier to understand.

    Key points we are taking away:

• With trusts, CCGs, STPs, ICSs, and more to choose from when we’re looking at local variation within health care systems, it can be challenging to pick the right organisational level. Local health care structures are also changing and evolving rapidly. We need to do more thinking about which level is most informative and to compare different options – suggestions are welcome.
    • Some cleaning is a must. Winter data has a quick turnaround and NHS Digital only perform minimal validation. However, compared to previous years, the 2018/19 winter data quality has markedly improved (you can use our code to check the improvement for yourself).
    • But cleaning decisions impact interpretability. If the list of excluded trusts changes between years (based on how well they are reporting in any given year), this makes the data less comparable year-on-year.
• We’ll play it safe with derived variables. The effort of rederiving them ourselves was worth it for the peace of mind we got from knowing that they were calculated consistently and meant exactly what we thought they meant.
    • Next winter, our future selves will be very happy to have our analysis nicely documented with code that is ready to go. It should even be easy to adapt the same workflow for other open NHS data sets (we’ll save this for another time).

    This blog was written by Fiona Grimm, Senior Data Analyst in the Data Analytics team at the Health Foundation. @fiona_grimm

  • Animating a Graph Over Time in Shiny


    I was interested in creating an animated graph in Shiny, which allows for an interactive input. As a mock example, I created an app which shows the distribution of a waitlist’s volume over time.

Excel offers the ability to easily create graphs, but when it comes to animating them the process can be tricky, often requiring complicated VBA. In my opinion, R is far better suited to this, due to its object-oriented nature.

    Simple, right?

    When I first had a go with making a simple animated plot, I thought it would be as simple as:

    • Create vector of distribution
    • Plot this vector as a bar chart
    • Update the new vector every time there is a trigger

    Unfortunately, that’s when I ran into reactivity.

    Reactivity

    In short (and to my understanding), if you have anything in R Shiny that will be updated, it should be within a reactive environment. Flows of information in R work by pull, rather than push mechanisms. This means the data is “looked at” only when needed by the output (ie when it is created), but the data doesn’t push forward any change to the output if it updates.

Diagram: case 1 shows R’s default pull-based flow of information; case 2 shows the flow reversed inside a reactive environment.

    Above, case 1 represents the normal situation in R once data has been updated. In case 2, a reactive environment is used, which effectively reverses the direction of the flow of information in R.

    Another, perhaps more relatable way of interpreting reactivity, is to imagine reading in a csv file to a table in R called, for example, my_table. If you were to update values in the csv file, the object my_table would not be updated; it would require the data to be re-read into R.

    Reactive objects can live inside something called reactiveValues, which acts like a list. There are 2 ways to create / update this:

    • Create the reactive list using my_reactive_list <- reactiveValues(), then update values in that list using $ (eg my_reactive_list$somedata <- c(1,20,23,42,98) )
    • Create the items while creating the list (eg my_reactive_list <- reactiveValues(somedata = c(1,20,23,42,98)) )

    To then use this data, simply call the object from the output you would like to use it for. For example, ggplot(my_reactive_list$somedata). When somedata changes, so will the plot output.

    It is worth noting that, just like lists, items within the reactiveValues object are not limited to vectors.

    More about reactivity can be found here:

    https://shiny.rstudio.com/articles/understanding-reactivity.html

    Code

    Here is the code I used to create this app:

    https://github.com/danyaalmohamed/R/blob/master/animatedshiny.r

Though the code is long, a large proportion of that is for inputs and layouts; a basic animated graph can be created with much simpler code:

    • Data in reactiveValues() form
    • A plot output
• A trigger – you could use an actionButton() or a reactiveTimer()

Shiny consists of a ui and a server element. Any visuals (e.g. a button, a plotOutput) should be held within the ui, while any functions or calculations which happen behind the scenes should live within the server function.
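Putting those pieces together, a stripped-down example might look like this (a sketch, not the app linked above; the waitlist numbers are made up):

library(shiny)

ui <- fluidPage(
  actionButton("step", "Advance one week"),   # the trigger
  plotOutput("waitlist_plot")                 # the visual
)

server <- function(input, output) {
  # Data held in reactiveValues(), starting from a made-up distribution
  rv <- reactiveValues(waitlist = c(5, 20, 23, 42, 18))

  # Each click shifts the distribution to mimic a waitlist changing over time
  observeEvent(input$step, {
    rv$waitlist <- c(rpois(1, 10), head(rv$waitlist, -1))
  })

  output$waitlist_plot <- renderPlot({
    barplot(rv$waitlist, main = "Waitlist volume by weeks waited")
  })
}

shinyApp(ui, server)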

    I generally find it useful to use a series of small functions in R, and especially in Shiny. Instead of putting all the code inside of an actionButton, calling a function allows scope to implement it with different triggers, such as a reactiveTimer, or looping. It also allows for easier reading, in my opinion.

When initially attempting to create the app, I was taken aback by the complexity of Shiny, and faced a steep learning curve. I took a few days to learn the basics, and after some practice I feel I know what is available in the Shiny toolbox, and some ways to work around problems.

    This blog was written by Dan Mohamed, Undergraduate Analyst at the NHS Wales Delivery Unit.

  • A run chart is not a run chart is not a run chart

    Understanding variation using runs analysis

    Run charts are simple and powerful tools that help discriminate between random and non-random variation in data over time – for example, measures of healthcare quality.

    Random variation is present in all natural processes. In a random process we cannot know the exact value of the next outcome, but from studying previous data we may predict the probability of future outcomes. So, a random process is predictable. Non-random variation, on the other hand, appears when something new, sometimes unexpected, starts to influence the process. This may be the result of intended changes made to improve the process or unintended process deterioration. The ability to tell random from non-random is crucial in quality improvement. One way of achieving this is runs analysis.

    In general, a run is defined as a sequence of like items that is preceded and followed by one or more items of another type or by no item. Items can be heads and tails, odd and even numbers, numbers above and below a certain value, etc.

    Runs analysis is based on knowledge of the natural distributions and limits of run lengths and the number of runs in random processes. For example, if we toss a coin 10 times and get all heads (1 run of length 10), we would think that something non-random is affecting the game. Likewise, if a run chart includes a run of 10 data points on the same side of the centre line, we would start looking for an explanation.

    Figure 1: Run charts from random numbers. A: random variation. B: non-random variation in the form of an upwards shift introduced after data point number 16 and identified by unusually long and few runs and signalled by a red, dashed centre line. See text for details on how to identify non-random variation.

    Specifically, a run chart may be regarded as a coin tossing game where the data points represent heads and tails depending on their position above and below the centre line (ignoring data points that fall directly on the centre line). If the process is random, the data points will be randomly distributed around the centre (Figure 1A). A shift in process location will affect the distribution of data points and will eventually present itself by non-random patterns in data, which can be identified by statistical tests (Figure 1B).

    Swed and Eisenhart studied the expected number of runs in random sequences. If the number of runs is too small or too large, it is an indication that the sequence is not random (1). To perform Swed and Eisenhart’s runs test one must either do rather complicated calculations or look up the limits for the expected number of runs in tables based on the total number of runs in the sequence and the number of items of each kind. Simplified tables for use with run charts have been developed for up to 60 data points (2). For more than 60 data points, the limits can be calculated using the normal approximation of the runs distribution function (3). For example, in a run chart with 24 data points, the expected number of runs (95% prediction limits) is 8-18.

    Chen proposed an alternative to Swed and Eisenhart’s method. Instead of counting the number of runs, Chen counts the number of shifts in the sequence, i.e. when a value of one kind is followed by a value of another kind, which is one less than the number of runs (4). To avoid confusing Chen’s shifts in sequence with shifts in process location, I use the term crossings.

In run charts, crossings are easily counted by counting the number of times the graph crosses the median line. If the process is random, the chance of crossing or not crossing the median line between two adjacent data points is fifty-fifty. Thus, the total number of crossings has a binomial distribution, b(n − 1, 0.5), where n is the number of data points and 0.5 is the success probability.

    We should consider whether we are interested in both sides of the distribution, too few and/or too many crossings. By nature, a shift in process location following process improvement or deterioration will result in fewer crossings than expected. But unusually many crossings (oscillation) is also a sign of non-random variation, which will appear if data are negatively autocorrelated, that is, if any high number tends to be followed by a low number and vice versa. However, oscillation is not an effect of the process shifting location, but most likely a result of a poorly designed measure or sampling issues (5 p175). Chen recommends using one-sided tests suited for the purpose of the analysis, i.e. whether one is interested in detecting shifts or oscillations (4).

    For example, for a run chart with 24 data points we could choose the lower fifth percentile of the cumulative binomial distribution of 23 trials with a success probability of 0.5 as our critical value for the lower limits of crossings. This is easily calculated in R using the qbinom() function, qbinom(p = 0.05, size = 24 - 1, prob = 0.5) = 8, i.e. fewer than 8 crossings would be unusual and suggest that the process is shifting. In Figure 1B non-random variation in the form of a shift is identified by the fact that the chart has only 6 crossings when at least 8 would be expected from 24 random numbers.

    The number of crossings (and runs) is inversely related to the lengths of runs. All things being equal, fewer crossings give longer runs and vice versa. Therefore, a test for unusually long runs is also commonly used as a means to identify shifts. A simple example is the “classic” rule of thumb of a run of 8 or more data points on the same side of the centre line. But just like the expected number of crossings, the expected length of the longest run depends on the total number of data points. In a run chart with, say, 100 data points, we should not be surprised to find a run of 8.

    The distribution of longest runs has been described in detail by Schilling (6–8). The expected length of the longest run either above or below the median is log2(n), where n is the total number of data points, excluding data points that fall directly on the centre line. Approximately 95% of the longest runs are predicted to be within ±3 of the expected value. For the purpose of detecting a shift, we are interested in the upper prediction limit for longest run, which is log2(n)+3 (rounded to the nearest integer). For example, in a run chart of 24 data points, the upper 95% prediction limit for the longest run is round(log2(24) + 3) = 8, i.e. a run of more than 8 indicates a shift. Figure 1B has an unusually long run of 9 consecutive data points on the same side of the centre line.
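Both limits are easy to reproduce in R using the formulas above; for a run chart with 24 useful data points:

n <- 24                                     # useful data points (not on the median)
qbinom(p = 0.05, size = n - 1, prob = 0.5)  # lower 95% prediction limit for number of crossings: 8
round(log2(n) + 3)                          # upper 95% prediction limit for longest run: 8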

    A trend is a special form of a run, where like items are defined as data points that are bigger or smaller than the preceding one. The trend test was developed by Olmstead who provided tables and formulas for the probabilities of trends of different lengths depending on the total number of data points (9). For example, with less than 27 data points in total, the chance of having a trend of 6 or more data points going up or down is less than 5%. Note that Olmstead defines a trend as the number of jumps rather than the number of data points that surround the jumps.

    In summary, there are (at least) four unusual run patterns that may be used to identify non-random variation in run charts:

    • Too many runs
    • Too few runs
    • Too long runs
    • Too long trends

The selection of rules and the choice of critical values to define too many, too few and too long have a significant influence on the statistical properties of run charts. This is the subject of the following sections.

    Signal confusion

    Compared to control charts, surprisingly little has been published on the statistical properties of run charts (2, 10–13).

Carey introduced four run chart rules (10):

    1. too much variability: more runs than expected,
    2. too little variability: fewer runs than expected,
    3. a shift: a run of 8 or more data points on the same side of the median (if less than 20 data points in total, a run of 7 suggests a shift),
    4. a trend: a run of 6 (or 7) data points going up or down.

    Carey provided a probability table showing the expected range of number of runs based on Swed’s work (1).

    Perla, Provost, and Murray suggested a different set of run chart rules (2):

    1. a shift: a run of 6 or more data points on the same side of the median,
    2. a trend: a run of 5 data points going up or down,
    3. too much variability: more runs than expected,
    4. too little variability: fewer runs than expected,
    5. an astronomical data point: an unusually large or small number.

    Perla’s rules differ from Carey’s mainly in that the critical values used to detect shifts and trends are relaxed. Perla notes that rules 1-4 are based on an α error of p<0.05. However, the source of this information is unclear. Also, Perla does not tell whether the α error applies to each individual rule or the set as a whole.

    Finally, Olesen and I have proposed yet another set of run chart rules, which – for lack of a better term – I refer to as the Anhøj rules (11):

    1. Shift rule: A shift is present if any run of consecutive data points on the same side of the median is longer than its prediction limit.
    2. Crossings rule: A crossings signal is present if the number of times the graph crosses the median is smaller than its prediction limit.

    The prediction limits for longest run and number of crossings both depend on the number of useful observations (data points not on the median) and are based on the works of Schilling (8) and Chen (4) as described in the previous section. Prediction limits for 10-60 data points are tabulated in the appendix.

    In general, the Anhøj rules are more conservative (less sensitive, more specific) than the Carey and Perla rules. For example, with 24 data points the Anhøj rules take a run of 9 to signal a shift, while Perla and Carey only take 6 and 8, respectively (Figure 2).

    Figure 2: Run chart with random variation according to the Carey and Anhøj rules but with both a shift and a trend according to the Perla rules. The chart has 30 observations of which 27 are not on the median. The longest run has 6 data points below the median, there are 11 crossings, and the longest trend has 6 data points (excluding a tie) going down.

    Which rules to rule?

So, which run chart rules are best? To answer that question, we need to agree on what we mean by “best”. The purpose of any data display is to help us make good decisions. The link between data and decisions is prediction. As W Edwards Deming put it: “A curve fitted to a set of points is of interest, not on account of the data fitted, but because of data not yet fitted. How will this curve fit the next batch of data?” Specifically, regarding run charts, the question of interest is: how well does the centre (e.g. median) of historical data represent the centre of future data? In a random process, historical data reliably predicts future data. In a non-random process, this may or may not be the case. In practice, however, we cannot know in advance if a process is random or non-random, which is why we need statistical tests (rules) to help us make the distinction. But even the most reliable statistical tests make mistakes by sometimes overlooking the condition being tested for (false negative) and sometimes identifying a condition that is not there (false positive).

    The questions of interest for run chart users are what is the chance that a chart with a positive test really represents non-random variation? and what is the chance that a chart with a negative test really represents random variation?

    Likelihood ratios are diagnostic measures designed to answer such questions. The positive likelihood ratio, LR+, is the true positive rate divided by the false positive rate. The negative likelihood ratio, LR–, is the false negative rate divided by the true negative rate:

LR+ = sensitivity / (1 − specificity)
LR− = (1 − sensitivity) / specificity

    A likelihood ratio greater than 1 speaks in favour of the condition being tested for, which in our case is non-random variation, while a likelihood ratio less than 1 speaks against the presence of non-random variation. The further a likelihood ratio is from 1, the more or less likely is the presence of non-random variation.

    Thus, likelihood ratios allow us to quantify the probability of non-random variation in data sequences and are useful quality characteristics of run (and control) chart rules.

Figure 3: Likelihood ratios of different sets of run chart rules. See text for details.

    In Run chart diagnostics I studied and compared likelihood ratios for the Perla, Carey and Anhøj rules (12) using simulated random data series to mimic run charts.

    In half the simulations, a shift in process mean of 2 SD was introduced. So, for each simulated run chart the true presence or absence of a shift together with the presence or absence of signals from the runs analyses was known by the simulation program allowing the program to calculate likelihood ratios for each set of run chart rules.

    For each simulated data series, the median was calculated using the first 6, 12 or 18 data points as baseline. And the shifts (when present) were introduced in the post-baseline period of 6, 12 or 18 data points. Consequently, there were nine combinations of baseline and post-baseline periods of different lengths allowing me to study the influence of these parameters on the diagnostic value of the tests.

    For each of the nine possible combinations of baseline and post-baseline length, 1,000 simulations were performed with and 1,000 without post-baseline shift. In total, 18,000 run charts were simulated.

    The results are summarised in Figure 3. Overall, the Anhøj and Carey rules perform “better” than the Perla rules – the Anhøj rules slightly but consistently better than the Carey rules. For run charts with 12 or more data points in the post-baseline period, the Anhøj and Carey rules perform very well with positive LRs around 10 and negative LRs around 0.1. The interpretation is, that given a positive test based on the Anhøj or Carey rules, the presence of a shift of 2 SD is about 10 times more likely than no shift; and given a negative test, a shift is about 10 times less likely than no shift.

    In the previous paragraph, I use the term “better” to describe likelihood ratios with a wide and symmetric range centered around 1. Symmetry is important if we want equal protection from false positive and false negative results. The Perla rules have very low negative LRs meaning that negative tests with great certainty rule out shifts. However, they also have low positive LRs suggesting that positive tests are only 2–4 times more likely to be associated with true shifts in process performance than not.

The main reason for the low value of positive tests signalled by the Perla rules is the trend rule, which has been shown to add little more than false signals to runs analysis (14). In particular, the Perla version of the trend rule (5 points in a row) is flawed, having a false positive signal rate of around 20%. Perla’s shift rule, which signals when only 6 data points in a row are on the same side of the median, is also responsible for a significant number of false alarms.

    qicharts2: Quality improvement charts with R

The qicharts2 package for R (15) employs the Anhøj rules as standalone rules with run charts and as sensitising rules with control charts (Figure 4).

    The latest stable version of qicharts2 can be installed from CRAN:

    install.packages('qicharts2')

    The current development version can be installed from GitHub:

    devtools::install_github('anhoej/qicharts2', build_vignettes = TRUE)

    Along with the package comes a vignette that explains its use and provides several examples using actual healthcare data:

    https://anhoej.github.io/qicharts2/articles/qicharts2.html
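As a quick illustration (using random data rather than the examples from the vignette), a run chart analysed with the Anhøj rules can be produced with the qic() function:

library(qicharts2)
set.seed(19)
qic(x = 1:24, y = rnorm(24), chart = 'run')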

    Run charts or control charts?

It is a common misunderstanding that run charts are inferior to control charts. As Figure 4 clearly demonstrates, the runs analysis is more sensitive to minor persistent shifts in data than are the control limits, which only react to larger shifts in data. In a recent paper, Tore Wentzel-Larsen and I showed that runs analysis with the Anhøj rules is actually comparable to or even better than traditional control chart rules at identifying minor to moderate persistent shifts in data (16).

    Figure 4: Control chart of random normal variables. In the last 8 data points a shift of 2 SD was introduced. All data points are within the control limits, but the shift is signalled by an unusually long run of 9 data points.

    In my view, run and control charts are two sides of the same coin, and I refer to them collectively as statistical process control charts. I recommend to always use the Anhøj rules first together with a visual inspection of the run chart for other signs of non-random variation. If – and only if – the run chart finds random variation and the extra sensitivity of control limits to identify larger, possibly transient, shifts in data is needed, a control chart may be useful.

    For improvement work, run charts are usually all we need, because we are expecting persistent shifts in data. For quality control – when a process has been stabilised at a satisfactory level – control limits are useful to quickly identify sudden larger unwanted shifts in data.

    Conclusion

    In conclusion, several sets of run chart rules with different diagnostic properties are currently available. The Perla rules recommended by the Institute for Healthcare Improvement (US) and widely used within the National Health Services (UK) have poor diagnostic value mainly due to a high false signal rate. The Anhøj and the Carey rules have better diagnostic properties that reliably tell random from non-random variation and balance the risk of false positive and false negative signals.

    Appendix: Critical values for longest run and number of crossings

Table: prediction limits for longest run and number of crossings for 10–60 useful data points (not reproduced here).

    References

    1. Swed FS, Eisenhart C. Tables for testing randomness of grouping in a sequence of alternatives. The Annals of Mathematical Statistics [Internet]. 1943 [cited 2018 Oct 7]; 14:66–87. Available from: http://www.jstor.org/stable/2236004.

    2. Perla RJ, Provost LP, Murray SK. The run chart: A simple analytical tool for learning from variation in healthcare processes. BMJ Qual Saf [Internet]. 2011 [cited 2018 Oct 7]; 20:46–51. Available from: http://qualitysafety.bmj.com/content/20/1/46.

    3. Provost LP, Murray SK. The health care data guide: Learning from data for improvement. San Francisco: John Wiley & Sons Inc.; 2011.

    4. Chen Z. A note on the runs test. Model Assisted Statistics and Applications [Internet]. 2010 [cited 2018 Oct 7]; 5:73–7. Available from: https://content.iospress.com/articles/model-assisted-statistics-and-applications/mas00142.

    5. Western Electric Company. Statistical quality control handbook. New York: Western Electric Company inc.; 1956.

    6. Schilling MF. The longest run of heads. The College Mathematics Journal [Internet]. 1990 [cited 2018 Oct 7]; 21:196–207. Available from: https://www.jstor.org/stable/2686886.

    7. Schilling MF. Long run predictions. Math Horizons [Internet]. 1994 [cited 2018 Oct 7]; 1:10–2. Available from: http://www.jstor.org/stable/25677964.

    8. Schilling MF. The surprising predictability of long runs. Mathematics Magazine [Internet]. 2012 [cited 2018 Oct 7]; 85:141–9. Available from: http://www.jstor.org/stable/10.4169/math.mag.85.2.141.

    9. Olmstead PS. Distribution of sample arrangements for runs up and down. The Annals of Mathematical Statistics [Internet]. 1946 [cited 2018 Oct 7]; 17:24–33. Available from: http://www.jstor.org/stable/2235901.

    10. Carey RG. How do you know that your care is improving? Part 1: Basic concepts in statistical thinking. J Ambulatory Care Manage. 2002; 25(1):80–7.

    11. Anhøj J, Olesen AV. Run charts revisited: A simulation study of run chart rules for detection of non-random variation in health care processes. PLoS ONE [Internet]. 2014 [cited 2018 Oct 7]. Available from: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0113825.

    12. Anhøj J. Diagnostic value of run chart analysis: Using likelihood ratios to compare run chart rules on simulated data series. PLoS ONE [Internet]. 2015 [cited 2018 Oct 7]. Available from: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0121349.

    13. Anhøj J, Wentzel-Larsen T. Sense and sensibility: On the diagnostic value of control chart rules for detection of shifts in time series data. BMC Medical Research Methodology [Internet]. 2018 [cited 2018 Oct 7]. Available from: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-018-0564-0.

    14. Davis RB, Woodall WH. Performance of the control chart trend rule under linear shift. Journal of Quality Technology. 1988; 20:260–2.

    15. Anhøj J. Qicharts2: Quality improvement charts for r. JOSS [Internet]. 2018 [cited 2018 Oct 7]. Available from: https://joss.theoj.org/papers/10.21105/joss.00699.

    16. Anhøj J, Hellesøe A-MB. The problem with red, amber, green: The need to avoid distraction by random variation in organisational performance measures. BMJ Qual Saf [Internet]. 2016 [cited 2018 Oct 7]. Available from: https://qualitysafety.bmj.com/content/26/1/81.

This article is based on previously published, peer-reviewed articles, and large portions of the text have been copied from those sources.

    Jacob Anhøj

  • A simple function to create nice correlation plots

    The problem

I was working with a dataset where I wanted to assess the correlation of different variables in R. As much as I like R, the outputs from the console window leave something to be desired in terms of data visualisation. Therefore, I wanted a way to visualise these correlations in a nicer / cleaner / crisper way. The solution is to use a correlation plot.

    Loading the correlation plot package

The package I used for creating my correlation plots was the corrplot package; this can be installed and loaded into the R workspace using the syntax below:

    #install.packages("corrplot")
    library(corrplot)

    At this point I would encourage you to check out help for the corrplot function, as it allows you to pass a multitude of parameters to the function.

    Deconstructing the function

As mentioned previously, this plotting function has a multitude of uses, but all the parameters can be off-putting to a newbie! This was me 6 years ago, vigorously typing ‘how to do this with R relating to x’ and bombarding stackoverflow and other useful websites with questions. Shout out to RBloggers as well!

The function I have created uses the functionality of the corrplot package, but simplifies the inputs. I will include the function in stages to explain each step; however, if you just want to use the function and are not bothered about the underpinnings, skip the following section:

    Step 1 – Function parameters

    Parameters of the function are as below:

    create_gh_style_corrplot <- function(df_numeric_vals,
                                         method_corrplot,
                                         colour_min,
                                         colour_middle,
                                         colour_max="green") {

    The parameters to pass to the function are:

    1. df_numeric_vals this means a data frame of numeric values only, so any categorical (factor) data needs to be stripped out before passing the data frame to the function;
    2. method_corrplot this is a numeric range from 1 – 5. So, for a shaded correlation plot you would use 1. Further examples of the various options will be discussed when I describe how the if statement works.
3. colour_min this sets the colour at the negative end of the gradient used to shade the correlations. An example of an input here would be “green”.
4. colour_middle this is the middle range colour; normally I set this equal to (=) “white”.
5. colour_max this is the colour of the strong positive correlations.

    For information on the strength of correlations, refer to this simple guide.

    Step 2 – Creating the conditional (IF statement) to select correlation plot type

    The below conditional statement uses the input of the function e.g. 1-5 to select the type of chart to display. This is included in the code block below:

    library(corrplot)
    
    if(method_corrplot == 1 ){
      type_var <- "shade"
      method_corrplot = type_var 
    }
    else if (method_corrplot ==2) {
      type_var <- "number"
      method_corrplot = type_var
    }
    else if (method_corrplot ==3) {
      type_var <- "pie"
      method_corrplot = type_var
    }
    else if (method_corrplot ==4) {
      type_var <- "ellipse"
      method_corrplot = type_var
    }
    else if (method_corrplot ==5) {
      type_var <- "circle"
      method_corrplot = type_var
    }
    else{
      type_var <- "shade"
      method_corrplot <- type_var
    }

What does this do then? Well, firstly, nested in the function I make sure that the corrplot library is loaded so that the correlation plot functionality can be used. The next series of steps then repeats this pattern:

    • Basically, this says that if the method_corrplot parameter of the function equals input 1, 2, 3, etc – then select the relevant type of correlation plot.
• type_var is a variable that is set equal to the string stated. These strings link directly back to the method parameter of the corrplot function, which expects a plot type such as ‘shade’ or ‘number’.
    • Finally, the last step is to convert method_corrplot equal to the textual type specified in the preceding bullet.

In essence, the numeric value passed into the parameter (e.g. 1) sets type_var to a text string that matches something corrplot is expecting, and method_corrplot is then set equal to that type variable. Essentially, the integer passed into the parameter is turned into a string/character output.
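As an aside, the same integer-to-string mapping could be written more compactly with switch(); this is purely an illustrative alternative and not part of the function being deconstructed here:

# Illustrative alternative to the if/else chain (not part of the original function):
# switch() maps the numeric input onto the string corrplot expects,
# falling back to "shade" when no match is found.
method_corrplot <- switch(as.character(method_corrplot),
                          "1" = "shade",
                          "2" = "number",
                          "3" = "pie",
                          "4" = "ellipse",
                          "5" = "circle",
                          "shade")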

    Step 3 – Hacking the corrplot function

As specified in the previous sections, the corrplot function has a lot of inputs and is in need of simplifying, so that is exactly what I have tried to do. The corrplot call is the last step in my simpler function, which reduces the many parameters down to just five inputs:

    corrplot(cor(df_numeric_vals, use = 'all.obs'), method = method_corrplot,
             order = "AOE",
             addCoef.col = 'black',
             number.cex = 0.5,
             tl.cex = 0.6,
             tl.col = 'black',
             col= colorRampPalette(c(colour_min, colour_middle, colour_max))(200),
             cl.cex = 0.3)
    }

    Let’s explain this function.

So, the corrplot function is the main driver here, and the nested cor call is just as important, as this is the command that creates the correlation matrix. The settings are: the df_numeric_vals data frame is the data to use; use = ‘all.obs’ tells cor to use all observations in the data frame; and method = method_corrplot uses the if statement created in step 2 to select the relevant chart from the input. order = “AOE” orders the variables by the angular order of the eigenvectors, addCoef.col = ‘black’ sets the correlation coefficient labels to always show in black, and tl.col = ‘black’ does the same for the variable labels. The colour scale of the correlation plot uses the colorRampPalette function to create a gradient, and the three colour settings link to the inputs I explained in step 1.

    The full function is detailed here:

create_gh_style_corrplot <- function(df_numeric_vals,
                                     method_corrplot,
                                     colour_min,
                                     colour_middle,
                                     colour_max = "green") {

  library(corrplot)

  if (method_corrplot == 1) {
    type_var <- "shade"
    method_corrplot <- type_var
  } else if (method_corrplot == 2) {
    type_var <- "number"
    method_corrplot <- type_var
  } else if (method_corrplot == 3) {
    type_var <- "pie"
    method_corrplot <- type_var
  } else if (method_corrplot == 4) {
    type_var <- "ellipse"
    method_corrplot <- type_var
  } else if (method_corrplot == 5) {
    type_var <- "circle"
    method_corrplot <- type_var
  } else {
    type_var <- "shade"
    method_corrplot <- type_var
  }

  corrplot(cor(df_numeric_vals, use = 'all.obs'), method = method_corrplot,
           order = "AOE",
           addCoef.col = 'black',
           number.cex = 0.5,
           tl.cex = 0.6,
           tl.col = 'black',
           col = colorRampPalette(c(colour_min, colour_middle, colour_max))(200),
           cl.cex = 0.3)
}

If you want to use the function, just copy and paste this code into an R script file and run it; this will create the function for you. Please remember to install the corrplot package first using install.packages("corrplot").

    Utilising the function

The example dataset I will use here is the mpg sample dataset provided with ggplot2. Run the function code from the end of the last section first, as this will create the function in R’s environment. Next, add the code below to look at the various iterations and charts that can be created from the data:

##------------------CREATE DATASET---------------------------------------

library(ggplot2)  # provides the mpg sample dataset

# Keep only the numeric variables in the data frame to use with the function
numeric_df <- data.frame(mpg[c(3, 5, 8, 9)])

##------------------USE FUNCTION-----------------------------------------
    
    create_gh_style_corrplot(numeric_df,1, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,2, "steelblue2","black", "black")
    create_gh_style_corrplot(numeric_df,3, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,4, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,5, "steelblue2","white", "whitesmoke")

The outputs of the charts depend on the correlation plot type selected (1–5) and the colour ranges chosen. You can choose any colour, and I would recommend running the command colours() in the R console or a script to pull up the list of colour names native to R.
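For example, to browse the built-in names, or to narrow them down to a particular family of shades, you could run something like this at the console:

colours()                              # list all of R's built-in colour names
grep("blue", colours(), value = TRUE)  # just the blues, e.g. "steelblue2"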

    How about these visualisations:

Each plot can be tailored to suit your needs. I tend to like blue shades, but go all out and choose whatever colours you like. The R source code is accessible here.

I do hope you will use this function to maximise your correlation plots – it’s all about relationships!

    This blog was written by Gary Hutson, Principal Analyst, Activity & Access Team, Information & Insight at Nottingham University Hospitals NHS Trust, and was originally posted here.

  • From script-based development to function-based development and onwards to Package Based development: part 2

    At the NHS R Conference, I suggested to people that they should embrace the idea of package-based development rather than script-based work.

    In the first part of this tutorial, in the fictional NHS-R Community greeting room, our humble analyst was tasked with greeting people. Rather than writing a script and needing to repeat themselves all the time with different variations of greetings and senders, they wrote a rather nifty little function to do this:

greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {

  if (!is.character(greeting)) {
    stop("greeting must be a string")
  }
  if (!is.character(sender)) {
    stop("sender must be a string")
  }
  if (length(sender) > 1) {
    warning("greet_person isn't very good at handling more than one sender. It is better to use just one sender at a time.")
  }
  message(greeting, " from ", sender)
}

    As we know, R is awesome and many people took up R on the background of some excellent publicity and training work by the NHS-R community. Our poor greeting team got overwhelmed by work: it is decided that the team of greeters needs to expand. There will now be a team of three greeters. Every other bit of work output from our NHS-R community team will involve greeting someone before we present our other awesome analysis to them.

This is going to be a nightmare! How can we scale our work to cope with multiple users, and multiple other pieces of work using our greeting function?

    If we rely upon the scripts, we have to trust that others will use the scripts appropriately and not edit or alter them (accidentally or on purpose). Furthermore, if someone wants to greet someone at the beginning of their piece of analysis, they’ll either have to copy the code and paste it somewhere, or link to our script containing the function – which in turn means they need to keep a common pathname for everything and hope no-one breaks the function. Nightmare!

    Fortunately, someone attended the last NHS-R conference and remembered that package-based development is a really handy way of managing to scale your R code in a sustainable way. So after a team meeting with copious caffeine, it is decided that greet_person needs to go into a new package, cunningly named NHSRGreetings. And here’s how we’re going to do it.

In RStudio, go to File and then to New Project. Click on New Directory, and then click on R Package. I am using RStudio 1.2 Preview for this tutorial, which is downloadable from the RStudio website. I would recommend doing this, as some of the package management has been greatly simplified and some frustrating steps removed.

    Click on ‘Open in new session’ (so we can see the original code), and set the Package name as NHSRGreetings. We could just pull our old source files into the package – but for this tutorial I’m going to do things the longer way so you also know how to create new functions within an existing package.

    Set the project directory to somewhere memorable.

    For now don’t worry about the git or packrat options – those are tutorials within themselves!

You are greeted with a package more or less configured for you. A single source file, ‘hello.R’, is set up within an ‘R’ directory inside the package. It’s not as cool as our function of course, but it’s not bad! It comes with some very helpful commented text:

    # Hello, world!
    #
    # This is an example function named 'hello'
    # which prints 'Hello, world!'.
    #
    # You can learn more about package authoring with RStudio at:
    #
    #   http://r-pkgs.had.co.nz/
    #
    # Some useful keyboard shortcuts for package authoring:
    #
    #   Install Package:           'Cmd + Shift + B'
    #   Check Package:             'Cmd + Shift + E'
    #   Test Package:              'Cmd + Shift + T'

    So let’s check if the comments are right – hit Cmd + Shift + B on a Mac (on Windows and Linux you should see slightly different shortcuts). You can also access these options from the Build menu in the top right pane.

You will see the package build. R will then be restarted, and you’ll see the command library(NHSRGreetings) performed immediately, which loads our newly built package.

    If you type hello() at the command line, it will do as you may expect it to do!

    hello()
    [1] "Hello, world!"
    >

    So – time to customise our blank canvas and introduce our much more refined greeter.

    In the root of our project you will see a file called DESCRIPTION. This contains all the information we need to customise our R project. Let’s customise the Title, Author, Maintainer and Descriptions for the package.
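As a rough illustration only (every field value below is invented), a customised DESCRIPTION might end up looking something like this:

Package: NHSRGreetings
Type: Package
Title: Greetings From the NHS-R Community
Version: 0.1.0
Author: A. N. Analyst
Maintainer: A. N. Analyst <a.analyst@example.nhs.uk>
Description: Sends customisable greetings on behalf of the
    NHS-R Community greeting team.
License: GPL-3
Encoding: UTF-8
LazyData: true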

    We can now create a new R file, and save it in the R subdirectory as greet_person.R. Copy over our greet_person function. We should be able to run install and our new function will be built in as part of the package.
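A sketch of what R/greet_person.R could contain is below; the roxygen-style comments are my own illustration rather than a required part of the build, but they become useful later when Check asks for documentation:

# R/greet_person.R (illustrative sketch)

#' Greet a person
#'
#' Sends a greeting message from a named sender.
#'
#' @param greeting A string containing the greeting.
#' @param sender A string naming the sender.
#' @export
greet_person <- function(greeting = "Hello World",
                         sender = "the NHS-R Community!") {
  if (!is.character(greeting)) {
    stop("greeting must be a string")
  }
  if (!is.character(sender)) {
    stop("sender must be a string")
  }
  if (length(sender) > 1) {
    warning("greet_person isn't very good at handling more than one sender. ",
            "It is better to use just one sender at a time.")
  }
  message(greeting, " from ", sender)
}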

We can now get individual team members to open the package, run the build once on their machine, and the package will be installed for them. When they want to use any of the functions, they simply run library(NHSRGreetings) and the package will be ready to go, with all the functions available to them. When you change the package, they will have to rebuild it just the once to get access to the new features.

When writing packages it is useful to be very wary of namespaces. One of the nice things about R is that there are thousands of packages available. The downside is that this makes it very likely that two package authors will choose the same name for a function. This makes it doubly important to pick appropriate names for things within a package.

For example, what if instead of the NHSRCommunity package someone wrote a CheeseLoversRCommunity package with a similarly named greet_person, but it did something totally different?

    In a script, you have full control over the order you load your packages, so R will happily let you call functions from packages and trust that you know what order you loaded things in.

If you are a package author, however, it’s assumed your package may be installed on many machines, each with a potentially infinite set of combinations of other packages whose names may clash (or, if they don’t already, might do in the future).

So within the package, any function which doesn’t come from R itself needs to clearly declare which package it has come from.

Within DESCRIPTION you must declare which packages you use, and the minimum version of each. You do this with the Imports keyword. Below is the Imports section of one of the SSNAP packages:

    Imports:
        methods (>= 3.4.0),
        lubridate (>= 1.7.4),
        tibble (>= 1.4.2),
        dplyr (>= 0.7.5),
        tibbletime (>= 0.1.1),
        glue (>= 1.2.0),
        purrr (>= 0.2.5),
        rlang (>= 0.2.0),
        readr (>= 1.1.1),
        stringr (>= 1.3.1),
        ssnapinterface (>= 0.0.1)

Next, within your functions, rather than just calling a function by name, use the package name next to the function. For example, instead of calling mutate() from the dplyr package, refer to it as dplyr::mutate(), which tells R you mean the mutate function from the dplyr package rather than potentially any other package. There are ways to declare functions you are using a lot within an individual file – but this method makes things pretty foolproof.
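As a small, hypothetical illustration (the function and column names here are invented, and dplyr would also need to appear in Imports):

# A package function with every non-base call fully qualified.
count_by_speciality <- function(admissions) {
  counted <- dplyr::count(admissions, speciality)
  dplyr::arrange(counted, dplyr::desc(n))
}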

Another tip is to avoid the magrittr pipe within package functions. Whilst magrittr makes analysis scripts nice and clean, firstly you still have the namespace issue to deal with (%>% is actually a function, just one with a funny name – it is really called magrittr::`%>%`()!). Secondly, the way magrittr works can make debugging difficult. You don’t tend to see that from a script, but if you’re writing code in a package, which calls a function in another package, which calls code in another package, which uses magrittr – you end up with a really horrid nest of debugging errors. It is better to specify each step with a single variable which is reused.
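A hedged sketch of the difference, using invented data and column names – the piped, script-style version is shown as a comment, followed by the package-friendly version with one named step per line:

# Script style, with the magrittr pipe:
#   result <- admissions %>%
#     dplyr::filter(speciality == "Stroke") %>%
#     dplyr::count(ward)

# Package style: one named intermediate per step, which is easier to debug.
stroke_admissions <- dplyr::filter(admissions, speciality == "Stroke")
result <- dplyr::count(stroke_admissions, ward)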

When you’ve got your code in, the next important thing to do is check your package. Build simply makes sure your code works; Check makes sure that you follow a lot of the ‘rules’ of package making – including making sure R can safely and clearly know where every function has come from. Check also demands that all R functions are documented, something which is outside the scope of this tutorial and is probably the subject for another blog post. A documented function means that if you type ?greet_person you should be able to see how to use the function appropriately, and documentation can also help you create a website for your package using the pkgdown package.

    If your package both builds and checks completely and without errors or warnings, you might want to think about allowing the wider public to use your project. To do this, you should consider submitting your project to CRAN. This involves a fairly rigorous checking process but means anyone can download and use your package.

    If we can get enough people to develop, share their code and upload their packages to CRAN we can work together to improve the use of R across our community.

    Feedback and responses to @drewhill79.

    This blog was written by Dr. Andrew Hill, Clinical Lead for Stroke at St Helens and Knowsley Teaching Hospitals.

  • Roadmap to collaborative working using R in the NHS: Part I- Workflows

    We finally have a tool that is data independent. R allows the scripting of a data science methodology that we can share and develop and which does not necessarily depend on carrying patient data with it. This means that the door to working collaboratively on solutions is now wide open. But before this happens in a truly robust way there are several hurdles that need to be overcome.  This is the first part of a series of articles in which I hope to outline the roadmap to networking our solutions in R so we can work collaboratively between Trusts to accelerate data science solutions for healthcare data.

    Standardising R Workflows in the NHS

In the throes of a new(ish) tool such as R, everybody has a way of working which is slightly different from everyone else’s. This is how best practice evolves, but in the end best practice has to be standardised so that developers can work together. This article outlines my best practice for R script organisation and is an invitation for comment, to see if there is any interest in developing an NHS standard for working and organising our solutions.

    Principles of R script development

As the R development environment is so methods based, it makes sense to have a standardised way to develop scripts, so that different developers understand the basic workflow of the data and can focus on the specific methodology for the specific problem, rather than disentangle endless subsets of data and how each is cleaned and merged. The principles below are the ones I use when developing a script, and together they form a useful approach to R script development.

    a) An R script should focus on a specific problem.

    A solution is only as good as the question. Questions come in a variety of scopes and shapes and there is an art to asking a solvable question which is beyond the limits of this article.

Having defined a question, a script should set out to be the solution to that question and not a generic answer. Generic answers belong as functions or collections of functions called packages. An R script should tackle a specific problem such as ‘How many endoscopies did we perform last year?’. If you find that this kind of question is asked a lot (‘How many x did we perform in the last y years?’), then the script might become a function, and a collection of functions might become a package.

    b) The R script should access the minimal dataset needed and avoid intermediate datasets.

There is a real danger with data analysis that the dataset used is huge when you only need a part of it. With R you have the ability to specify the data used from within the script, so you should use the minimum dataset that is pertinent to the question; in fact the whole script should be specific to the question being asked. The data access should, as far as possible, also be from the data repository rather than an intermediate dataset. For example, you can run a SQL query from within R against an electronic patient record (EPR) rather than get a data dump from the EPR into, say, an Excel spreadsheet and then import the spreadsheet. It’s more secure and avoids versioning issues with the data dump.
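One way of doing that, sketched here with the DBI and odbc packages, is to send the query directly from the script; the DSN, table and column names below are placeholders for your own EPR:

library(DBI)
library(odbc)

# Placeholder connection details: substitute your own DSN or server/credentials.
con <- dbConnect(odbc::odbc(), dsn = "EPR_WAREHOUSE")

# Pull only the minimal dataset needed for the question being asked.
colonoscopies <- dbGetQuery(con, "
  SELECT PatientID, ProcedureDate, Endoscopist
  FROM EndoscopyProcedures
  WHERE ProcedureType = 'Colonoscopy'
    AND ProcedureDate >= '2018-01-01'
")

dbDisconnect(con)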

    c) An R script should be organised according to a standardised template

    All analysis I perform adheres to a specific workflow for each script so that the script is separated into specific subsections that perform types of actions on the data. This also incorporates the important aspect of commenting on each part of the workflow. This is important so that developers can understand the code further down the line. The workflow I use is as follows:

    ## Title ##

    ## Aim ##

    ## Libraries ##

    ## Data access ##

    ## Data cleaning & Data merging ##

    ## Data mutating (creating columns from pre-existing data) ##

    ## Data forking (filtering and subsetting) ##

## Documenting the sets (usually creation of consort-type diagrams with diagrammeR) ##

    ## Documenting the code with CodeDepends ##

    ## Data analysis ##

    Data access

     The title of the script including the author and date is written at the top. The aim of the script is then stated along with any explanation (why am I doing this etc.). The workflow makes sure that all libraries are loaded at the beginning. Data access is also maintained at the top so anyone can immediately see the starting point for the analysis. Data access should specify the minimal dataset needed to answer the specific question of the script as explained above. For example there is no point using a dataset of all endoscopies between 2001 and 2011 when your script is only looking at colonoscopy specifically. I also try to avoid functions such as file.choose() as I like to keep the path to the source file documented, whether it is a local or remote repository.

    Data cleaning & Data merging

     The difficult task of data cleaning and merging with other datasets is then performed. One of my data principles is that when working with datasets you should start with the smallest dataset that answers all of your questions and then filter down for each question rather than build up a dataset throughout a script, so I like to merge external datasets early when possible. This could be called early binding technically but given the data access part of the code specifies a data set that is question-driven, I am early binding to a late-bound set (if that makes sense).

    Data mutating

     Once cleaning and merging is performed, subsets of data for specific subsections of the question can be done and then the statistical analysis performed on each subset as needed.

    I find it very useful to keep track of the subsets of data being used. This allows for sanity checking but also enables a visual overview of any data that may have been ‘lost’ along the way. I routinely use diagrammeR for this purpose which gives me consort type diagrams of my dataset flows.
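A minimal sketch of that kind of consort-style flow diagram, using DiagrammeR’s grViz() function (the stages and counts below are invented):

library(DiagrammeR)

grViz("
  digraph consort {
    node [shape = box]
    all   [label = 'All endoscopies (n = 12,000)']
    colon [label = 'Colonoscopies (n = 4,500)']
    final [label = 'Complete colonoscopies analysed (n = 4,200)']
    all -> colon -> final
  }
")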

    The other aspect is to examine the code documentation and for this I use codeDepends which allows me to create a flow diagram of my code (rather than the data sets). Using diagrammeR and codeDepends allows me to get an overview of my script rather than trying to debug line by line.

    d) R scripts should exist with a standardised folder structure for each script

R scripts often exist within a project: you may be outputting image files you want access to later, as well as needing other supporting files. RStudio maintains R scripts as a project and creates a file system around each script, and several packages will also create a file dependency structure for a script, so that at the very least the organisation around the R script is easy to navigate. There are several ways to do this, and some packages exist that will set this up for you.

    e) R files should have a standard naming convention.

This is the most frustrating problem when developing R scripts. I have a few scripts that extract text from medical reports, a few that do time series analysis on patients coming to have endoscopies, and again a few that draw Circos plots of patient flows through the department. In the end that is a lot of scripts. There is a danger of creating a load of folders with names like ‘endoscopyScripts’ and ‘timeSeries’ that don’t categorise my scripts according to any particular system. The inevitable result is lost scripts and repeated development. Categorisation and labelling systems are so important because they prevent you re-inventing the same script. As the entire thrust of what I do with R is, in the end, to develop open-source packages, I choose to name scripts and their folders according to the questions I am asking. The naming convention I use is as follows:

    Top level folder: Name according to question domains (defined by generic dataset)

Script name: defined by the question asked in the script, e.g. dataset_FinalAnalysis

The R developer will soon come to realise which question domains are most frequently asked, and I would suggest that these are used as the naming convention for top-level folders. I would avoid categorising files according to the method of analysis. As an example, I develop a lot of scripts for the extraction of data from endoscopies. In general I either do time series analysis on them or I do quality analysis. The questions I ask of the data are things like: ‘How many colonoscopies did we do last year?’ or ‘How are all the endoscopists performing when measured by their diagnostic detection rates for colonoscopy?’. I could name the files ‘endoscopyTimeSeries’ or ‘endoscopyQualityAssess’, but this mechanistic labelling doesn’t tell me much. By using question-based labelling I can start to see patterns when looking over my files. According to my naming convention I should create a folder called ‘Endoscopy’, and the script names should be ‘Colonoscopies_DxRate’ and ‘Colonoscopies_ByYear’. The next time I want to analyse a diagnostic rate, maybe for a different dataset like gastroscopies, I can look through my scripts, see I have done a similar thing already, and re-use it.

    In the end, the role of categorizing scripts in this way allows me to see a common pattern of questions. The re-usability of already answered questions is really the whole point of scripting solutions. Furthermore it allows the deeply satisfying creation of generic solutions which can be compiled into functions and then into packages. This has already been expanded upon here.

    f) Always use a versioning system for your R scripts

    R scripts need to be versioned as scripts may change over time. A versioning system is essential to any serious attempt at providing solutions. There are two aspects to versioning. Firstly the scripts may change and secondly the packages may change. Dependency versioning can be dealt with by using checkpoints within the scripts. This essentially freezes the dependency version so that the package that worked with the current version of the script can be used. I have found this very useful for the avoidance of script errors that are out of my control.
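Assuming this refers to the checkpoint package (my reading of it), a script can pin its dependencies to the package versions available on a chosen date with a couple of lines near the top; the date below is arbitrary:

library(checkpoint)

# Use the versions of all packages as they stood on CRAN on this date.
checkpoint("2019-06-01")

library(dplyr)  # now loaded from the date-stamped snapshot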

The other issue is that of versioning between developers. I routinely use GitHub as my versioning system. This is not always available from within Trusts, but there are other versioning systems that can be used in-house only. Whichever is used, versioning to keep track of the latest workable scripts is essential to prevent chaos and insanity from descending on the development environment. A further plea for open-source versioning is the greater aspiration of opening up R scripts in healthcare to the wider public, so that everyone can have a go at developing solutions and perhaps we can start to network solutions between trusts.

    Conclusion:

There is a greater aim here, which I hope to expand on in a later article: the development of a solutions network. In all healthcare trusts, even outside the UK, we have similar questions, albeit with different datasets. The building blocks I have outlined above are really a way of standardising in-house development using R so that we can start to share solutions between trusts and understand each other’s solutions. A major step forward in sharing solutions would be to develop a way of sharing (and by necessity categorising) the questions we have in each of our departments. Such a network of solution sharing in the NHS (and beyond) would require a CRAN (or rOpenSci) type pathway of open-source building and validation, as well as standardised documentation, but once this is done it would represent a step change in analytics in the NHS. The steps I have shared above certainly help me in developing solutions, and certainly help in the re-use of what I have already built rather than re-creating solutions from scratch.

    This blog was written by Sebastian Zeki, Consultant Gastroenterologist at Guy’s and St Thomas’ NHS Foundation Trust.

  • From script-based development to function-based development and onwards to Package Based development

    At the NHS R Conference, I suggested to people that they should embrace the idea of package-based development rather than script-based work.

    I’m going to talk you through that process, using the simplest of scripts – ‘Hello World’. I’m going to assume that you’re using the freely available RStudio Desktop Edition as the editor for this: other versions of RStudio are likely to be essentially identical. Non R-Studio users may need to revert to more basic principles.

    First let’s write ‘Hello World’ – the simplest R script in the world. Open a new R script file and get your script underway:

message("Hello World from the NHS-R Community!")

Save it (for posterity).

    In the conference, we discussed that generally writing functions is more helpful than writing scripts as it gives you greater re-usability. This example is a very trivial one (in fact so trivial as to be nearly redundant).

    So we consider our script carefully and determine – what does it DO? Clearly it’s a way of greeting a person. What if we wanted to greet the person in a different way? What if we wanted to greet a different person?

So we have defined its purpose, and the parameters that are likely to be useful to others.

    Let’s re-write our script to be more useable.

    We define a function using the function function. You can see a much more detailed tutorial of how to do this here: http://stat545.com/block011_write-your-own-function-01.html

    A function is defined by assigning the result of the function() function to a variable which is the function name. The parameters of function() are our new parameter names in our function.

It is really important to name your function clearly so people know what it does. Generally use active verbs to describe intent, and a consistent naming scheme. Also choose appropriate and clear names for the parameters. So let’s call our new function greet_person, and we will call our parameters greeting and sender.

    Our new code will look like this. Stick this into a new R script for now and run it:

    greet_person <- function(greeting, sender) {
    
      message(greeting, " from ", sender)
    
    }

    Once you’ve run your script you can now call your function from the console:

greet_person("Hello World", "the NHS-R Community!")

And of course if we want to use a different greeting we can now change our parameter value:

greet_person("Welcome", "the NHS-R Community!")

So far so good.

    But – we’ve had to repeat our sender parameter. What if we know we’re usually going to use that first Hello World greeting; but we just want the option of doing something different if the situation arises?

    We can get around that by supplying default values. In the function() function we can set a value to both greeting and sender using =. Let’s set default values for greet_person:

    greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") 
    
    {
      
      message(greeting, " from ", sender)
    
    }

    Now if you want our ‘default’ message you can just call:

    greet_person()

    But you can customise either parameter without having to specify anything you don’t want to change:

    greet_person(sender = "Drew Hill")

Notice how I’ve specified which parameter I’m supplying? If I don’t do that, R doesn’t know whether we want to specify the greeting or the sender (and assumes you mean the first parameter unless told otherwise). I’d strongly recommend specifying every parameter name in every function you call, unless you have a function which only has one parameter. That means if you (or someone else) changes the function later, you’ve been explicit and it won’t get misunderstood.

    OK – so far so good.

    Any sort of software is best to be robust though: so what happens if we abuse our newly created function? The golden rule of creating robust code is ‘fail early, fail cleanly and fail clearly’.

    Our function is clearly designed to use strings. The good news for us is that many things can be converted to strings: so let’s say you provided a number into one of those parameters, it will work:

    greet_person(sender = 1)

    Instead of “Drew Hill” from our previous example, you’ll see the sender is “1”.

    What if you accidentally sent a vector of names? R will turn those into a concatenated string of names without spaces:

    greet_person(sender = c("Bob", "Jim"))

    Some things however certainly could break this process – so it is really important to check that you can handle the inputs you receive within a function before trying to use them.

The first thing we need to do is make sure we are dealing with a character string. We can check that by using the is.character function, which returns TRUE if a given value is a character vector, and FALSE if it is not.
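A couple of quick checks at the console show how it behaves:

is.character("Hello World")  # TRUE
is.character(42)             # FALSE
is.character(NULL)           # FALSE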

    If is.character is false, we want to stop with an error:

    greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {
    
      if (!is.character(greeting)) {
    
        stop("greeting must be a string")
    
      }
    
      if (!is.character(sender)) {
    
        stop("sender must be a string")
    
      }
    
    
      message(greeting, " from ", sender)
    
    }

    We can test how this works by using NULL as a parameter: in real life this happens quite a lot as you try to pass a variable to your new function but forget to set the variable earlier on!

    greet_person(sender = NULL)

    We also know that our function actually isn’t very good at handling vectors of strings (ie where there is more than one name): it will simply shove them all together without spaces. However it works and is perfectly functional. So we have a design decision: do we want to allow that, or not? A third way might be to allow it but to use a warning – perhaps a little over the top in our example, but for complex examples that may make more sense. Whereas stop will halt the code and force you to fix your bugs, the warning() function lets the code continue but tells you to go back and do it better later. Let’s add a warning if there was more than one sender:

    greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {
    
      if (!is.character(greeting)) {
    
        stop("greeting must be a string")
    
      }
    
      if (!is.character(sender)) {
    
        stop("sender must be a string")
    
      }
    
    
      if (length(sender) > 1) {
    warning("greet_person isn't very good at handling more than one sender. It is better to use just one sender at a time.")
    
      }
    
      message(greeting, " from ", sender)
    
    }

    If we now called the function with two senders we’d be able to do so but would get politely told that it’s not a good idea:

    greet_person(sender = c("Jim", "Bob"))

So – hopefully from this you’ve moved from having a script which would only do precisely what you wanted in a single set of circumstances, to having a natty little function which will greet whoever you want, with the type of greeting that you want.

    As an exercise to complete: imagine you work in the NHS-R community welcoming team. You are tasked with sending greetings from the team on a regular basis.

You used to use a script to do this and had to remember to get the style right every time – but now you sit at your console, run your script containing your function, and call greet_person() on demand.

    Your boss has come to you and urgently wants you to change your way of working. Rather than sending a greeting from the team using just a single team name, he wants you to send the individual names in the greeting from both Jim and Bob.

    Have a think about how you could change the function so that we can cope with multiple senders.
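If you want to compare notes afterwards, one possible approach (a sketch of mine, not necessarily the intended answer) is to collapse multiple senders into a single string before building the message:

greet_person <- function(greeting = "Hello World", sender = "the NHS-R Community!") {

  if (!is.character(greeting)) {
    stop("greeting must be a string")
  }

  if (!is.character(sender)) {
    stop("sender must be a string")
  }

  # Collapse several senders into one readable string, e.g. "Jim and Bob".
  sender <- paste(sender, collapse = " and ")

  message(greeting, " from ", sender)
}

greet_person(sender = c("Jim", "Bob"))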

    The greetings will continue as we then think about scaling up the NHS R Community Greetings division in our next installment.

    This blog was written by:

    Dr. Andrew Hill

    Clinical Lead for Stroke, St Helens and Knowsley Teaching Hospitals

  • Installing R and R studio

    Installation Instructions (Part 1 of 2)

    Windows Users

    To Install R:

    1. Open an internet browser and go to r-project.org.
    2. Click the “download R” link in the middle of the page under “Getting Started.”
    3. Select a CRAN location (a mirror site) and click the corresponding link.
    4. Click on the “Download R for Windows” link at the top of the page.
    5. Click on the “install R for the first time” link at the top of the page.
    6. Click “Download R for Windows” and save the executable file somewhere on your computer.  Run the .exe file and follow the installation instructions.
    7. Now that R is installed, you need to download and install RStudio.

    To Install RStudio

    1. Go to rstudio.com and click on the “Download RStudio” button.
    2. Click on “Download RStudio Desktop.”
    3. Click on the version recommended for your system, or the latest Windows version, and save the executable file.  Run the .exe file and follow the installation instructions.

    Mac Users

    To Install R

    1. Open an internet browser and go to r-project.org.
    2. Click the “download R” link in the middle of the page under “Getting Started.”
    3. Select a CRAN location (a mirror site) and click the corresponding link.
    4. Click on the “Download R for (Mac) OS X” link at the top of the page.
    5. Click on the file containing the latest version of R under “Files.”
    6. Save the .pkg file, double-click it to open, and follow the installation instructions.
    7. Now that R is installed, you need to download and install RStudio.

    To Install RStudio

    1. Go to rstudio.com and click on the “Download RStudio” button.
    2. Click on “Download RStudio Desktop.”
    3. Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.

    https://courses.edx.org/courses/UTAustinX/UT.7.01x/3T2014/56c5437b88fa43cf828bff5371c6a924/

    Installation Instructions (Part 2 of 2)

    Installing Required Packages

1. Open RStudio (not R) and go to the “Tools” option on the menu bar (Mac users: choose “Not Now” when asked if you want to install developer tools).
2. Select the option “Install Packages…”. In the text box for packages, type the following (each package name separated by a space): tidyverse readxl lubridate zoo gridExtra padr
3. Click the Install button (Mac users: again, choose “Not Now” to developer tools).

    Text and progress bars should start appearing on your screen. If you are prompted about installing packages which need compilation, type y and press Enter.
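Equivalently, if you prefer typing to menus, the same packages can be installed with a single command at the console:

install.packages(c("tidyverse", "readxl", "lubridate", "zoo",
                   "gridExtra", "padr"))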

     Note: If the warning and error messages (similar to below) appear, it may be that you do not have administrator privileges enabled. If you have administrator privileges on your machine then you may be able to address this issue with the following steps: Close RStudio, find the RStudio icon again and right click it. Select “Run as administrator”. Confirm that you want this program to make changes to the computer. Repeat the instructions above (Tools –> Install Packages… )

Warning in install.packages :
  'lib = "C:/Program Files/R/R-3.4.0/library"' is not writable

Error in install.packages : unable to install packages

    Once the installation has finished, click beside the command prompt in the Console window (it’s the arrow symbol, > , without text afterward). The cursor should be flashing.

    Type (rather than copy and paste) the following, carefully (it is case sensitive):

    tidyverse::tidyverse_lo

    At this point the autocomplete feature should provide the option “tidyverse_logo” beneath the text you are typing:

    Press Enter to select this option.

    Then press Enter to execute the line of code.

    You should see the logo (as below), meaning you have successfully installed the tidyverse!