• We need you: share your COVID-19 work on NHSE/I’s regular mini-huddle series

    COVID-19 has brought the value of data and analytics to the fore. This has meant an unprecedented focus on the work of health and care analysts, but also huge demand for local analytical services. The need to share and share alike is paramount. To support this, NHSE/I are holding a series of huddles and mini-huddles to allow analysts to share their COVID-19 work with a wider community.

    Weekly huddles showcase work on a specific topic, and the back catalogue of recordings can be found here. The mini-huddles are a spin-off series where speakers get into the detail behind a single topic, like geo-mapping or a dashboard. The team curating these huddles are keen to have a mini-huddle on COVID-19 analysis in R. If you are interested in sharing your COVID-19 analysis in R, please get in touch with Sophie Hodges by email at sophie.hodges5@nhs.net.

    If you haven’t already, please sign up to the COVID-19 data and analytics workspace here. The community is currently 6,500-strong, with members across the NHS, public health, local authorities, charities and commercial organisations. Mini-huddles usually draw an audience of between 30 and 80 attendees, so it’s a great opportunity to share your work, discuss findings and network with others working on similar issues.

    Please get in touch if this sounds of interest. And a huge thank you from NHSE/I. The work being done by the analytical community is a key plank of the COVID-19 effort.

  • NHSRdatasets meets runcharter

    This post was originally published on johnmackintosh.com but has kindly been re-posted to the NHS-R community blog.

    Background

    The NHSRdatasets package made it to CRAN recently, and as it is designed for use by NHS data analysts, and I am an NHS data analyst, let’s take a look at it. Thanks to Chris Mainey and Tom Jemmett for getting this together.

    Load packages and data

    First, let’s load what we need for this session. The runcharter package is built using data.table, but I’m using dplyr in this main section to show that you don’t need to know data.table to use it.

    library(NHSRdatasets)
    library(runcharter) # remotes::install_github("johnmackintosh/runcharter")
    library(dplyr)
    library(skimr)
    

    However – seriously, do take a look at data.table. It’s not as hard to understand as some might have you believe, and a little bit of effort pays off. You can install the runcharter package from GitHub using the remotes package. (I’ve managed to install it on Windows and Ubuntu. Mac OS? No idea, I have no experience of that.)
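
    For completeness, the one-off installation step (as per the comment in the code above) is:

    # install remotes from CRAN first if needed
    # install.packages("remotes")
    remotes::install_github("johnmackintosh/runcharter")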

    ae <- data("ae_attendances") # returns the dataset name as a string and loads the data
    ae <- ae_attendances # now 'ae' is the actual data frame
    rm(ae_attendances) # tidy up the duplicate - 'ae' keeps the data
    

    That felt a bit glitchy. There has to be a sleeker way to load and assign a built-in dataset, but I couldn’t find one. Cursory google here.
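
    One possibly sleeker alternative – assuming the package lazy-loads its data, so the dataset can be referenced straight from the namespace – is simply:

    # reference the dataset directly; no data() call or clean-up needed
    ae <- NHSRdatasets::ae_attendances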

    Let’s have a look at the data:

    glimpse(ae)
    
    ## Observations: 12,765
    ## Variables: 6
    ## $ period      <date> 2017-03-01, 2017-03-01, 2017-03-01, 2017-03-01, 2...
    ## $ org_code    <fct> RF4, RF4, RF4, R1H, R1H, R1H, AD913, RYX, RQM, RQM...
    ## $ type        <fct> 1, 2, other, 1, 2, other, other, other, 1, other, ...
    ## $ attendances <dbl> 21289, 813, 2850, 30210, 807, 11352, 4381, 19562, ...
    ## $ breaches    <dbl> 2879, 22, 6, 5902, 11, 136, 2, 258, 2030, 86, 1322...
    ## $ admissions  <dbl> 5060, 0, 0, 6943, 0, 0, 0, 0, 3597, 0, 2202, 0, 0,...
    

    Lots of factors. I’m actually very grateful for this package: it caused me major issues when I first tried to plot the data using an earlier version of runcharter, because I hadn’t considered factors as a possible grouping variable – a pretty big miss, as all the facets were out of order. All sorted now.

    There’s way too much data for my tiny laptop screen, so I will filter the data for type 1 departments – the package help gives us a great link explaining what this means.

    type1 <- ae %>% 
      filter(type == 1) %>% 
      arrange(period)
    
    # plot attendances
    p <- runcharter(type1,
                    med_rows = 13, # median of first 13 points
                    runlength = 9, # find a run of 9 consecutive points
                    direction = "above", # find run above the original median
                    datecol = "period",
                    grpvar = "org_code",
                    yval = "attendances")
    

    The runcharter function returns both a plot and a data.table/data.frame showing a summary of any runs in the desired direction. (I’m assuming folk reading this have a passing knowledge of run charts; if not, take a look at the package vignette, which is the cause of most of my commits!)

    Don’t try loading the plot right now, because it is huge and takes ages. If we look at the summary dataframe, we can see 210 rows, a fairly decent portion of which relate to significant increases above the original median value.

    p$sustained
    
    ##      org_code median start_date   end_date  extend_to  run_type
    ##   1:      R0A  21430 2017-10-01 2018-10-01 2019-03-01  baseline
    ##   2:      R1F   3477 2016-04-01 2017-04-01 2017-05-01  baseline
    ##   3:      R1H  28843 2016-04-01 2017-04-01 2019-03-01  baseline
    ##   4:      R1K  11733 2016-04-01 2017-04-01 2019-03-01  baseline
    ##   5:      RA2   5854 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  ---                                                           
    ## 206:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 207:      RLT   6977 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 208:      RQ8   8456 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 209:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 210:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained
    

    Let’s use skimr to get a sense of this:

    skimr::skim(p$sustained)
    
    ## Skim summary statistics
    ##  n obs: 210 
    ##  n variables: 6 
    ## 
    ## -- Variable type:character -----------------------------------------------------------
    ##  variable missing complete   n min max empty n_unique
    ##  run_type       0      210 210   8   9     0        2
    ## 
    ## -- Variable type:Date ----------------------------------------------------------------
    ##    variable missing complete   n        min        max     median n_unique
    ##    end_date       0      210 210 2017-04-01 2019-03-01 2017-04-01        9
    ##   extend_to       0      210 210 2017-05-01 2019-03-01 2019-03-01        7
    ##  start_date       0      210 210 2016-04-01 2018-07-01 2016-04-01        9
    ## 
    ## -- Variable type:factor --------------------------------------------------------------
    ##  variable missing complete   n n_unique                     top_counts
    ##  org_code       0      210 210      139 RA4: 3, RDD: 3, RDE: 3, RGN: 3
    ##  ordered
    ##     TRUE
    ## 
    ## -- Variable type:numeric -------------------------------------------------------------
    ##  variable missing complete   n   mean      sd   p0     p25  p50      p75
    ##    median       0      210 210 9389.8 4317.54 3477 6468.25 8413 11311.25
    ##   p100     hist
    ##  29102 ▆▇▅▁▁▁▁▁
    

    To keep this manageable, I’m going to filter for organisations with median attendances above 10,000 (based on the first 13 data points).

    high_admits <- p$sustained %>% 
      filter(median > 10000 & run_type == "sustained") %>%
      select(org_code)
    

    Then I change the org_code from factor to character, and pull out the unique values. I’m sure there is a slicker way of doing this, but it’s getting late, and I don’t get paid for this…

    I use the result to create a smaller data frame

    high_admits$org_code <- as.character(high_admits$org_code)
    
    
    type1_high <- type1 %>% 
      filter(org_code %in% high_admits$org_code)
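
    On that ‘slicker way’ – a single dplyr pipeline along these lines would probably do the same job without the intermediate object (a sketch, not what I actually ran):

    # pull the qualifying org codes straight out as a character vector
    high_admit_codes <- p$sustained %>% 
      filter(median > 10000 & run_type == "sustained") %>% 
      pull(org_code) %>% 
      as.character() %>% 
      unique()
    
    type1_high <- type1 %>% 
      filter(org_code %in% high_admit_codes)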
    

    And now I can produce a plot that fits on screen. I’ve made the individual scales free along the y-axis, and added a title, subtitle and caption.

    p2 <- runcharter(type1_high,
                     med_rows = 13, # median of first 13 points as before
                     runlength = 9, # find a run of 9 consecutive points
                     direction = "above",
                     datecol = "period",
                     grpvar = "org_code",
                     yval = "attendances", 
                     facet_scales = "free_y",
                     facet_cols = 4,
                     chart_title = "Increased attendances in selected Type 1 AE depts",
                     chart_subtitle = "Data covers 2016/17 to 2018/19",
                     chart_caption = "Source : NHSRDatasets",
                     chart_breaks = "6 months")
    

    Let’s look at the sustained dataframe:

    p2$sustained
    
    ##     org_code median start_date   end_date  extend_to  run_type
    ##  1:      RCB   9121 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  2:      RDD  11249 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  3:      RDE   7234 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  4:      RGN   7912 2016-04-01 2017-04-01 2017-05-01  baseline
    ##  5:      RJ1  12240 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  6:      RJE  14568 2016-04-01 2017-04-01 2018-05-01  baseline
    ##  7:      RJL  11262 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  8:      RQM  16478 2016-04-01 2017-04-01 2018-03-01  baseline
    ##  9:      RRK   9584 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 10:      RTE  11303 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 11:      RTG  11344 2016-04-01 2017-04-01 2018-07-01  baseline
    ## 12:      RTR  10362 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 13:      RVV  12700 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 14:      RW6  22114 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 15:      RWE  12275 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 16:      RWF  11939 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 17:      RWP   9976 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 18:      RXC   9396 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 19:      RXH  12494 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 20:      RXP  10727 2016-04-01 2017-04-01 2017-05-01  baseline
    ## 21:      RYR  11578 2016-04-01 2017-04-01 2018-03-01  baseline
    ## 22:      RCB  10062 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 23:      RDD  12093 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 24:      RDE   7637 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 25:      RGN  11896 2017-05-01 2018-01-01 2018-05-01 sustained
    ## 26:      RJ1  13570 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 27:      RJE  15183 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 28:      RJL  11972 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 29:      RQM  18560 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 30:      RRK  29102 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 31:      RTE  11772 2017-05-01 2018-01-01 2018-05-01 sustained
    ## 32:      RTG  17169 2018-07-01 2019-03-01 2019-03-01 sustained
    ## 33:      RTR  10832 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 34:      RVV  13295 2017-05-01 2018-01-01 2018-03-01 sustained
    ## 35:      RW6  22845 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 36:      RWE  18173 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 37:      RWF  12793 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 38:      RWP  10358 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 39:      RXC  10279 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 40:      RXH  13158 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 41:      RXP  11314 2017-05-01 2018-01-01 2019-03-01 sustained
    ## 42:      RYR  11970 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 43:      RDD  12776 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 44:      RDE  15322 2018-03-01 2018-11-01 2019-03-01 sustained
    ## 45:      RGN  12473 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 46:      RTE  12610 2018-05-01 2019-01-01 2019-03-01 sustained
    ## 47:      RVV  14582 2018-03-01 2018-11-01 2019-03-01 sustained
    ##     org_code median start_date   end_date  extend_to  run_type
    

    And of course, the plot itself

    p2$runchart
    

    I haven’t looked into the actual data too much, but there are some interesting little facets here – what’s the story with RDE, RRK and RTG for example? I don’t know which Trusts these codes represent, but they show a marked increase. Of course, generally, all areas show an increase at some point.

    RGN (top right) and RVV (mid left) show the reason why I worked on this package – we can see that there has been more than one run above the median. Performing this analysis in Excel is not much fun after a while.

    There is a lot more I can look at with this package, and we in the NHS-R community are always happy to receive more datasets for inclusion, so please contribute if you can.

  • How NHS-R Community do The Apprentice…

    By Zoe Turner

    One of the tasks on The Apprentice a number of years ago was for the contestants to put on a corporate event, at no small cost to the people attending, I might add. It’s a tale often told because one of the contestants was gluten free and no one had accounted for dietary needs amongst the contestants, so the poor ‘gluten free lady’, as she was known, was served a fruit salad.

    The reason I’m apparently going off on a tangent so early in a blog is that it struck me that The Apprentice is all about throwing people in at the deep end and seeing how they cope. It’s entertainment, but it clashes with the premise that these are potential apprentices to a ‘British business magnate’ (as Wikipedia calls him). Contrast this with NHS-R and how I came to be attending the Train the Trainer event at the end of 2019 and then helped to run the first of three local courses this January, having only started learning R around two years ago.

    Analysts have many expectations made of them. They have to be technical, able to interpret data and able to communicate clearly with non-analysts. Very rarely, though, will an analyst be expected to train others. Some may produce or present training to support or mentor fellow analysts, but even then my experience has always been on the receiving end. Coupled with the fact that I’ve never really had a burning desire to teach, it was a surprise to find myself on a course on how to deliver the NHS-R ‘Introduction to R’ workshop.

    The reason I did it is that my involvement with NHS-R has led naturally to training others. I started by attending the course myself, then presented at the conference and facilitated an Introduction Course run by NHS-R but hosted by my Trust. After that, I didn’t hesitate to agree to take on the training.

    NHS-R Community held their first two-day Train the Trainer event in Birmingham, organised through AphA (Association of Professional Healthcare Analysts). I was supported to go on this by my manager, Chris Beeley, who is a huge advocate of R and Shiny. Whilst he has run several workshops over the years, I, notably, have run zero!

    At the TtT (got to get an acronym in there) I had the opportunity to meet lots of motivated people from around the British Isles who were as keen as I was, not only to learn how to teach R but also to talk about data – that happened quite a bit in the breaks. We had an opportunity to present to each other, and that was useful as I learn especially from watching others. Everyone has their own style and it gets perfected over time but I was hugely impressed by how engaging people were and how quickly they could read about a subject that was new to them (we looked at the RStudio presentation slides https://education.rstudio.com/teach/materials/) and then go on to express clearly what they’d just learned.

    I could go on about what I learned at the course, but the proof of its value is in what I did with it. And so on 17th January, Chris and I held a workshop for 15 people from various organisations; some had travelled as far as London and Portsmouth, such is the commitment to learn R. Chris led the workshop and I did the live coding/support role which took the edge off that ‘first time’ feeling.

    This idea of having at least two people in a workshop is a good one, even when the trainer is very experienced. Chris, for example, is very used to running workshops alone, but inevitably people get stuck or things don’t work as they should and so I did the helping while Chris moved between training and coding. It felt, to me at least, like the burden was shared. It helped to ensure that no-one in the group was allowed to fall behind so far that they just gave up.

    Chris and I had gone through the slides beforehand as he’d not gone on the TtT course, and having not written them himself, I wanted to make sure he’d know what was covered. What reassured me was that, as he presented the course, there wasn’t a part of it that I didn’t understand myself and couldn’t cover if I had to take over at that moment. And so the day wasn’t as nerve-racking as I anticipated, and I had fun – which must have been noticeable to the group, as I had an email commenting on how we are clearly a happy team!

    Whilst I haven’t actually run a workshop, I think the process I’ve gone through to get to this point has certainly built up my confidence to do it. I’ve taken every opportunity NHS-R community has offered, from doing the introduction to presenting at the conference, and so this next thing – to run the training myself – hasn’t been so scary after all. I feel like a happy and well-supported apprentice of NHS-R, and the great thing about NHS-R in this is that everyone can be an apprentice too – you just have to get involved.

    Being open-source, all the slides for the trainer course and the introduction course are available on GitHub:

    Train the Trainer course materials can be found at:

    https://github.com/nhs-r-community/train_the_trainer

    Course materials for the underlying Intro to R course are found at:

    https://github.com/nhs-r-community/intro_r

    Zoë Turner, Senior Information Analyst, Nottinghamshire Healthcare NHS Foundation Trust.

    @Letxuga007 and curator of @DataScienceNotts (formerly known as @AppliedInfoNotts)

  • NHS-R Community Conference II

    My journey to work takes me about an hour and a half, and I catch a couple of buses with Wi-Fi which means I can browse Twitter and invariably end up with hundreds of tabs open as I flit between articles and blogs. Most mornings I find it hard to concentrate on reading through entire articles, especially the really long ones, so I leave the tab open on my computer, often for days, before reading them. Given my experience of reading blogs, why would anyone want to read through mine about the NHS-R Community conference?

    If I’d gone to the conference I’d probably skim that paragraph thinking ‘yes, I went, I know how good it was’.

    If I’d not gone to the conference I’d probably skim that paragraph because I might prefer not to know just how great a conference was when I’d missed it!

    Even though the conference was moved to a bigger location to accommodate more people and around 250 people attended, I have still spoken to people who didn’t get a ticket or missed submitting an abstract to speak. People who never made the conference are talking about an event that is only in its 2nd year. What is going on? What is it that has made the event so successful?

    Organising an event of any size takes a lot of work and that is often overlooked. There were the core people who did the real work – the arrangements – and quite frankly, they made it look easy, which itself is an indication of how hard they worked. But there were others who were part of a committee that chipped in with bits they could help with: setting up a specific email; reading through abstracts; suggesting things the organisers might consider, like how to ensure diversity of questioners (https://oxfamblogs.org/fp2p/how-to-stop-men-asking-all-the-questions-in-seminars-its-really-easy/).

    That organising committee was made up from a group who have shown a particular interest in R, and as such I found myself part of that group. Now although I have submitted a few blogs to NHS-R, I only really started using R a couple of years ago. Deep down I’m still a SQL analyst and my contributions to the conference were pretty minimal, but I feel encouraged to make those small contributions (even that last one about who gets to ask the first question in seminars) and each small involvement builds up to a bigger thing. This really is feeling like an equal and inclusive group and that’s where I think this success is coming from.

    It may have been by design or it may be a happy accident, but there is a crucial clue in the name of this group that gives away its success – Community. This conference wasn’t managed top-down. There are some key people, of course, but they are as much a part of this Community as the people who contribute to the blogs, those who stood up on stage and showed their work, and those who will be learning to run the R Introduction training. This is our NHS-R Community.

    If you missed this year’s conference and want to go to the next one, get involved. The more people involved, the less work there is for everyone individually. Plus, given that tickets this year ran out in just 2 hours, you’ll be more likely to secure yourself a ticket.

    Speaking of which, provisional dates for the next conference are the 2nd and 3rd November 2020 (Birmingham). Now aren’t you glad you read this blog!

    Zoë Turner, Senior Information Analyst @AppliedInfoNott @Letxuga007

  • NHS Number Validation

    NHS Numbers – the backbone of the NHS

    NHS Numbers are the best. They are numbers that follow you through life, from birth to death, changing only in certain situations, and they are easy to check for validity from the digits themselves.

    What? You can check to see if a number is a genuine NHS Number without any other information?!

    Now, I’ve worked in the NHS for more years than I care to count and I never realised there was an algorithm to these numbers. I stumbled across this fact through Twitter of all places. Yes, Twitter, the place you’ll find NHS-R enthusiasts tweeting things like this:

    https://twitter.com/sellorm/status/1171858506057682944 which I saw retweeted by @HighlandR (John MacKintosh).

    I’ve only just started doing things on GitHub, but @Sellorm wasn’t very close in saying there may be millions of packages: in fact there are only two for R and, surprisingly, not a single one in SQL.

    I installed the package, had a quick play around and looked at the code on GitHub. The downside of this package, for me, was that you need to feed in your data as vectors, and I like using data tables. Plus, I didn’t necessarily want a list of outputs like TRUE, TRUE, FALSE, TRUE; I wanted to see the NHS numbers that aren’t valid. Still, I wouldn’t have got so far, so quickly, without @Sellorm’s code, excellent notes and thoroughly written Readme file, so the moral of the story is: even if a package doesn’t work for your task or problem, you may still be able to use parts of the code.

    The Readme on https://github.com/sellorm/nhsnumber, which you can see when you scroll down past all of the file-directory-looking things, included a link to the Wikipedia page on NHS Numbers and their validity check algorithm: https://en.Wikipedia.org/wiki/NHS_number. Again, I had no idea this existed!

    Wikipedia, like R, is open source, and one of the links was out of date. I’d found a really useful document on NHS Numbers a while back, https://www.closer.ac.uk/wp-content/uploads/CLOSER-NHS-ID-Resource-Report-Apr2018.pdf, so, in the spirit of open source, I updated the Wikipedia page. Proud moment!

    The algorithm is quite simple, and another package on GitHub, https://github.com/samellisq/nhsnumbergenerator, generates numbers using it, but I didn’t spend too long on these packages as I decided to write my own script: nothing fancy, no loops, functions or packages…

    library(tidyverse)
    
    Validity <- df %>% 
      mutate(length = nchar(NHSNumber),
             A = as.numeric(substring(NHSNumber,1,1)) *10,
             B = as.numeric(substring(NHSNumber,2,2)) *9,
             C = as.numeric(substring(NHSNumber,3,3)) *8,
             D = as.numeric(substring(NHSNumber,4,4)) *7,
             E = as.numeric(substring(NHSNumber,5,5)) *6,
             G = as.numeric(substring(NHSNumber,6,6)) *5,
             H = as.numeric(substring(NHSNumber,7,7)) *4,
             I = as.numeric(substring(NHSNumber,8,8)) *3,
             J = as.numeric(substring(NHSNumber,9,9)) *2,
             End = as.numeric(substring(NHSNumber,10,10)),
             Total = A+B+C+D+E+G+H+I+J,
             Remainder = Total %% 11,
             Digit = 11- Remainder,
             Summary = case_when(Digit == 10 ~ 999,
                                 Digit == 11 ~ 0,
                                 TRUE ~ Digit),
             Valid = case_when(Summary == End & length == 10 ~ TRUE,
                               TRUE ~ FALSE
                               )) %>% 
      filter(Valid == FALSE)
    

    Importing data

    Data is imported into R via Excel or, in my case, via a SQL connection, which worked better as a couple of our systems hold many hundreds of thousands of NHS Numbers and I wanted to check them all. Excel might have got a bit of indigestion from that. It also meant I wasn’t storing NHS Numbers anywhere else in order to run them through R. My systems are secure, but it’s always worrying having such large amounts of sensitive data in one place and outside an even more secure system like a clinical database.
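
    Something along these lines does the job (the driver, server, database and table names below are placeholders rather than the ones I actually used):

    library(DBI)
    library(odbc)
    
    # placeholder connection details - swap in your own secure server and database
    con <- dbConnect(odbc(),
                     Driver = "SQL Server",
                     Server = "my_server",
                     Database = "my_database",
                     Trusted_Connection = "yes")
    
    # pull the NHS numbers straight from the source system, no spreadsheet in between
    data <- dbGetQuery(con, "SELECT NHSNumber FROM my_table")
    dbDisconnect(con)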

    Data doesn’t always import into R the same way, and for mine I needed to remove NULLs and convert the NHSNumber column, which was a factor, to character and then to numeric format:

    df <- data %>% 
      filter(!is.na(NHSNumber)) %>% 
      mutate(NHSNumber = as.numeric(as.character(NHSNumber)))
    

    Factors

    Factors are a new thing for me but, as I understand it, they put data into groups, but in the background. For example, if you had male, female and other, these would be three factor levels, and if you joined that to another data set which didn’t happen to have “other” as a category, the level would still linger around in the data, appearing in a count as 0 because the table still holds the factor information even though there is no data in that group.
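
    A tiny made-up example shows the idea – the unused level still appears in a count, with zero rows behind it:

    sex <- factor(c("male", "female", "other"))
    kept <- sex[sex != "other"] # filter out 'other', but the level is retained
    table(kept)
    
    ## kept
    ## female   male  other 
    ##      1      1      0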

    To manipulate factor data I’ve seen others change the format to character and so that’s what I did. I’ve done similar things in SQL with integers and dates; sometimes you have to change the data to an ‘intermediate’ format.

    Validity code – explanation

    This code uses dplyr from tidyverse to add columns:

    Validity <- df %>% 
      mutate(length = nchar(NHSNumber), 
    

    An NHS number must be 10 characters and nchar() reminds me of LEN() in SQL which is what I would use to check the length of data.

    One thing I didn’t code, but I guess people may want to check, is that all characters are numeric and no letters have been introduced erroneously. That’s something to consider.
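
    If you did want that check, a one-liner along these lines would flag anything that isn’t exactly ten digits (a sketch, assuming the column is coerced to character first):

    # FALSE where the value contains letters, spaces or the wrong number of digits
    df %>% 
      mutate(ten_digits = grepl("^[0-9]{10}$", as.character(NHSNumber)))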

         A = as.numeric(substring(NHSNumber,1,1)) *10,
         B = as.numeric(substring(NHSNumber,2,2)) *9,
         C = as.numeric(substring(NHSNumber,3,3)) *8,
         D = as.numeric(substring(NHSNumber,4,4)) *7,
         E = as.numeric(substring(NHSNumber,5,5)) *6,
    

    I didn’t use F as it’s shorthand for FALSE, which changed the meaning and the colour of the letter in the script. It can be used as a column name, but I missed it out for the aesthetics of the script!

         G = as.numeric(substring(NHSNumber,6,6)) *5,
         H = as.numeric(substring(NHSNumber,7,7)) *4,
         I = as.numeric(substring(NHSNumber,8,8)) *3,
         J = as.numeric(substring(NHSNumber,9,9)) *2,
    

    There is possibly a much smarter way of writing this, perhaps a loop. Samuel Ellis’s package nhsnumbergenerator creates a table using sequences:

    checkdigit_weights = data.frame(digit.position=seq(from=1, to = 9, by =1),
                                    weight=seq(from=10, to = 2, by =-1)
    )
    

    Note that he uses = rather than <-.

    I like this code. It’s very concise, but it was just easier writing out the multiplications for each part of the NHS Number: the first number is multiplied by 10, the second by 9, the third by 8 and so on.
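
    For the record, a more compact version of those nine columns could use sapply() over the digit positions – a sketch of the idea rather than what I actually ran:

    # multiply digits 1 to 9 by weights 10 down to 2, then sum across each NHS Number
    weights <- 10:2
    digit_products <- sapply(1:9, function(i) as.numeric(substring(df$NHSNumber, i, i)) * weights[i])
    Total <- rowSums(digit_products)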

         End = as.numeric(substring(NHSNumber,10,10)),
    

    I put this in because the final (tenth) number is the check digit (also called the checksum), which gets compared against the computed Digit column later.

         Total = A+B+C+D+E+G+H+I+J,
    

    Just adding all the multiplied columns together.

         Remainder = Total %% 11,
    

    This gives the remainder from a division by 11.

         Digit = 11- Remainder,
    

    11 take away the remainder gives the expected check digit.

         Summary = case_when(Digit == 10 ~ 999,
                             Digit == 11 ~ 0,
                             TRUE ~ Digit),
    

    Lots of people use ifelse() in R, but I like case_when() because it’s like SQL’s CASE WHEN. The lines run in logical order:

    • when the Digit is 10 it’s invalid, so I put in 999 as that’s so high (in the hundreds) it could never match the final check digit (which is a single digit),
    • if it’s 11, change it to 0, following the methodology,
    • else just use the Digit number.

    Actually, thinking about it, I probably don’t need the ‘10 becomes 999’ step, as 10 could never match a single-digit number anyway. Perhaps that’s redundant code.

         Valid = case_when(Summary == End & length == 10 ~ TRUE,
                           TRUE ~ FALSE
                           )) %>% 
    

    case_when() again, but this time to get the TRUE/FALSE validity. If the number generated is the same as the last digit of the NHS Number AND the length of the NHS Number is 10, then TRUE, else FALSE.

    I liked this line of code as it felt a bit strange saying TRUE then FALSE, but it’s logical!

      filter(Valid == FALSE)
    

    Just bring back what isn’t valid.

    Did it work?

    I think it did. If I got anything wrong I’d love to get feedback, but I ran many hundreds of thousands of NHS Numbers through it and found…

    No invalid numbers

    I possibly should have checked beforehand, but I suspect our clinical systems don’t allow any incorrect NHS Numbers to be entered at source. Still, it was fun, and the check could be applied to manually entered data kept in spreadsheets, for example.

    An interesting blog

    As there were no SQL scripts on GitHub for NHS Number validation, I did a quick search on the internet and found this: https://healthanalyst.wordpress.com/2011/08/21/nhs-number-validation/, which is a blog by @HealthAnalystUK from 2011. The reason I’m referring to it here is that HealthAnalystUK not only shared SQL code that looked very similar to the code here, but also R code, written as a function:

    NHSvalidation <- function(NHSnumber){
    
    NHSlength<-length(NHSnumber)
    
    A<-as.numeric(substr(NHSnumber,1,1))
    B<-as.numeric(substr(NHSnumber,2,2))
    C<-as.numeric(substr(NHSnumber,3,3))
    D<-as.numeric(substr(NHSnumber,4,4))
    E<-as.numeric(substr(NHSnumber,5,5))
    F<-as.numeric(substr(NHSnumber,6,6))
    G<-as.numeric(substr(NHSnumber,7,7))
    H<-as.numeric(substr(NHSnumber,8,8))
    I<-as.numeric(substr(NHSnumber,9,9))
    J<-as.numeric(substr(NHSnumber,10,10))
    
    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))
    
    {UniformNumberCheck<-1}
    
    else
    
    {UniformNumberCheck<-0}
    
    Modulus<-((A*10)+(B*9)+(C*8)+(D*7)+(E*6)+(F*5)+(G*4)+(H*3)+(I*2))
    Modulus<-(11-(Modulus%%11))
    
    if (
    
    ((Modulus==J) & (UniformNumberCheck!=1) & (NHSlength==10))|((Modulus==11) & (J==0) & (UniformNumberCheck!=1) & (NHSlength==10)))
    {ReturnValue<-1}
    else
    {ReturnValue<-0}
    
    return(ReturnValue)
    }
    

    I hadn’t coded the check for repeating numbers:

    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))

    and I couldn’t find any reference to this in the Wikipedia page or the document from the University of Bristol, so I’m unsure whether this is part of the methodology. If it is, then I’ve seen at least one NHS Number that would fail this test.
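
    If it is part of the methodology, a vectorised version of that repeated-digit check would slot into the dplyr approach easily enough – a sketch, using a regular expression that flags numbers where all ten digits are identical:

    # TRUE where the NHS Number is ten identical digits, e.g. 6666666666
    df %>% 
      mutate(repeated_digits = grepl("^(\\d)\\1{9}$", as.character(NHSNumber), perl = TRUE))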

    A conclusion

    If anyone has created a package or script for NHS number checks and wants to share please feel free to write a blog. NHS-R Community also has a GitHub repository at https://github.com/nhs-r-community where code like this blog can go (I wrote this in Rmarkdown).

    Blogs can be emailed to nhs.rcommunity@nhs.net and get checked by a group of enthusiast volunteers for publishing.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • How R changed me as an analyst

    I suspect there are many blogs about R and why it’s so great:

    • It’s free!
    • It’s open source!
    • There’s a great community!
    • It’s reproducible!

    You can certainly read on Twitter (#rstats) about what R can do for you, but what about what R does to you, particularly as an analyst in the NHS or social care?

    The learning bit…

    Back in April 2018, when NHS-R Community was running its first Introduction to R course in Leeds, my only knowledge of R had come from a free online base R course with edX that I hadn’t really understood and never finished. Online learning for me is like being back at school: I get the right answer, feel good, move on and promptly forget it all. After the NHS-R course I dabbled with the dplyr package, tried to run lots of other people’s scripts and generally failed a lot. It was a frustrating time of slow delivery of tasks and bafflement. When things did work, I had no idea why, and I often wondered what all the fuss was about, particularly as I could do the same things in familiar programmes.

    Hindsight is a wonderful thing and I can now see my frustrations weren’t just one lump of confusion, but could be split down into how I used these ‘familiar programmes’, namely:

    • SQL for data engineering and
    • Excel for visualisations.

    Although I still used (and use) SQL to get my data, I was copying it to Excel and then loading it into R; once loaded, I’d then realise I needed to group by and count, or remove something I didn’t need, and it seemed too long-winded to go back to SQL, copy to Excel and load it all over again.

    The second frustration of visualising in R came with the inability to replicate the look of the Excel charts in R: getting the same colours, the same font size headers and so on. I’ve yet to resolve that completely but it was here that I realised the worth of R wasn’t in making it look like Excel, but rather that it could do so much more than Excel. I needed to start thinking about what I should be visualising and how to do it.  

    Sharing methodology

    Over my time in the NHS I have learned to be cautious – protective, even – of data. But that has led to a misguided fear of sharing technical knowledge; it was never a conscious thing, that’s just how it was. R, however, has a reputation for sharing, which has resulted in an expectation of sharing. And that isn’t just within your own team or organisation – it’s so much wider; it can even be the world.

    As an example of why it’s harder to share Excel methodology, I’ve previously built a benchmarking Excel spreadsheet using MATCH and INDEX so that the bar charts automatically coloured the organisation I worked for and ordered the bars in graphs from the greatest to the smallest. It was one of those tasks that took a lot of effort to automate, looked simple when done but was heavily dependent on the data being in just the right cell or it would break.

    Just updating it with a new year’s data would take great care so the thought of writing out the methodology to share never occurred to me. Writing it out would involve describing the positions of data, what the formulae did and how bits all linked. That’s a laborious task which is not necessary if you don’t plan to share – and as there was no need to share, I didn’t.

    It’s all about the data

    The entire process, from SQL to Excel, is about the data: how it joins, what it’s counting, and so on. Getting the data ‘just so’ often requires so many clever solutions to so many problems that, as I now realise, it consumes most of the available thinking time, leaving little energy for considering why I am doing this – is it the best thing to do, and how can I get more meaning from the data?

    If I’d picked up someone else’s script or Excel document on ordering benchmarking data, perhaps the time I spend would be on improving it instead of building it. In a perfect world, I would feed back on what I’d done and share any improvements or corrections.

    But what can R do that SQL and Excel can’t?

    As a very simple example, consider creating a median average.

    In Excel it’s reasonably easy to use the formula MEDIAN() but to make it dynamic (such that it automatically updates if you add more data), the formula becomes much more complicated. Here is a page explaining how it’s done:

    https://www.extendoffice.com/documents/excel/2611-excel-average-dynamic-range.html

    Lots of graphics are used to describe how to do it, and you’ll note the page is for AVERAGE, which is the mean rather than the median.

    In SQL, calculating the median can be solved in various ways:

    https://stackoverflow.com/questions/1342898/function-to-calculate-median-in-sql-server

    There are 204 examples to go through to solve this! I didn’t go through them all, as that’s too much needless thinking when R can do this…

    median(table$column)

    Something this simple removes all that otherwise necessary thinking to figure out the best way to get the median… and then having to check that it is right. Although that may be easy enough in Excel, I know I make mistakes and will have to repeat the exercise more times than I care to admit, and doing so in Excel involves so many steps that each require checking. All of this uses up that precious resource of focused thinking. With R doing the job so quickly and reliably, I now have time to consider whether the median is actually the right thing to use, particularly if it’s a case of “we’ve always done that”. Then I can ponder what it is telling me; is it increasing or decreasing, and why is that? Is there any other data that could help explain this change?
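
    And that one line scales up just as painlessly to a dynamic, grouped median with dplyr (the data frame and column names here are made up purely for illustration):

    library(dplyr)
    
    # median cost per team, recalculated automatically whenever new rows are added
    df %>% 
      group_by(team) %>% 
      summarise(median_cost = median(cost, na.rm = TRUE))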

    Like any great piece of analysis or coding, time invested at the beginning pays off. With the benchmarking example of reordered bar charts, spending a couple of days getting it right made something interactive that was engaging for the users. But Excel continues to require thinking and time checking whereas R doesn’t; once it’s coded, that’s it. And that’s where the analyst should be, that space after the work is done. That’s where you can think about the very analysis itself; was it enough? What does it tell you? What else can you look at? Although I’ve had analyst in my job title for a long time, this is what analysis is all about and it’s something, I now realise, that I’ve not been doing because my “thinking time” has been used up elsewhere.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • Exact Matching in R

    Exact matching in R: a case study of the health and care costs associated with living alone, using linked data from the London Borough of Barking & Dagenham

    I’ve been working with a group of analysts in East London who are interested in joined-up health and social care data. They’ve created a powerful, unique dataset that shows how each resident of the London Borough of Barking & Dagenham interacts with NHS and council services. The possibilities are enormous, both in terms of understanding the way that people use services across a whole system, and for more general public health research.

    You can download a sample of the data here:

    Today we’re interested in whether social isolation is related to healthcare costs, and we’re going to use exact matching to explore this issue. Our theory is that people who live alone have higher healthcare costs because they have less support from family members.

    Here’s an extract from the data. ‘Total cost’ means the cost of health and social care in one year. The actual dataset has many more variables!

    ID | Age group | Sex    | Number of long-term conditions | Lives alone | Total cost
    1  | 50-64     | Male   | 0                              | No          | £93
    2  | 65-74     | Female | 1                              | Yes         | £0
    3  | 50-64     | Male   | 4                              | No          | £1065
    4  | 85+       | Female | 5                              | Yes         | £7210

    For the purposes of this blog, we’ll use a sample of 5000 (out of 114,000) individuals, with some changes to the values to ensure anonymity.

    Descriptive stats

    The first thing we’ll look at is whether people who live alone are different. I find descriptive tables fiddly so often use a function that can be re-used across a number of variables:

    # first, save the data provided above and change your working directory to the place you've saved it with setwd()
    # read data
    library(data.table) # useful for efficient reading and summarising of data
    d <- fread("sample_lbbd_dataset.csv")
     
    # describe age, sex and number of long-term conditions
    describe.by.isolation <- function(variable) {
      a <- table(unlist(d[, variable, with = F]), d$livesalone)
      p <- round(prop.table(a, 2) * 100, 0)
      matrix(paste0(a, ' (', p, ')'), ncol = 2,
             dimnames = list(row.names(a), c('Live with others', 'Lives alone')))
    }
    lapply(c('age_grp', 'sex', 'LTC_grp'), describe.by.isolation)
     
    # describe healthcare costs
    d[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
     
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(d$cost[d$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(d$cost[!d$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    People who live alone are older, more likely to be female and have more long-term health problems. Their mean healthcare costs are £2,347 higher. The difference in healthcare costs is visible on the histograms, which we have displayed on a log scale because some values are extremely high. There’s some ‘zero inflation’ in both groups – people with no healthcare costs who do not fit into the otherwise lognormal-ish distribution. So far not too surprising – but are the increased healthcare costs explained by the differences in age, sex and health?

    Regression

    One approach would be to use regression. We could fit a linear model – this is actually not a great fit for patients with higher healthcare costs, but we won’t go into that here.

    linear_model <- lm(cost ~ livesalone + age_grp + sex + LTC_grp, d)
    plot(linear_model) # diagnostics show that a linear model is not a great fit. You might have to press return to see all the plots before you can continue.
    summary(linear_model)

    The results suggest that health and care costs for people who live alone are £892 more than those who do not live alone, on average. Clearly the variables we added to the model are important confounders, as this is much lower than the ‘crude’ difference of £2,347. Two limitations of this approach are that we will have to think quite carefully about the fit of the model to the data, and that we can’t describe how living alone affects the distribution of costs.

    Exact matching

    We therefore used exact matching, in which each individual who lives alone is matched to an individual who does not live alone, based on specified variables such as age group and sex. We developed a function that does this by stratifying the data and then matching ‘cases’ and ‘controls’ randomly. Unmatched individuals are deleted from the dataset, leaving a dataset that is balanced in terms of the variables you specify. Let me know here if there’s any other functionality you want and we can try and incorporate it.

    Let’s try matching on age group, sex and the grouped number of long-term conditions:

    # data = dataset containing:
    # - treatment/exposure variable 'treat' (a string specifying variable name).
    # - matching variable 'mvar' (a string specifying variable name). If you want to match on multiple variables, concatenate them first.
    # other inputs are:
    # - ratio of cases:controls (an integer > 0)
    # - seed for fixing random selection of cases/controls (an integer; default NULL means no seed). Choice of seed is arbitrary.
    # returns data.table of matched observations, with additional variable 'id' for use in paired/grouped analyses
     
    smatch <- function (data, treat, mvar, ratio = 1, seed = NULL) {
     setnames(data, mvar, '.mvar')
     targ <- data[, .(case = sum(get(treat)), control = sum(!get(treat))), .mvar]
     targ[, cst := floor(pmin(control / ratio, case))]
     targ[, cnt := cst * ratio]
     targ <- targ[cst > 0]
     l2 <- cumsum(targ$cst)
     ids <- mapply(':', c(0, l2[-nrow(targ)]), l2-1)
     names(ids) <- targ$.mvar
     case <- NULL
     control <- NULL
     x <- .Random.seed
     set.seed(seed)
     on.exit({.Random.seed <- x})
     for(i in targ$.mvar) {
       case[[i]] <- data[get(treat) == T & .mvar == i][sample(.N, targ$cst[targ$.mvar == i])]
       case[[i]][, id := ids[[i]]]
       control[[i]] <- data[get(treat) == F & .mvar == i][sample(.N, targ$cnt[targ$.mvar == i])]
       control[[i]][, id := rep(ids[[i]], each = ratio)]
      }
     rbindlist(c(case, control))
    }
     
    # create a single variable summarising matching variables
    d$mvar <- do.call('paste0', d[, c('age_grp', 'sex', 'LTC_grp')])
     
    # create 1:1 matched dataset.
    matched_data <- smatch(d, treat = 'livesalone', mvar = 'mvar', ratio = 1, seed = 74)
     
    # check balance: same number of individuals in each group
    dcast(matched_data, age_grp + sex + LTC_grp ~ livesalone, value.var = 'id', fun.aggregate = length)

    Now we have a dataset that is balanced in terms of age, sex and the count of long-term conditions. Let’s see how healthcare costs compare:

    matched_data[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
     
    # histograms of cost
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(matched_data$cost[matched_data$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(matched_data$cost[!matched_data$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    # t-test (in reality you might want a paired test, and to check whether a t-test is appropriate)
    t.test(cost ~ livesalone, matched_data) # notice how wide the confidence intervals are for this reduced dataset

    # proportion with costs over £10000
    matched_data[, .(over10k = sum(cost > 10000) / .N), livesalone]

    The mean difference is £803. When we used the whole dataset, this value was even closer to the coefficient from linear regression. It’s now difficult to see a difference in the histograms, but you can easily create any description of the distribution that you like – e.g. the proportion of patients that have costs over £10,000.

    Who got matched and who didn’t?

    The point of matching was to create a comparison group of people who don’t live alone who were in some ways similar to the group who do. We probably had to delete lots of people who don’t live alone in a systematic way (e.g. men and younger people who do not live alone were more likely to be deleted). We might also have deleted some of the group who do live alone, which could be more problematic if we want to generalise our results to the population. Let’s see who got deleted…

     d[, matched := ifelse(PID %in% matched_data$PID, 'matched', 'unmatched')]
     
    # just looking at age group for now
    compare_matched <- dcast(d, age_grp ~ livesalone + matched, value.var = 'PID')
    compare_matched[, 'TRUE_total' := TRUE_matched + TRUE_unmatched]
    compare_matched[, lapply(.SD, function(x) x/sum(x) * 100), .SDcols = 2:6]

    You can see that some of the ‘lives alone’ group got deleted (‘TRUE_unmatched’), and they were all in the older age groups. The difference between everyone who lives alone (‘TRUE_total’) and the matched group (‘TRUE_matched’) is diluted, because the oldest groups are a relatively small part of the data. Nonetheless, I would say this is a fairly important difference. If you are concerned about generalisability to the population, you might want to restrict the analysis to people aged under 85. In the full dataset this was not a problem (as there were lots more potential ‘controls’), but you might encounter similar issues if you match on more detailed variables.

    Final point! We’ve focused on a technical solution to exact matching. We haven’t really thought about which variables we should be matching on. This is a matter of judgement, and needs a bit of thinking before we dive into the matching process. Just like covariates in regression, the matching variables are confounders – they’re alternative explanations for an association between living alone and healthcare costs. Age and sex are clearly important. Health status is more complex. Do you want to think about long-term conditions as something that cause people to live alone, or the product of living alone? If people are sicker because they live alone, matching on the number of long term conditions might falsely reduce the association between living alone and healthcare costs.

    This blog was written by Dan Lewer, NIHR Doctoral Research Fellow / SpR Public Health. Department of Epidemiology & Public Health, UCL

    Twitter: danlewer

  • Moving on with the plan…And using R to do it! My personal experience…

    When Mohammed asked if I would be interested in doing a blog on how “we”, Information and Analytics in NHS Improvement, have been using R / R Studio I was a little apprehensive as I have never ‘blogged’ before, but I thought I would give it a try!

    So here goes —- my first ever blog!

    Many of us will remember Garry Fothergill’s engaging lightning talk at the NHS-R Conference back in October last year, ‘So what’s your plan?’. Garry gave us a synopsis of how NHS Improvement had been using RStudio to support the annual process (torture to some) of activity planning for 2018/19. The original concept, ignited by Paul Stroner, arose from a central frustration with the ‘flat line’, unrealistic activity plans of the past. I am sure some of us have been guilty of that – I know I have on occasion in the past!

    Since the original piece of R work, the team have been looking at further developments to the approach that Garry and Paul had set the wheels in motion on, with particular reference to how it could be used to support the 2019/20 planning round more formally. Back in mid-October it was agreed that both NHS Improvement and NHS England would use the univariate modelling approach that Paul and Garry had been championing.

    As part of this process the R code was reviewed and rewritten, with some changes to methodologies in terms of validation processes (out-of-sample testing), applying adjustments for working and calendar days, as well as the models applied. The final R code was tested and QA’d by some of our esteemed NHS-R Community colleagues, and the overall approach was signed off by NHS Improvement’s Community of Practice for Statistics.

    As part of our offer to support CCGs and acute providers, a specific organisational-level R code was developed (the code we used centrally pulled in all 148 acute providers for over 15 activity lines, based on day-weighted and unweighted historical data, so you can imagine the number of models). The R code has been widely shared with organisations on request, posted on the Future NHS site, and is also available on our newly created GitHub account.

    I personally can’t take credit for this R code; we are lucky in the Information and Analytics team to have a colleague with an extensive R programming background – if I could physically plug into his R brain I would! It is this expertise (and Chris Beeley’s workshop at the NHS-R event) that has opened my eyes to the art of the possible with R Shiny apps. This has led us down the path of designing and creating an R Shiny app that allows organisations to replicate, over the internet, the centrally provided forecast figures within the activity planning submission template. The tool can be used for any monthly data; all you need to do is make sure your upload data is structured correctly, and there is user help functionality included with the app – just click on the link below.

    https://nhsiadvancedanalytics.shinyapps.io/Forecasting_Support/

    I’m only at the start of my R journey, but I can already see the benefits of using it daily to support all aspects of my reignited analytical life, so I’m excited about what the future holds! It’s a positive sign and a step in the right direction when these software programmes are being talked about by our NHS leaders, but what I am most enthused about is the will and the want to work collaboratively and share learning on all things R, without judgement, across organisations both internal and external to the NHS. So I’m fully signed up to spreading the R message by being an active participant in local R working groups, presenting on R possibilities at different networking events, and working as hard as I can on improving my own R skills. Watch this space – I may even take the leap and do some more blogs about my R journey!

    This blog was written by Vicki Cruze, Principal Analyst in the Performance Analysis Team at NHS England and NHS Improvement.

  • Using R to track NHS winter pressures

    Every Thursday during winter, roughly from December to March, NHS Digital releases a week’s worth of hospital performance data, known as the Daily Situation Reports. This data often receives media attention because cold weather and seasonal illnesses can lead to higher demand for hospital care, meaning that services might be under increased pressure. At the Health Foundation, one of our aims is to provide new insight into the quality of health care through in-depth analysis of policy and data. So, to understand more about the current demand for hospital care and the challenges the NHS is facing, we keep a close eye on the latest seasonal trends.

    Keeping on top of NHS winter indicators has the potential to keep us analysts busy. The raw data is published in a fairly complex spreadsheet, requires a decent amount of cleaning and needs to be reanalysed after every release. In addition to monitoring national figures, this winter our team at the Health Foundation also wanted to see if there might be any variation between different areas of England. Sustainability and transformation partnerships (STPs) are areas where health and care leaders develop shared proposals for local services. Therefore, we enriched the raw data with information about where hospitals are located, and which STP they belong to. But with a similar analytical approach, more fine-grained local structures (such as Clinical Commissioning Groups) could be used.

    For a more efficient and reproducible way of tracking NHS winter indicators this year, we moved our whole analysis pipeline to R. We then used the clean data for visualisations in R and other applications, like Tableau. This blog takes you through our analysis workflow and describes how we got through some tricky steps. The complete R script is also available on GitHub if you want to give it a go yourself. You can also read a blog on the Health Foundation’s website to find out why we looked at local areas this year and what we found.

    Why write this blog on data analysis?

    Analysts at many other local and national organisations are interested in NHS winter performance data. In order for this blog to be a good resource for them, we plan to:

    • share our own analytical approach and R code
    • illustrate how and why we made certain analytical decisions and
    • discuss what we learned along the way, both about the data and R.

    We hope that this blog will inspire others to do the same, and to share their code too. Here, we won’t try to interpret any regional differences in winter performance. We know that there are several factors involved, so we will leave this up to others with more local knowledge.

    What does this blog cover?

      1. R packages we used
      2. Data download from NHS Digital
      3. Data import from R-unfriendly spreadsheets: how we tackled multi-line headers with merged cells
      4. Data cleaning: how we defined and dealt with missing data
      5. Feature engineering: how we added organisational information on sustainability and transformation partnerships (STPs)
      6. Aggregation: monthly averages within local STPs
      7. Visualisation: how we mapped STP-level data in R
      8. What have we learned?


    1. R packages we used

    The tidyverse collection of R packages is a useful set of tools for data analysis and visualisation that are designed to work together. It contains the ggplot2 package, which we use for visualisations. (Need help getting started with R and the tidyverse? Try the R for Data Science website).

    We use a function from the readxl package to import Excel files and the lubridate package, which makes working with dates a lot easier. Both are part of the tidyverse, but not loaded automatically along with the tidyverse package.

    The geojsonio package is one option that helps us import geographic data structures, such as STP boundaries. The broom and maptools packages are then needed to prepare these for visualisation with ggplot.

    https://gist.github.com/fiona-grimm/8b45642fc4ca8c92bf9294dbcd6be08c

    Note: if you would like to run this code in your own R session and you haven’t installed these packages already, you will need to do so (once) using the install.packages() function.
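
    For reference, the package setup described above amounts to something like the following (a minimal sketch; the exact calls are in the gist above):

    # Packages described above: tidyverse for data wrangling and ggplot2 plotting,
    # readxl and lubridate for Excel files and dates, and geojsonio, broom and
    # maptools for the STP boundary data used in the mapping section.
    library(tidyverse)
    library(readxl)
    library(lubridate)
    library(geojsonio)
    library(broom)
    library(maptools)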

    2.  Data download from NHS Digital

    After setting up the tools, we downloaded Winter Daily SitRep 2018–19 Data from the NHS Digital website. Rather than opting for the spreadsheets that contain one week of data each, we went for the more convenient time series file, which contains all data from this winter up to the latest release. One drawback is that the name of this file changes weekly as new data is added (so, should you try to use this code at a later date, you will probably have to adapt the file name).

    https://gist.github.com/fiona-grimm/101ef292cd56a92cc8222bb8a0fa69ae
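
    As a rough sketch of this step (the URL below is a placeholder rather than the real link, because the file name changes with every release):

    # Placeholder URL: substitute the current time series file listed on the
    # NHS Digital Winter Daily SitRep page before running.
    sitrep_url  <- "<URL of the current Winter SitRep time series file>"
    sitrep_file <- "winter_sitrep_timeseries.xlsx"
    
    # mode = "wb" ensures the Excel file downloads correctly on Windows
    download.file(sitrep_url, destfile = sitrep_file, mode = "wb")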


    3. Data import from R-unfriendly spreadsheets

    How we tackled multi-line headers with merged cells

    Once we opened the file in Excel, we saw a number of sheets containing different indicators. With a few exceptions, most have a similar table structure. As we were interested in patient flow through hospitals, we focused on ‘General and acute beds’ and ‘Beds occupied by long-stay patients’ for now.

    What the sheets with these indicators had in common was that there was metadata in the first 13 lines followed by a two-line header. Several columns containing variables (second header line) were grouped within dates (first header line) and the cells around the dates were merged. There were also some empty columns and rows, which we addressed later on.

    Example of the Excel layout of Winter SitRep data to be imported into R. Source: NHS Digital.

    Unfortunately, this was not yet a tidy table, as the dates that served as column headers were values, not variable names. All this made importing the file into R slightly more challenging. We fixed this by creating a custom import_sitrep() function that would:

    1. read and store the data and the two lines of the header separately,
    2. combine the two lines of the header,
    3. add the combined header to the data and tidy up column names by removing special characters,
    4. and finally convert the table into a tidy table ready for the next step.

    But wait, there was one more tricky bit. What happened when we tried to read in the headers one by one?

    https://gist.github.com/fiona-grimm/756b7b827895f58579c810c252e6c384

    *Sigh*… they ended up not having the same length. The first header line (containing the dates) was 8 elements shorter. Looking at its left and right side (see above) gave us a hint as to why this might have happened:

    • In the Excel sheet, the first few cells in this line were empty and when the line was read in, they were converted to NA. The read_xlsx() function then discarded these empty columns (probably) because they were at the beginning.
    • There were also some merged cells. During import they were separated and, if empty, converted to NA. Empty columns at the end of the header line also seem to be discarded by read_xlsx().

    So, we needed to find a way to preserve the length of the first header line in our import_sitrep() function. This is how we solved it:

    https://gist.github.com/fiona-grimm/f49eecefbe57faddfc0b2abefa9c543e
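
    As a rough illustration of the idea (this is not the actual import_sitrep() code, which is in the gists; the sheet name and row numbers are assumptions based on the layout described above): reading both header rows in a single call keeps them the same width, after which the merged date cells can be filled to the right and pasted onto the variable names.

    library(readxl)
    
    # Read rows 14-15 (the two header lines) together so both have the same width.
    # col_types = "text" keeps everything as character; dates may come through as
    # Excel serial numbers and need converting afterwards.
    headers <- read_xlsx(sitrep_file, sheet = "G&A beds",
                         range = cell_rows(14:15),
                         col_names = FALSE, col_types = "text")
    
    dates <- as.character(unlist(headers[1, ]))
    vars  <- as.character(unlist(headers[2, ]))
    
    # Fill each date across the columns that were merged in Excel
    for (i in seq_along(dates)[-1]) {
      if (is.na(dates[i])) dates[i] <- dates[i - 1]
    }
    
    # Combine the two header lines into single column names
    combined_header <- ifelse(is.na(dates), vars, paste(dates, vars, sep = "_"))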

    Now that we had our import function, we were ready to read and combine the sheets containing the winter indicators ‘General and acute beds’ and ‘Beds occupied by long-stay patients’.

    https://gist.github.com/fiona-grimm/b5e444590367aec6ed968a8663138973

    The data was now tidy, as each variable formed a column and each observation formed a row. Note that the observations for England in the table were followed by values for individual hospital trusts.

    4.  Data cleaning

    Trust exclusions

    We excluded the three children’s hospitals when calculating aggregate measures, such as the average bed occupancy within STPs. Our reasoning was that their patient profiles would be different from other acute trusts and this might skew the averages. Nevertheless, we kept track of them at a trust level.

    This applies to Birmingham Women’s and Children’s NHS Foundation Trust (code RQ3), Alder Hey Children’s NHS Foundation Trust (RBS) and Sheffield Children’s NHS Foundation Trust (RCU).

    https://gist.github.com/fiona-grimm?page=2
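
    A minimal sketch of this step (org_code and the winter_data object are assumed names; the actual code is in the gists):

    library(dplyr)
    
    # The three children's hospitals named above
    childrens_trusts <- c("RQ3", "RBS", "RCU")
    
    # Flag them so they can be dropped from STP-level averages
    # while still being reported at trust level
    winter_data <- winter_data %>%
      mutate(is_childrens_trust = org_code %in% childrens_trusts)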

    How we approached defining ‘missingness’: when is a zero not a zero?

    Data collection and validation errors do happen, so finding suspicious data points is an important step during data cleaning.

    While this is easy if a value is missing (or NA), it’s much harder to decide whether a zero truly represents ‘zero events’ or a missing value (in fact, it could even be both within the same data set). To distinguish between the two, at the Health Foundation we came up with the following criteria:

    • How likely is a ‘zero event’ for an indicator? For example, when counting beds in a large hospital the likelihood of having zero beds open seems small, but when counting long-stay patients having none seems possible.
    • How consistent is the zero value, in that trust, over time? Or in plain English: does the value jump from a higher number to zero (and back), or does it hover somewhere close to zero?

    The next two sections describe how we found and dealt with these missing values.

    Finding longer periods of missing data

    If any hospital trust had missing values, in any indicator, on 4 or more consecutive days during the reporting period, it was excluded from the analysis. We were only looking for these periods in variables where we would not expect any zeros (the list is shown as cols_to_check).

    Why this particular cut-off? We wanted to aggregate the data and calculating weekly averages did not seem justified if we were missing more than half of a week for any trust.

    Here’s how we summed up how many consecutive days were zero or NA within each trust/variable combination:

    https://gist.github.com/fiona-grimm/9d4ca6567b605118544dd24b709a177f

    https://gist.github.com/fiona-grimm/d436454c392ea9bf78ae3ee9083310d4
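
    In outline, the longest run of zero or NA days per trust and variable can be counted with base R’s rle(). This is a sketch under assumed column names (org_code, date, and the indicator columns listed in cols_to_check), not the code from the gists:

    library(dplyr)
    library(tidyr)
    
    longest_gaps <- winter_data %>%
      # reshape so each indicator we want to check becomes one row per trust and day
      pivot_longer(all_of(cols_to_check), names_to = "variable", values_to = "value") %>%
      group_by(org_code, variable) %>%
      arrange(date, .by_group = TRUE) %>%
      summarise(max_gap_days = {
        is_gap <- is.na(value) | value == 0
        runs   <- rle(is_gap)
        if (any(runs$values)) max(runs$lengths[runs$values]) else 0L
      }, .groups = "drop")
    
    # Trusts with 4 or more consecutive missing/zero days in any indicator
    trusts_to_exclude <- longest_gaps %>%
      filter(max_gap_days >= 4) %>%
      distinct(org_code)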

    When we filtered for 4 or more consecutive days, we found that:

    • The trust with the code RTD reported zero long-stay patients (of any length of stay) for the whole reporting period to date, which seemed unrealistic for a general and acute hospital.
    • Trust RQW had a gap of 7–8 days that coincided across the indicators shown (we checked this separately in the raw data).
    • Trust RAX reported zero long-stay patients (of any length of stay) for 6 days during January, but reported a high number before and after.

    Based on this, all variables from the trusts RTD, RQW and RAX were excluded from the analysis of this year’s (2018/19) winter data. This left us with 128 out of 134 trusts.

    It’s worth noting that with this data-driven approach different trusts might be excluded each year and the number of excluded trusts could change over the winter period as new ‘gaps’ appear. Keep this in mind when making comparisons, both throughout the winter period and between years.

    https://gist.github.com/fiona-grimm/ed0e2c42d2fe8cbbb72b3070f99444f0

    Dealing with shorter data gaps

    Next, we checked how many missing or zero values were left:

    https://gist.github.com/fiona-grimm/e89a26740095479cea7b0dac783048d7

    Most of the remaining gaps (42 out of 54) consisted of only a single day and they were mostly found in variables relating to long-stay patients. To judge whether these looked like real ‘zero events’ or were more likely to be reporting errors, we had a closer look at the data:

    https://gist.github.com/fiona-grimm/8e37c125cb239b2ce00281acedbeaf9c

    Before cleaning: plots showing the subset of trusts reporting ‘0’ in any (non-derived) variable.

    Based on the data before and after the zeroes, these were unlikely to be true values. It would have been possible to impute these gaps in some way, for example by taking the mean of the day before and the day after. Instead, we took the approach that required the fewest assumptions and simply replaced the gaps with NA:

    https://gist.github.com/fiona-grimm/70f002e5b2d169bf629fc8463a059946

    After cleaning: plots showing the same trusts after replacing zeros with NA.
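
    The replacement step itself boils down to something like this sketch (column names assumed; following the reasoning above, every remaining zero in the checked, non-derived columns is treated as missing):

    library(dplyr)
    
    winter_data <- winter_data %>%
      # turn implausible zeros into explicit missing values
      mutate(across(all_of(cols_to_check), ~ if_else(.x == 0, NA_real_, .x)))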

    Validating derived variables

    Some variables present in the data set were derived from others: for example, total.beds.open should be the sum of core.beds.open and escalation.beds.open.

    We could have discarded derived variables straightaway and then computed them ourselves, in order to be completely sure about how they have been derived and what they mean. Since we were curious about their quality, we first checked if total.beds.open and occupancy.rate had been calculated as expected so we could decide whether or not to replace them (spoiler: yes, we did).

    https://gist.github.com/fiona-grimm/d7be2ac9b3f83596f48733a65f525063
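
    A sketch of that kind of check, using the column names mentioned above and assuming the occupancy rate is occupied beds divided by open beds:

    library(dplyr)
    
    derived_check <- winter_data %>%
      mutate(total_open_check = core.beds.open + escalation.beds.open,
             occupancy_check  = total.beds.occd / total_open_check) %>%
      summarise(total_open_ok = all(near(total.beds.open, total_open_check), na.rm = TRUE),
                occupancy_ok  = all(near(occupancy.rate, occupancy_check), na.rm = TRUE))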

    Similarly, if it had been the focus of our analysis, we would also have rederived national aggregates for England.

    5.  Feature engineering

    Adding organisational information on sustainability and transformation partnerships (STPs)

    As we wanted to compare indicators between local areas, we decided to calculate averages of winter indicators over hospital trusts within each STP. To do this, we needed to add variables to the raw data indicating which hospital trust belonged to which STP. Unfortunately, we were not aware of this information being available in a format convenient for data analysis.

    So, we manually created and validated a lookup table to map hospital trusts to STPs, using information related to the 2017/18 formation of 44 STPs from NHS England. While some STPs have since changed (for example, three STPs in the north of England merged in August 2018), this was the latest and most comprehensive information available, as far as we are aware.

    The allocation of most hospital trusts to STPs was straightforward using this list, but there were a few instances where we had to choose:

    • If a trust did not appear in any STP list, it was matched according to the location of its main acute site. This was the case for four trusts in Northumberland, Tyne and Wear and North Durham STP.
    • If a trust was mentioned in more than one STP plan, it was allocated according to the location of its main acute site. This applied to both Chesterfield Royal Hospital NHS Foundation Trust and Epsom And St Helier University Hospitals NHS Trust.

    We think this is a reasonable approach when analysing winter indicators, which mainly come from acute trusts, but we would be keen to hear your feedback.

    Once we had this lookup table, we imported it into R and merged it with the winter indicators:

    https://gist.github.com/fiona-grimm/b7fd87ba3c8ecbc1236c48c6ef8b19a0
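
    The merge itself is essentially a left join (a sketch; the lookup file name and column names are assumptions):

    library(dplyr)
    library(readr)
    
    # One row per trust: org_code, stp_code, stp_name
    trust_stp_lookup <- read_csv("trust_stp_lookup.csv")
    
    winter_data <- winter_data %>%
      left_join(trust_stp_lookup, by = "org_code")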

    The lookup table is also available on Github. Please note that STPs change and develop over time, so if you would like to use it, it’s worth checking that the information is up to date.

    Making read-outs comparable between trusts: from raw counts to rates

    As hospital trusts come in different sizes and serve different numbers of patients, raw patient counts are not suitable for comparisons between trusts or regions. Percentages or fractions, such as bed occupancy rates, are more comparable.

    Therefore, we derived the fraction of occupied beds that were occupied by long-stay patients (patients staying longer than 7, 14 or 21 days):

    https://gist.github.com/fiona-grimm/e984a41eb48058ef4ca5254412f8aa7e
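
    In outline (a sketch with assumed column names for the long-stay counts and occupied beds):

    library(dplyr)
    
    winter_data <- winter_data %>%
      mutate(frac_stay_over_7  = beds.occ.7.plus  / total.beds.occd,
             frac_stay_over_14 = beds.occ.14.plus / total.beds.occd,
             frac_stay_over_21 = beds.occ.21.plus / total.beds.occd)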

    6.  Aggregation: monthly averages by STP

    In some cases, it might be useful to visualise daily changes, but weekly or even monthly aggregates have the advantage of being less noisy and free of weekday-weekend variation, and they can be more useful for monitoring longer-term trends.

    First, we created a new column that contained the corresponding month. The month was then used as grouping variable, along with the trust or STP code, to calculate monthly averages of bed occupancy and long-stay patient rates.

    For weekly averages, an alternative would have been to create a new column containing the date of the respective beginning of the week using the cut() function (also shown below).

    We know it’s also good practice to keep track of the number of valid observations (as in, not NA) that we average over within each group used. In this instance, for trust-level aggregates, this represented the number of days within a week. For STP-level aggregates, it corresponded to the number of trusts within the STP.

    https://gist.github.com/fiona-grimm/fed7f70f891f9a017dd7bb097e6f2173
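
    A sketch of the monthly STP-level aggregation (column names assumed), including the count of valid observations and the cut()-based alternative for weekly grouping mentioned above:

    library(dplyr)
    library(lubridate)
    
    monthly_by_stp <- winter_data %>%
      mutate(month = floor_date(date, unit = "month")) %>%
      group_by(stp_code, month) %>%
      summarise(mean_occupancy = mean(occupancy.rate, na.rm = TRUE),
                n_valid        = sum(!is.na(occupancy.rate)),
                .groups = "drop")
    
    # Alternative for weekly averages: assign each date to the start of its week
    # winter_data$week_start <- as.Date(cut(winter_data$date, breaks = "week"))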

    At this point, we also saved the tidy and aggregated data as a CSV file for visualisation in other programs, such as Tableau.

    7.  Visualisation: how we mapped STP-level data in R

    There are endless ways to creatively visualise aspects of this data (the R Graph Gallery is a great place to get inspired). We wanted to plot a map of STP boundaries and colour-shade them according to the average bed occupancy in each winter month.

    STP boundaries are available as a GeoJSON file from the Office for National Statistics (ONS). We downloaded and imported the digital vector file and then created a plot to get a first idea of what was in the file:

    https://gist.github.com/fiona-grimm/db0b883148b7f95bc3546cf2721e83df

    Boundaries of Sustainability and Transformation Partnerships in England (2017). Data source: ONS.

    Before spatial objects can be used with ggplot, they have to be converted to a data frame. This can be done using the tidy() function from the broom package. To add our data to the map, we then merged the resulting data frame with the winter data.

    The cut() function provided a convenient way to divide the variable bed.occupancy into meaningful intervals and to add labels that could be displayed on the map. Converting variables into factors and ordering the factor levels using factor() ensured that the intervals and months were in the right order. We were then ready to plot and save the map:

    https://gist.github.com/fiona-grimm/f4c7437b18d73c73c4e7ad16bbae479e
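
    In outline, the mapping step could look something like the sketch below (not the gist code: the boundary file name, the stp17cd attribute, the join columns and the occupancy bands are all assumptions, and it builds on the monthly_by_stp table from the aggregation sketch above):

    library(geojsonio)
    library(broom)
    library(dplyr)
    library(ggplot2)
    
    # Read the ONS boundary file as a spatial object
    stp_shapes <- geojson_read("stp_boundaries_2017.geojson", what = "sp")
    
    # Convert the spatial object to a data frame that ggplot can use
    stp_df <- tidy(stp_shapes, region = "stp17cd")
    
    # Attach the monthly averages and bin occupancy (assumed to be a fraction)
    # into labelled intervals
    map_df <- stp_df %>%
      left_join(monthly_by_stp, by = c("id" = "stp_code")) %>%
      mutate(occupancy_band = cut(mean_occupancy,
                                  breaks = c(0, 0.85, 0.90, 0.95, 1),
                                  labels = c("85% or less", "85-90%", "90-95%", "over 95%")))
    
    ggplot(map_df, aes(x = long, y = lat, group = group, fill = occupancy_band)) +
      geom_polygon(colour = "white") +
      coord_equal() +
      facet_wrap(~ month) +
      theme_void() +
      labs(fill = "Average bed occupancy")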

    How could we have improved this map further? One option might have been to interactively display STP averages when hovering over the map using the plotly package, for example.

    8.  What have we learned?

    Lots! No, seriously… Using this workflow, rather than copy and paste, has saved us a lot of time this winter. But beyond that, creating (then revisiting and improving) this workflow turned out to be a great way to work together and to make our analysis more transparent, more reproducible and easier to understand.

    Key points we are taking away:

    • With trusts, CCGs, STPs, ICSs and more to choose from when we’re looking at local variation within health care systems, it can be challenging to pick the right organisational level. Local health care structures are also changing and evolving rapidly. We need to do more thinking about which level is most informative and to compare different options; suggestions are welcome.
    • Some cleaning is a must. Winter data has a quick turnaround and NHS Digital only perform minimal validation. However, compared to previous years, the 2018/19 winter data quality has markedly improved (you can use our code to check the improvement for yourself).
    • But cleaning decisions impact interpretability. If the list of excluded trusts changes between years (based on how well they are reporting in any given year), this makes the data less comparable year-on-year.
    • We’ll play it safe with derived variables. The effort of rederiving them ourselves was worth it for the peace of mind we got from knowing that they were calculated consistently and meant exactly what we thought they meant.
    • Next winter, our future selves will be very happy to have our analysis nicely documented with code that is ready to go. It should even be easy to adapt the same workflow for other open NHS data sets (we’ll save this for another time).

    This blog was written by Fiona Grimm, Senior Data Analyst in the Data Analytics team at the Health Foundation. @fiona_grimm

  • A simple function to create nice correlation plots

    The problem

    I was working with a dataset where I wanted to assess the correlation of different variables in R. As much as I like R, the outputs from the console window leave something to be desired in terms of data visualisation. Therefore, I wanted a way to visualise these correlations in a nicer, cleaner, crisper way. The solution is to use a correlation plot.

    Loading the correlation plot package

    The package I used for creating my correlation plots was the corrplot package. It can be installed and loaded into the R workspace using the syntax below:

    #install.packages("corrplot")
    library(corrplot)

    At this point I would encourage you to check out help for the corrplot function, as it allows you to pass a multitude of parameters to the function.

    Deconstructing the function

    As mentioned previously, this plotting function has a multitude of uses, but all the parameters can be off-putting to a newbie! This was me 6 years ago, vigorously typing ‘how to do this with R relating to x’ and bombarding stackoverflow and other useful websites with questions. Shout out to RBloggers as well!

    The function I have created uses the functionality of the corrplot package, but it simplifies the inputs. I will include the function in stages to explain each step; however, if you just want to use the function and are not bothered with the underpinnings, then skip the following section.

    Step 1 – Function parameters

    Parameters of the function are as below:

    create_gh_style_corrplot <- function(df_numeric_vals,
                                         method_corrplot,
                                         colour_min,
                                         colour_middle,
                                         colour_max="green") {

    The parameters to pass to the function are:

    1. df_numeric_vals: a data frame of numeric values only, so any categorical (factor) data needs to be stripped out before passing the data frame to the function.
    2. method_corrplot: a numeric value from 1 to 5 that selects the plot type, so for a shaded correlation plot you would use 1. Further examples of the various options are discussed when I describe how the if statement works.
    3. colour_min: the colour used at the negative end of the gradient scale, i.e. for strong negative correlations. An example of an input here would be “green”.
    4. colour_middle: the middle-range colour; normally I set this equal to (=) “white”.
    5. colour_max: the colour of the strong positive correlations.

    For information on the strength of correlations, refer to this simple guide.

    Step 2 – Creating the conditional (IF statement) to select correlation plot type

    The conditional statement below uses the function’s numeric input (1 to 5) to select the type of chart to display:

    library(corrplot)
    
    if(method_corrplot == 1 ){
      type_var <- "shade"
      method_corrplot = type_var 
    }
    else if (method_corrplot ==2) {
      type_var <- "number"
      method_corrplot = type_var
    }
    else if (method_corrplot ==3) {
      type_var <- "pie"
      method_corrplot = type_var
    }
    else if (method_corrplot ==4) {
      type_var <- "ellipse"
      method_corrplot = type_var
    }
    else if (method_corrplot ==5) {
      type_var <- "circle"
      method_corrplot = type_var
    }
    else{
      type_var <- "shade"
      method_corrplot <- type_var
    }

    What does this do then? Firstly, nested in the function, I make sure that the corrplot library is loaded so that the correlation plot functionality can be used. The next series of steps then repeats this pattern:

    • Basically, this says that if the method_corrplot parameter of the function equals 1, 2, 3 and so on, then the relevant type of correlation plot is selected.
    • type_var is a variable that is set equal to the string stated. These strings link directly back to the method parameter of the corrplot function, as I know a correlation plot type can be “shade”, “number”, etc.
    • Finally, the last step converts method_corrplot to the textual type specified in the preceding bullet.

    In essence, the numeric value passed to the parameter (e.g. 1) sets type_var to a text string that corrplot is expecting, and method_corrplot is then set equal to that type variable. This effectively turns the integer value passed to the parameter into a string/character output.
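
    As an aside (this is not part of the original function), the same mapping could be written more compactly with a lookup vector:

    # Equivalent, more compact mapping from the numeric input to a corrplot
    # method string, defaulting to "shade" for anything outside 1-5
    corrplot_methods <- c("shade", "number", "pie", "ellipse", "circle")
    method_corrplot  <- if (method_corrplot %in% 1:5) corrplot_methods[method_corrplot] else "shade"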

    Step 3 – Hacking the corrplot function

    As specified in the previous sections, the corrplot function has a lot of inputs and is in need of simplifying, so that is exactly what I have tried to do. The corrplot call is the last step in my simpler function, which takes corrplot’s many parameters and boils them down to just 5 inputs:

    corrplot(cor(df_numeric_vals, use = 'all.obs'), method = method_corrplot,
             order = "AOE",
             addCoef.col = 'black',
             number.cex = 0.5,
             tl.cex = 0.6,
             tl.col = 'black',
             col= colorRampPalette(c(colour_min, colour_middle, colour_max))(200),
             cl.cex = 0.3)
    }

    Let’s explain this function.

    So, the corrplot function is the main driver for this, and the nested cor() call is just as important, as this is the command that creates the correlation matrix. The settings are: the df_numeric_vals data frame is the data passed to cor(); use = 'all.obs' tells it to use all observations in the data frame; and method = method_corrplot uses the if statement created in step 2 to select the relevant chart type from the numeric input. The order = "AOE" argument applies the angular ordering method, addCoef.col = 'black' makes the coefficient values always display in black, and tl.col = 'black' does the same for the labels. Finally, the colours of the correlation plot come from the colorRampPalette function, which builds a gradient scale from the three colour parameters explained in step 1.

    The full function is detailed here:

    create_gh_style_corrplot <- function(df_numeric_vals,
                                         method_corrplot,
                                         colour_min,
                                         colour_middle,
                                         colour_max="green") {
     
      library(corrplot)
     
      if(method_corrplot == 1 ){
        type_var <- "shade"
        method_corrplot = type_var 
      }
      else if (method_corrplot ==2) {
        type_var <- "number"
        method_corrplot = type_var
      }
      else if (method_corrplot ==3) {
        type_var <- "pie"
        method_corrplot = type_var
      }
      else if (method_corrplot ==4) {
        type_var <- "ellipse"
        method_corrplot = type_var
      }
      else if (method_corrplot ==5) {
        type_var <- "circle"
        method_corrplot = type_var
      }
      else{
        type_var <- "shade"
        method_corrplot <- type_var
      }
    
     
    corrplot(cor(df_numeric_vals, use = 'all.obs'), method = method_corrplot,
             order = "AOE",
             addCoef.col = 'black',
             number.cex = 0.5,
             tl.cex = 0.6,
             tl.col = 'black',
             col= colorRampPalette(c(colour_min, colour_middle, colour_max))(200),
             cl.cex = 0.3)
    }

    If you want to use the function, just copy and paste this code into an R script file and this will create the function for you. Please remember to install the corrplot package first using install.packages("corrplot").

    Utilising the function

    The example dataset I will use here is the mpg sample dataset provided by ggplot2. Run the R script provided towards the end of the last section first, as this will create the function in R’s environment. Next, add this code to the end to look at the various iterations and charts that can be created from the data:

    ##------------------CREATE DATASET---------------------------------------
    
    library(ggplot2) # provides the mpg example dataset
    
    # Keep only the numeric variables in the data frame to use with my function
    numeric_df <- data.frame(mpg[c(3,5,8,9)])
    
    ##------------------USE FUNCTION-----------------------------------------
    
    create_gh_style_corrplot(numeric_df,1, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,2, "steelblue2","black", "black")
    create_gh_style_corrplot(numeric_df,3, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,4, "steelblue2","white", "whitesmoke")
    create_gh_style_corrplot(numeric_df,5, "steelblue2","white", "whitesmoke")

    The outputs of the charts depend on the correlation plot type selected (1 to 5) and the colour ranges chosen. You can choose any colours; I would recommend using the command colours() in the R console or a script to pull up the list of colours native to R.

    How about these visualisations:
    
    Each plot can be tailored to suit your needs. I tend to like blue shades, but go all out and choose whatever colours you like. The R source code is accessible here.

    I do hope you will use this function to maximise your correlation plots – it’s all about relationships!

    This blog was written by Gary Hutson, Principal Analyst, Activity & Access Team, Information & Insight at Nottingham University Hospitals NHS Trust, and was originally posted here.

  • Local Public Health joins the paRty

    Having recently attended an R meetup in Birmingham and heard of various user groups and hackathons that take place around certain technologies, I was getting the feeling that there was an increasing desire in the public health community to learn more about R and modern data science tools and techniques. I wondered whether there would be interest in a data science and R user group for the West Midlands public health intelligence community. I thought I’d raise the idea at the West Midlands Public Health Intelligence Group (WMPHIG) when I attended the quarterly meeting, but another attendee beat me to it, which confirmed there was some interest. I volunteered to arrange the first user group meeting and Public Health England (PHE) kindly offered assistance.

    Between us we set up a date and venue for the first meeting at the PHE offices in Birmingham, and I was pleased to hear from Nicola at PHE that “…Tickets are selling like hotcakes!”

    Not knowing exactly how the group would best work, we suggested a loose structure for the meeting with the following discussion points:

    • How this group should work
    • Assess current levels of knowledge/experience
    • R training requirements
    • R learning methods
    • Public health R use examples (including Fingertips R) & wider use examples
    • What could be done with R / What else do people want to do with R
    • Challenges and issues people have experienced/are experiencing
    • Possible joint projects that might benefit all members

    We ended up staying reasonably on topic, but there was plenty of useful and engaging discussion around data science and R. There was a nice mix of novice and more advanced R users (though no one admitted to being an expert 😉) in the group. Many of those who were more advanced had fairly recently been novice users themselves. Whilst the more advanced users were able to share their experiences of their learning journeys, others were able to contribute on how we might develop the use of data science and R in public health intelligence. I was also impressed with some of the examples of R use that were shared with the group by analysts who have only been using it for a relatively short time. A key point shared was that, though R may seem a bit daunting at first, it’s worth jumping in and getting your analytical hands dirty!

    A number of attendees had also managed to attend the NHS-R Community Conference and shared positive experiences of the day and the knowledge they’d picked up.

    Everyone appeared to agree that R and other modern data science tools/methods can offer a lot to public health intelligence.  There also appeared to be a desire to work together and help each other out on this learning journey. With that spirit in mind, we have agreed to share code and other useful information on K-Hub (https://khub.net/) and another meeting is going to be arranged for next quarter.

    Thanks to all that attended and contributed and to PHE for helping with the organisation.

    This blog was written by Andy Evans, Senior Officer at Birmingham City Council.

  • NHS-R Conference: was it worth it?

    This blog was written by Zoe Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Foundation Trust.

    Soon after I started in this role I was offered several free places to a conference on Big Data. It was aimed at Healthcare and with Big Data in the title it sounded very exciting. Sadly, like many things around ‘Big Data’ it was baffling. There were a few good, high-level, presentations but it turned out the conference was funded by private companies and they had bought the opportunity to showcase their wares. I didn’t learn anything practical and I would be hard pressed to recall anything from the event now. All in all, it was a disappointment.

    Time passed and I tentatively started messing around with R. Not being someone who can learn well from books and videos (I learn better from someone talking me through what they’ve written), I went to a couple of free R training events, one of which was run by NHS-R. At that training there was talk of an NHS-R conference and I couldn’t help but get that familiar tingle of excitement. Yet, in the back of my head now was the nagging feeling that it might be like that Big Data conference. After all, these things are not easy to set up, I was still overwhelmed by what R could do and still just learning and, well, how good could it really be?

    Optimism won out, along with the fact that the conference was free and a few colleagues were going so I could get a lift. I had nothing to lose. I didn’t even have to build a case to go for my managers let alone myself. At worst I could expect a day out and a free coffee, maybe a biscuit. It was settled. I would even take my own lunch.

    It’s been a fair number of weeks since the event, which is a pretty good time to evaluate something like this. I always find that there is a buzz after any conference or meeting. The buzz can of course be as much negative as positive energy, so it’s best to wait a few days before reviewing, I think. Plus, I’m not that used to writing down my thoughts. After all of my doubts then, did it live up to my vague and overly excitable expectations? Oh yes, it did. In buckets.

    I’ve worked with some intelligent and creative people in the NHS and it turns out these weren’t isolated pockets of great people doing great work. The NHS, and other organisations close to it, are chock full of these people. We had speakers from Public Health, universities, foundations, private consultancies, Acute and Mental Health Trusts. In fact, I know it’s called NHS-R but it’s not an exclusive club and nor should it be.

    We were shown finished pieces of work in R as well as the code itself which I really enjoyed. I’m an analyst and I like to see the logic and the code as that’s where I work – in the detail. Best of all though was the networking. I met a number of people I knew either from courses or from their blogs/R packages. Unlike the ‘Big Data’ conference where I had to track someone down on LinkedIn to ask about one of the presentations, and sadly heard nothing, I have contacted some of the speakers through Twitter and email and had conversations. There is a sense that everyone is in this together and that competitive edge you sometimes get in IT is notably missing.

    The other great thing about this conference was that it was free and that should not be underestimated. Being free, it attracted those non-management, just starting out in R, not even statisticians and only did maths to GCSE level people like me. The NHS, like many other public-sector organisations, is feeling the pinch and more than ever we need support and training for the hidden side of the NHS – its data people. And it’s those people I hope you will see at the next NHS-R conference showing their achievements and sharing their enthusiasm, because, whilst this was a training/conference event, really it was the coming together of a community.

     

  • NHS open source public datasets – creating realistic synthetic datasets

    This blog is an initial attempt to garner interest in a project to create NHS synthetic datasets in a range of fields and also to understand the underlying principles around creating synthetic healthcare data and its intricacies.

    Healthcare data is increasingly electronic. With the huge datasets that the NHS is in the process of collecting for patient care comes an impetus to standardise data collection between hospitals and trusts. This is good for patient care not least because it allows for standardised analysis of large datasets. This standardisation is in progress across many of the medical specialities but several have already reached a mature stage. A good example is the National Endoscopy Database which aims to automatically collect data from all endoscopic procedures in the UK according to a standardised template. This will allow for the analysis of variation in endoscopic performance, quality and outcomes amongst many other outputs.

    The analysis of these datasets will need to be validated, reproducible and, of course, progressive as the desired metrics change. The analyses will therefore require two things:

    1. Ongoing input from analysts to maintain the methodology and code base to perform the analysis.
    2. Creative ideas for the representation of the datasets.

    NHS datasets, and the methods for their manipulation, are likely to attract a lot of interest from diverse sources, from pharmaceutical companies to healthcare software developers and academic researchers. By restricting access to the datasets because of privacy issues we also, by necessity, restrict the speed at which solutions can be found using these datasets, as analysis will only be carried out by a small pool of authorised analysts.

    The obvious solution here is to create NHS datasets which are constructed according to accepted and used data templates, but to populate them with synthetic data and then to allow open source access to these datasets.

    Synthetic data is not always easy to create. The vast majority of NHS electronic data is still semi-structured free text. Reports also have to make sense internally so that, for example, an endoscopy report describing a stomach ulcer has to contain text that is relevant to the ulcer finding. It gets even more complex when further reports are written that reference a report in another dataset. An example is the histopathology report from a biopsy taken from the stomach ulcer. This biopsy report will obviously have to be reporting on the stomach ulcer and the text will be about pathology findings relevant to the description of an ulcer.

    An example of an attempt at creating a synthetic medical dataset can be found here using the above example (https://github.com/sebastiz/FakeEndoReports). This contains some description of how the reports are created and I have tried to derive some principles regarding how to make fake healthcare datasets in general based on this example.
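
    As a toy illustration of that internal-consistency principle (this is not code from the FakeEndoReports project), the sampled finding can drive both the endoscopy text and the linked histopathology text:

    # Toy sketch: keep a synthetic endoscopy report and its linked histopathology
    # report consistent by deriving both from the same sampled finding.
    set.seed(42)
    findings <- sample(c("gastric ulcer", "gastritis"), size = 10, replace = TRUE)
    
    endoscopy_report <- paste("OGD performed. Finding:", findings)
    histology_report <- ifelse(findings == "gastric ulcer",
                               "Gastric biopsy: ulcer edge sampled, no dysplasia seen.",
                               "Gastric biopsy: mild chronic gastritis.")
    
    synthetic_reports <- data.frame(endoscopy = endoscopy_report,
                                    histology = histology_report)
    head(synthetic_reports)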

    There are of course many other datasets that would be incredibly powerful if they were created and open sourced. One such dataset is the NHS patient administration system, on which most statistics about waiting times and patient pathways are based. Another is the Hospital Episode Statistics (HES), which collects information regarding all NHS appointments (in-patient and outpatient) and is being used to create data linkage between a wide range of data repositories. (My attempt at creating synthetic HES data can be found here but is still incomplete at the moment: https://github.com/sebastiz/HesMineR)

    R is the perfect language to create such synthetic datasets and it would be a valuable addition to the NHS-R armamentarium to have a package that contained synthetic NHS datasets so that open source solutions can be more quickly and creatively derived.