• We need you: share your COVID-19 work on NHSE/I’s regular mini-huddle series

    COVID-19 has placed the value of data and analytics at the fore. This has meant an unprecedented focus on the work of health and care analysts, but also huge demand for local analytical services. The need to share and share alike is paramount.  To support this, NHSE/I are holding a series of huddles and mini-huddles to allow analysts to share their COVID-19 work with a wider community.

    Weekly huddles showcase work on a specific topic, and the back catalogue of recordings can be found here. The mini-huddles are a spin off series where speakers get into the detail behind a single topic, like geo-mapping or a dashboard. The team curating these huddles are keen to have a mini-huddle on COVID-19 analysis in R. If you are interested in sharing your analysis relating to COVID-19 using R, please get in touch with Sophie Hodges via email sophie.hodges5@nhs.net

     If you haven’t already, please sign up to the COVID-19 data and analytics workspace here. The community is currently 6500-strong with members across the NHS, public health, local authorities, charities and commercial organisations. Mini-huddles usually draw an audience between 30 and 80 attendees so it’s a great opportunity to share your work, discuss findings and network with others working on similar issues.

    Please get in touch if this sounds of interest. And a huge thank you from NHSE/I. The work being done by the analytical community is a key plank of the COVID-19 effort.

  • phsmethods: an R package for Public Health Scotland

    by Jack Hannah


    On Wednesday 1st April 2020, Public Health Scotland (PHS) came into being. It was formed as a result of a three-way merger between NHS Health Scotland and the Information Services Division (ISD) and Health Protection Scotland (HPS) sections of Public Health and Intelligence (PHI) (which was in itself a strategic business unit of NHS National Services Scotland (NSS)). It’s fewer acronyms to remember at least.

    The Transforming Publishing Programme (TPP) was devised in 2017 in an attempt to modernise the way in which ISD produced and presented its statistics. Traditionally, work within ISD had been undertaken using proprietary software such as Excel and SPSS, with output presented predominantly in the form of PDF reports. It required a great deal of manual formatting, caused an even greater deal of frustration for analysts with programming skills, and resulted in more copy and paste errors than anyone would care to admit.

    Over time, ISD gradually (and at times begrudgingly) came to embrace open source software – predominantly R and, to a lesser extent, Python. TPP were at the forefront of much of this: creating style guides; working with producers of official and national statistics to convert PDF reports into Shiny applications; producing R packages and Reproducible Analytical Pipelines; and introducing version control, among other things. Now, with the move to PHS complete, TPP’s purview has broadened: not only to further the adoption of R, but to ensure it’s adopted in a consistent manner by the hundreds of analysts PHS employs.

    Introducing phsmethods

    Analysts working across a multitude of teams within PHS have to perform many of the same, repetitive tasks, not all of which are catered for by existing R packages: assigning dates to financial years in YYYY/YY format (e.g. 2016/17); formatting improperly recorded postcodes; returning quarters in plain English (e.g January to March 2020 or Jan-Mar 2020) rather than the slightly restrictive formats offered by lubridate::quarter() and zoo::yearqtr(). The list is endless, and every analyst has their own workaround, which quickly becomes a problem. Time is wasted by multiple people devising an alternative way of doing the same thing – and how can anyone be sure that everyone’s method actually does the same thing?

    The phsmethods package was conceived to address (some of) these concerns. At the time of writing, it contains eleven functions to facilitate some of the more laborious tasks analysts regularly face, with more in the works. None deal with any statistical methodology, nor are they likely to provoke any controversy or consternation over their methods of implementation; they are simple functions designed to make routine data manipulation easier.

    phsmethods isn’t on CRAN. Perhaps it will be at some point, but there hasn’t been any need to make it available more widely thus far. phsmethods does, however, come with many of the features one would expect from a CRAN package: function documentationunit testscontinuous integrationcode coverage; and releases. No hex sticker yet, but maybe one day.

    Using phsmethods

    Practical examples featuring all of phsmethods’ functions are available in the package’s README and no one will make it to the end of this blogpost if they’re all regurgitated here. But hopefully one demonstration is okay. Consider the following fictitious dataset containing postcodes:

    df <- tibble::tribble(
      ~patient_id, ~postcode,
      1,           "G26QE",
      2,           "KA8 9NB",
      3,           "PA152TY ",
      4,           "G 4 2 9 B A",
      5,           "g207al",
      6,           "Dg98bS",
      7,           "DD37J    y",
      8,           "make",
      9,           "tiny",
      10,          "changes"
    ); df
    ## # A tibble: 10 x 2
    ##    patient_id postcode     
    ##         <dbl> <chr>        
    ##  1          1 "G26QE"      
    ##  2          2 "KA8 9NB"    
    ##  3          3 "PA152TY "   
    ##  4          4 "G 4 2 9 B A"
    ##  5          5 "g207al"     
    ##  6          6 "Dg98bS"     
    ##  7          7 "DD37J    y" 
    ##  8          8 "make"       
    ##  9          9 "tiny"       
    ## 10         10 "changes"

    This is a problem that analysts across the NHS and beyond face regularly: a dataset containing postcodes arrives, but the postcodes are recorded inconsistently. Some have spaces; some don’t. Some are all upper case; some are all lower case; some are a combination of both. And some aren’t real postcodes. Often this dataset is to be joined with a larger postcode directory file to obtain some other piece of information (such as a measure of deprivation). Joining these datasets tends not to be a trivial task; sometimes a fuzzyjoin may suffice, but often it requires an arduous process of formatting the postcode variable in one dataset (or both datasets) until they look the same, before combining using a dplyr join.

    In PHS, postcodes typically follow one of two formats: pc7 format (all values have a string length of seven, with zero, one or two spaces before the last three digits as necessary); or pc8 format (all values have one space before the last three digits, regardless of the total number of digits). phsmethods::postcode() is designed to format any valid postcode, regardless of how it was originally recorded, into either of these formats. Consider the earlier dataset:

    df %>%
      dplyr::mutate(postcode = phsmethods::postcode(postcode, format = "pc8"))
    ## # A tibble: 10 x 2
    ##    patient_id postcode
    ##         <dbl> <chr>   
    ##  1          1 G2 6QE  
    ##  2          2 KA8 9NB 
    ##  3          3 PA15 2TY
    ##  4          4 G42 9BA 
    ##  5          5 G20 7AL 
    ##  6          6 DG9 8BS 
    ##  7          7 DD3 7JY 
    ##  8          8 <NA>    
    ##  9          9 <NA>    
    ## 10         10 <NA>
    df %>%
      # The format is pc7 by default
      dplyr::mutate(postcode = phsmethods::postcode(postcode)) %>%
      dplyr::pull(postcode) %>%
    ##  [1]  7  7  7  7  7  7  7 NA NA NA

    In hindsight, maybe it should have been called something other than postcode() to avoid the whole postcode = postcode(postcode) bit, but everyone makes mistakes when it comes to naming functions.

    phsmethods::postcode() isn’t designed to check whether a postcode actually exists; it just checks whether the input provided is a sequence of letters and numbers in the right order and of the right length to theoretically be one and, if so, formats it appropriately. If not, it returns an NA. Handily, phsmethods::postcode() comes with some warning messages to explain how many values were recorded as NA, why they were recorded as NA, and what happens to lowercase letters:

    warnings <- testthat::capture_warnings(df %>% dplyr::mutate(postcode = phsmethods::postcode(postcode)))
    ## 3 non-NA input values do not adhere to the standard UK postcode format (with or without spaces) and will be coded as NA. The standard format is:
    ## • 1 or 2 letters, followed by
    ## • 1 number, followed by
    ## • 1 optional letter or number, followed by
    ## • 1 number, followed by
    ## • 2 letters
    ## Lower case letters in any input value(s) adhering to the standard UK postcode format will be converted to upper case

    Next steps

    The seven functions which comprised the 0.1.0 release of phsmethods were all written by members of TPP. However, the package is not intended to be a vanity exercise for the team. Those initial functions were designed to get the package off the ground, but now contributions are open to be made by anyone in PHS. With that in mind, the 0.2.0 release contains four functions written by PHS analysts not part of TPP.

    Unlike many R packages, phsmethods will, at any given time, have two or three maintainers, each with an equitable say in the direction the package takes. When one maintainer moves on or gets fed up, someone else will be sourced from the pool of analysts, and the show will go on. In theory at least.

    There are moderately extensive guidelines for anyone who wishes to contribute to phsmethods to follow. Proposed contributions should be submitted as GitHub issues for the maintainers to consider and discuss. If approved, the contributor makes a branch and submits a pull request for one or more of the maintainers to review. The hope is that, by creating an issue prior to making a contribution, no one will devote time and energy to something which can’t be accepted, and that no duplication of effort will occur in the case of multiple people having the same idea.

    It’s not expected that everyone who wishes to contribute to phsmethods will understand, or have experience of, all of the nuances of package development. Many won’t have used version control; most won’t have written unit tests or used continuous integration before. That’s okay though. The only requirement for someone to contribute code to phsmethods is that they should know how to write an R function. The package maintainers are there to help them learn the rest.

    Hopefully phsmethods will provide a safe incubator for analysts who wish to develop their skills in writing, documenting and testing R functions prior to making their own open source contributions elsewhere. And hopefully the knock-on effect of more analysts developing these skills will be improved coding standards permeating throughout PHS. If not, it’ll at least look good having a package pinned on my GitHub profile.

  • Exact Matching in R

    Exact matching in R: a case study of the health and care costs associated with living alone, using linked data from the London Borough of Barking & Dagenham

    I’ve been working with a group of analysts in East London who are interested in joined-up health and social care data. They’ve created a powerful, unique dataset that shows how each resident of the London Borough of Barking & Dagenham interacts with NHS and council services. The possibilities are enormous, both in terms of understanding the way that people use services across a whole system, and for more general public health research.

    You can download a sample of the data here:

    Today we’re interested in whether social isolation is related to healthcare costs, and we’re going to use exact matching to explore this issue. Our theory is that people who live alone have higher healthcare costs because they have less support from family members.

    Here’s an extract from the data. ‘Total cost’ means the cost of health and social care in one year. The actual dataset has many more variables!

    ID Age groupSex Number of long-term conditionsLives aloneTotal cost
    1 50-64 Male 0 No £93
    2 65-74 Female 1 Yes £0
    3 50-64 Male 4 No £1065
    4 85+ Female 5 Yes £7210

    For the purposes of this blog, we’ll use a sample of 5000 (out of 114,000) individuals, with some changes to the values to ensure anonymity.

    Descriptive stats

    The first thing we’ll look at is whether people who live alone are different. I find descriptive tables fiddly so often use a function that can be re-used across a number of variables:

    # first, save the data provided above and change your working directory to the place you've saved it with setwd()
    # read data
    library(data.table) # useful for efficient reading and summarising of data
    d <- fread("sample_lbbd_dataset.csv")
    # describe age, sex and number of long-term conditions
    describe.by.isolation <- function(variable) {
    a <- table(unlist(d[, variable, with = F]), d$livesalone)
     p <- round(prop.table(a, 2) * 100, 0)
     matrix(paste0(a, ' (', p, ')'), ncol = 2, dimnames = list(row.names(a), c('Live with others', 'Lives alone')))
    lapply(c('age_grp', 'sex', 'LTC_grp'), describe.by.isolation)
    # describe healthcare costs
    d[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(d$cost[d$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(d$cost[!d$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    People who live alone are older, more likely to be female and have more long-term health problems. Their mean healthcare costs are £2,347 higher. The difference in healthcare costs is visible on the histograms, which we have displayed on a log scale because some values are extremely high. There’s some ‘zero inflation’ in both groups – people with no healthcare costs who do not fit into the otherwise lognormal-ish distribution. So far not too surprising – but are the increased healthcare costs explained by the differences in age, sex and health?


    One approach would be to use regression. We could fit a linear model – this is actually not a great fit for patients with higher healthcare costs, but we won’t go into that here.

    linear_model <- lm(cost ~ livesalone + age_grp + sex + LTC_grp, d)
    plot(linear_model) # diagnostics show that a linear model is not a great fit. You might have to press return to see all the plots before you can continue.

    The results suggest that health and care costs for people who live alone are £892 more than those who do not live alone, on average. Clearly the variables we added to the model are important confounders, as this is much lower than the ‘crude’ difference of £2,347. Two limitations of this approach are that we will have to think quite carefully about the fit of the model to the data, and that we can’t describe how living alone affects the distribution of costs.

    Exact matching

    We therefore used exact matching, in which each individual who lives alone is matched to an individual who does not live alone, based on specified variables such as age group and sex. We developed a function that does this by stratifying the data and then matching ‘cases’ and ‘controls’ randomly. Unmatched individuals are deleted from the dataset, leaving a dataset that is balanced in terms of the variables you specify. Let me know here if there’s any other functionality you want and we can try and incorporate it.

    Let’s try matching on age group, sex and the grouped number of long-term conditions:

    # data = dataset containing:
    # - treatment/exposure variable 'mvar' (a string specifying variable name).
    # - matching variable 'mvar' (a string specifying variable name). If you want to match on multiple variables, concatenate them first.
    # other inputs are:
    # - ratio of cases:controls (an integer > 0)
    # - seed for fixing random selection of cases/controls (an integer; default NULL means no seed). Choice of seed is arbitrary.
    # returns data.table of matched observations, with additional variable 'id' for use in paired/grouped analyses
    smatch <- function (data, treat, mvar, ratio = 1, seed = NULL) {
     setnames(data, mvar, '.mvar')
     targ <- data[, .(case = sum(get(treat)), control = sum(!get(treat))), .mvar]
     targ[, cst := floor(pmin(control / ratio, case))]
     targ[, cnt := cst * ratio]
     targ <- targ[cst > 0]
     l2 <- cumsum(targ$cst)
     ids <- mapply(':', c(0, l2[-nrow(targ)]), l2-1)
     names(ids) <- targ$.mvar
     case <- NULL
     control <- NULL
     x <- .Random.seed
     on.exit({.Random.seed <- x})
     for(i in targ$.mvar) {
       case[[i]] <- data[get(treat) == T & .mvar == i][sample(.N, targ$cst[targ$.mvar == i])]
       case[[i]][, id := ids[[i]]]
       control[[i]] <- data[get(treat) == F & .mvar == i][sample(.N, targ$cnt[targ$.mvar == i])]
       control[[i]][, id := rep(ids[[i]], each = ratio)]
     rbindlist(c(case, control))
    # create a single variable summarising matching variables
    d$mvar <- do.call('paste0', d[, c('age_grp', 'sex', 'LTC_grp')])
    # create 1:1 matched dataset.
    matched_data <- smatch(d, treat = 'livesalone', mvar = 'mvar', ratio = 1, seed = 74)
    # check balance: same number of individuals in each group
    dcast(matched_data, age_grp + sex + LTC_grp ~ livesalone, value.var = 'id', fun.aggregate = length)

    Now we have a dataset that is balanced in terms of age, sex and the count of long-term conditions. Let’s see how healthcare costs compare:

    matched_data[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
    # histograms of cost
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(matched_data$cost[d$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(matched_data$cost[!d$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    # t-test (in reality you might want a paired test, and to check whether a t-test is appropriate)
    t.test(cost ~ livesalone, matched_data) # notice how wide the confidence intervals are for this reduced dataset

    # proportion with costs over £10000
    matched_data[, .(over10k = sum(cost > 10000) / .N), livesalone]

    The mean difference is £803. When we used the whole dataset, this value was even closer to the coefficient from linear regression. It’s now difficult to see a difference in the histograms, but you can easily create any description of the distribution that you like – e.g. the proportion of patients that have costs over £10,000.

    Who got matched and who didn’t?

    The point of matching was to create a comparison group of people who don’t live alone who were in some ways similar to the group who do. We probably had to delete lots of people who don’t live alone in a systematic way (e.g. men and younger people who do not live alone were more likely to be deleted). We might also have deleted some of the group who do live alone, which could be more problematic if we want to generalise our results to the population. Let’s see who got deleted…

     d[, matched := ifelse(PID %in% matched_data$PID, 'matched', 'unmatched')]
    # just looking at age group for now
    compare_matched <- dcast(d, age_grp ~ livesalone + matched, value.var = 'PID')
    compare_matched[, 'TRUE_total' := TRUE_matched + TRUE_unmatched]
    compare_matched[, lapply(.SD, function(x) x/sum(x) * 100), .SDcols = 2:6]

    You can see that some of the ‘lives alone’ group got deleted (‘TRUE_unmatched’), and they were all in the older age groups. The difference between everyone who lives alone (‘TRUE_total’) and the matched group (‘TRUE_matched’) is diluted, because the oldest groups are a relatively small part of the data. Nonetheless, I would say this is a fairly important difference. If you are concerned about generalisability to the population, you might want to restrict the analysis to people aged under 85. In the full dataset this was not be a problem (as there were lots more potential ‘controls’), but you might encounter similar issues if you match on more detailed variables.

    Final point! We’ve focused on a technical solution to exact matching. We haven’t really thought about which variables we should be matching on. This is a matter of judgement, and needs a bit of thinking before we dive into the matching process. Just like covariates in regression, the matching variables are confounders – they’re alternative explanations for an association between living alone and healthcare costs. Age and sex are clearly important. Health status is more complex. Do you want to think about long-term conditions as something that cause people to live alone, or the product of living alone? If people are sicker because they live alone, matching on the number of long term conditions might falsely reduce the association between living alone and healthcare costs.

    This blog was written by Dan Lewer, NIHR Doctoral Research Fellow / SpR Public Health. Department of Epidemiology & Public Health, UCL

    Twitter: danlewer

  • Local Public Health joins the paRty

    Having recently attended an R meetup in Birmingham, hearing of various user groups and hackathons that take place around certain technologies, I was getting a feeling that there was an increasing desire in the public health community to learn more about R and modern data science tools and techniques. I wondered whether there would be interest in a data science and R user group for the West Midlands public health intelligence community. I thought I’d raise the idea at the West Midlands Public Health Intelligence Group (WMPHIG) when I attended the quarterly meeting, but another attendee beat me to it and doing so confirmed there was some interest. I volunteered to arrange the first user group and Public Health England (PHE) kindly offered assistance.

    Between us we setup a date and venue for the first meeting at the PHE offices in Birmingham and I was pleased to hear from Nicola at PHE that “…Tickets are selling like hotcakes! “

    Not knowing exactly how the group would best work, we suggested a loose structure for the meeting with the following discussion points:

    • How this group should work
    • Assess current levels of knowledge/experience
    • R training requirements
    • R learning methods
    • Public health R use examples (including Fingertips R) & wider use examples
    • What could be done with R / What else do people want to do with R
    • Challenges and issues people have experienced/are experiencing
    • Possible joint projects that might benefit all members

    We ended up staying reasonably on topic, but there was plenty of useful and engaging discussion around the topics of data science and R. The was a nice mix of novice and more advanced R users (though no one admitted to being an expert 😉 ) in the group.  Many of those who were more advanced had fairly recently been novice users. Whilst the more advanced users were able to share their experiences of their learning journeys, others were able to contribute on how we might develop use of data science and R in Public Health Intelligence. I was also impressed with some of the examples of R use that were shared with the group by analysts who have only been using it for a relatively short time. A key point shared was though R may seem a bit daunting at first, its worth jumping in and getting your analytical hands dirty!

    A number of attendees had also managed to attend the NHS-R Community Conference and shared positive experiences of the day and the knowledge they’d picked up.

    Everyone appeared to agree that R and other modern data science tools/methods can offer a lot to public health intelligence.  There also appeared to be a desire to work together and help each other out on this learning journey. With that spirit in mind, we have agreed to share code and other useful information on K-Hub (https://khub.net/) and another meeting is going to be arranged for next quarter.

    Thanks to all that attended and contributed and to PHE for helping with the organisation.

    This blog was written by Andy Evans, Senior Officer at Birmingham City Council.

  • The joy of R

    Hello. My name is Julian and I am an R addict. I got hooked about 3 years ago when I took on a new role in Public Health England developing a public health data science team. My professional background is as a doctor and Consultant in Public Health and have spent the last 15 years in the health intelligence field so I thought I knew something about data and analysis. I realised I didn’t know anything about data science so I decided to do a course and ended up doing the Coursera data science MOOC from Johns Hopkins (https://www.coursera.org/specializations/jhu-data-science) because it was health related. For the course, you need to learn R – and so my habit started. (It turned out I knew nothing about data and analysis as well).

    I had done an R course 15 years ago but never used it. Any analysis I did used spreadsheets, SPSS, Mapinfo and host of other tools, and I had never written a single line of code until 3 years ago (apart some very basic SQL). That’s all changed.

    Apart from a brief obsession with Tableau a few years ago (which I still love), learning R has for me, been utterly transformational. Now my basic analytical workflow is R + Google (for getting answers when you are stuck) + Git (for sharing and storing code) + Mendeley (reference management software). That’s it.

    I barely open Excel except to look at data structure so I can import data into R; I don’t use GIS; I hardly even open word to write a document – I do that in R (like this blog); and recently the option to output to power point has appeared in R Markdown so I’ve started using that as well.

    On top of that I have learned a whole heap of analytical and other skills through using R. I feel comfortable getting and analysing data of any size, shape and complexity including text, websites, APIs, very large datasets; and quickly. I can now rapidly produce scientific reports, scrape websites, mine text, automate analysis, build machine learning pipelines, create high quality graphics using the fab ggplot2 and its relatives, have co-authored a package to read Fingertips data (fingertipsR – very proud of this) and am getting my head around regular expressions. I have even managed a couple of Shiny documents. There is nothing I have wanted to do that I can’t do in R; and a huge range of things I didn’t know you could do or had never heard of.

    So what is it about R that makes it so great? In the last 5 years it has moved from an academic stats package to a professional data science tool. One of the reasons is the development of the tidy data framework [1] and tools to make data wrangling or munging much easier. This is a much overlooked part of the analysts life – all the things you need to do with data before you can analyse it (50 – 70% of the process) has been paid serious attention and made much easier with packages like dplyr and tidyr. And a lot of attention has been made to making coding more logical and syntax more “English”. Another reason is the development of R Studio and R Markdown which give you button press outputs in a range of high quality formats. And there is a focus on reproducibility – the ability for analysis to be repeated exactly which requires combining data, analysis and results in a form others can follow. This is good science and will become much more widespread. You can do this in R and Git.

    My addiction has infected my team and the analytical community in PHE. We are spreading R rapidly and writing packages to automate routine analysis and reporting. We routinely use Gitlab to share and collaborate on code, and are introducing software development ideas like code review and unit testing. In short we are trying to help analysts (if they want to) become analyst-developers.

    There are downsides to R of course. There is a (big) learning curve, ICT get twitchy, there is a huge range of packages and any number of ways of doing things, and things often break. But as any addict would say, these are just obstacles to be overcome and there is a lot of support out there.

    R is not the only direction of travel – we do use PowerBI (running R scripts), and we do a bit of development in Python, but one thing is certain – I can’t go back to pre R days.

    So there’s my confession. I’m a data junkie and an R addict. If you want to see my descent I put stuff on an RPubs page from time to time and I have a Github page. If you want to help me – feel free to get into touch or send me a pull request.

    Thanks to Seb Fox at PHE and David Whiting of Medway Council for inspiration and support.


    1 Wickham H. 2014;59:1–23. doi:10.18637/jss.v059.i10