phsmethods: an R package for Public Health Scotland

    by Jack Hannah


    On Wednesday 1st April 2020, Public Health Scotland (PHS) came into being. It was formed as a result of a three-way merger between NHS Health Scotland and the Information Services Division (ISD) and Health Protection Scotland (HPS) sections of Public Health and Intelligence (PHI) (which was in itself a strategic business unit of NHS National Services Scotland (NSS)). It’s fewer acronyms to remember at least.

    The Transforming Publishing Programme (TPP) was devised in 2017 in an attempt to modernise the way in which ISD produced and presented its statistics. Traditionally, work within ISD had been undertaken using proprietary software such as Excel and SPSS, with output presented predominantly in the form of PDF reports. It required a great deal of manual formatting, caused an even greater deal of frustration for analysts with programming skills, and resulted in more copy and paste errors than anyone would care to admit.

    Over time, ISD gradually (and at times begrudgingly) came to embrace open source software – predominantly R and, to a lesser extent, Python. TPP were at the forefront of much of this: creating style guides; working with producers of official and national statistics to convert PDF reports into Shiny applications; producing R packages and Reproducible Analytical Pipelines; and introducing version control, among other things. Now, with the move to PHS complete, TPP’s purview has broadened: not only to further the adoption of R, but to ensure it’s adopted in a consistent manner by the hundreds of analysts PHS employs.

    Introducing phsmethods

    Analysts working across a multitude of teams within PHS have to perform many of the same repetitive tasks, not all of which are catered for by existing R packages: assigning dates to financial years in YYYY/YY format (e.g. 2016/17); formatting improperly recorded postcodes; returning quarters in plain English (e.g. January to March 2020 or Jan-Mar 2020) rather than the slightly restrictive formats offered by lubridate::quarter() and zoo::yearqtr(). The list is endless, and every analyst has their own workaround, which quickly becomes a problem. Time is wasted by multiple people devising an alternative way of doing the same thing – and how can anyone be sure that everyone’s method actually does the same thing?
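
    To give a flavour of the kind of workaround in question, here is a minimal base R sketch of two of those tasks. The function names to_fin_year() and to_quarter_label() are hypothetical and are not the phsmethods functions; the sketch also assumes the financial year starts on 1 April and that quarters are calendar quarters.

```r
# Hypothetical helpers for illustration only; not part of phsmethods.

# Assign a date to a financial year in YYYY/YY format.
# Assumption: the financial year runs from 1 April to 31 March.
to_fin_year <- function(date) {
  date <- as.Date(date)
  year <- as.integer(format(date, "%Y"))
  month <- as.integer(format(date, "%m"))

  # Dates from January to March belong to the year that started the previous April
  start <- ifelse(month >= 4L, year, year - 1L)

  # sprintf() would turn NA into the string "NA", so restore real NAs afterwards
  ifelse(is.na(start),
         NA_character_,
         sprintf("%d/%02d", start, (start + 1L) %% 100L))
}

# Describe the calendar quarter a date falls in, in plain English.
to_quarter_label <- function(date) {
  date <- as.Date(date)
  month <- as.integer(format(date, "%m"))
  labels <- c("January to March", "April to June",
              "July to September", "October to December")

  # Months 1-3 map to label 1, months 4-6 to label 2, and so on
  paste(labels[(month - 1L) %/% 3L + 1L], format(date, "%Y"))
}

to_fin_year(c("2017-03-31", "2017-04-01"))  # "2016/17" "2017/18"
to_quarter_label("2020-02-14")              # "January to March 2020"
```

    Neither function is difficult to write, which is rather the point: ten analysts will write ten slightly different versions, and not all of them will handle the edge cases in the same way.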

    The phsmethods package was conceived to address (some of) these concerns. At the time of writing, it contains eleven functions to facilitate some of the more laborious tasks analysts regularly face, with more in the works. None deal with any statistical methodology, nor are they likely to provoke any controversy or consternation over their methods of implementation; they are simple functions designed to make routine data manipulation easier.

    phsmethods isn’t on CRAN. Perhaps it will be at some point, but there hasn’t been any need to make it available more widely thus far. phsmethods does, however, come with many of the features one would expect from a CRAN package: function documentation; unit tests; continuous integration; code coverage; and releases. No hex sticker yet, but maybe one day.

    Using phsmethods

    Practical examples featuring all of phsmethods’ functions are available in the package’s README and no one will make it to the end of this blogpost if they’re all regurgitated here. But hopefully one demonstration is okay. Consider the following fictitious dataset containing postcodes:

    df <- tibble::tribble(
      ~patient_id, ~postcode,
      1,           "G26QE",
      2,           "KA8 9NB",
      3,           "PA152TY ",
      4,           "G 4 2 9 B A",
      5,           "g207al",
      6,           "Dg98bS",
      7,           "DD37J    y",
      8,           "make",
      9,           "tiny",
      10,          "changes"
    ); df
    ## # A tibble: 10 x 2
    ##    patient_id postcode     
    ##         <dbl> <chr>        
    ##  1          1 "G26QE"      
    ##  2          2 "KA8 9NB"    
    ##  3          3 "PA152TY "   
    ##  4          4 "G 4 2 9 B A"
    ##  5          5 "g207al"     
    ##  6          6 "Dg98bS"     
    ##  7          7 "DD37J    y" 
    ##  8          8 "make"       
    ##  9          9 "tiny"       
    ## 10         10 "changes"

    This is a problem that analysts across the NHS and beyond face regularly: a dataset containing postcodes arrives, but the postcodes are recorded inconsistently. Some have spaces; some don’t. Some are all upper case; some are all lower case; some are a combination of both. And some aren’t real postcodes. Often this dataset is to be joined with a larger postcode directory file to obtain some other piece of information (such as a measure of deprivation). Joining these datasets tends not to be a trivial task; sometimes a fuzzyjoin may suffice, but often it requires an arduous process of formatting the postcode variable in one dataset (or both datasets) until they look the same, before combining using a dplyr join.
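
    The arduous manual route might look something like the base R sketch below. The lookup table and its lookup_value column are invented purely for illustration, and make_key() is a hypothetical helper rather than anything from phsmethods.

```r
# A few of the inconsistently recorded postcodes from the dataset above
records <- data.frame(
  patient_id = 1:3,
  postcode = c("G26QE", "g207al", "G 4 2 9 B A"),
  stringsAsFactors = FALSE
)

# Hypothetical directory file; the lookup_value entries are made up
lookup <- data.frame(
  postcode = c("G2 6QE", "G20 7AL", "G42 9BA"),
  lookup_value = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Joining on the raw values matches nothing, because no two strings are identical
raw_join <- merge(records, lookup, by = "postcode")
nrow(raw_join)  # 0

# Build a common key on both sides: strip whitespace and convert to upper case
make_key <- function(x) toupper(gsub("[[:space:]]", "", x))
records$key <- make_key(records$postcode)
lookup$key <- make_key(lookup$postcode)

# Left join on the key so that every record is kept
clean_join <- merge(records, lookup[, c("key", "lookup_value")],
                    by = "key", all.x = TRUE)
nrow(clean_join)  # 3
```

    This works, but it leaves an extra key column lying around, does nothing to flag values which could never be postcodes, and has to be repeated for every dataset.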

    In PHS, postcodes typically follow one of two formats: pc7 format (all values have a string length of seven, with zero, one or two spaces before the last three characters as necessary); or pc8 format (all values have one space before the last three characters, regardless of the total number of characters). phsmethods::postcode() is designed to format any valid postcode, regardless of how it was originally recorded, into either of these formats. Consider the earlier dataset:

    df %>%
      dplyr::mutate(postcode = phsmethods::postcode(postcode, format = "pc8"))
    ## # A tibble: 10 x 2
    ##    patient_id postcode
    ##         <dbl> <chr>   
    ##  1          1 G2 6QE  
    ##  2          2 KA8 9NB 
    ##  3          3 PA15 2TY
    ##  4          4 G42 9BA 
    ##  5          5 G20 7AL 
    ##  6          6 DG9 8BS 
    ##  7          7 DD3 7JY 
    ##  8          8 <NA>    
    ##  9          9 <NA>    
    ## 10         10 <NA>

    df %>%
      # The format is pc7 by default
      dplyr::mutate(postcode = phsmethods::postcode(postcode)) %>%
      dplyr::pull(postcode) %>%
      nchar()
    ##  [1]  7  7  7  7  7  7  7 NA NA NA

    In hindsight, maybe it should have been called something other than postcode() to avoid the whole postcode = postcode(postcode) bit, but everyone makes mistakes when it comes to naming functions.

    phsmethods::postcode() isn’t designed to check whether a postcode actually exists; it just checks whether the input provided is a sequence of letters and numbers in the right order and of the right length to theoretically be one and, if so, formats it appropriately. If not, it returns an NA. Handily, phsmethods::postcode() comes with some warning messages to explain how many values were recorded as NA, why they were recorded as NA, and what happens to lowercase letters:

    warnings <- testthat::capture_warnings(
      df %>% dplyr::mutate(postcode = phsmethods::postcode(postcode))
    )
    writeLines(warnings)
    ## 3 non-NA input values do not adhere to the standard UK postcode format (with or without spaces) and will be coded as NA. The standard format is:
    ## • 1 or 2 letters, followed by
    ## • 1 number, followed by
    ## • 1 optional letter or number, followed by
    ## • 1 number, followed by
    ## • 2 letters
    ## Lower case letters in any input value(s) adhering to the standard UK postcode format will be converted to upper case
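
    For the curious, the check and the two formats can be approximated in a few lines of base R. This is only a sketch built from the format listed in the warning above; it is not the actual phsmethods implementation, and the name format_postcode_sketch() is hypothetical.

```r
# Hypothetical re-implementation for illustration; not the phsmethods source.
format_postcode_sketch <- function(x, format = c("pc7", "pc8")) {
  format <- match.arg(format)

  # Remove all whitespace and convert lower case letters to upper case
  cleaned <- toupper(gsub("[[:space:]]", "", x))

  # 1 or 2 letters, 1 number, 1 optional letter or number, 1 number, 2 letters
  pattern <- "^[A-Z]{1,2}[0-9][A-Z0-9]?[0-9][A-Z]{2}$"
  valid <- !is.na(x) & grepl(pattern, cleaned)

  # Split into everything before the last three characters, and the last three
  n <- nchar(cleaned)
  outward <- substr(cleaned, 1, n - 3)
  inward <- substr(cleaned, n - 2, n)

  formatted <- if (format == "pc8") {
    # Always exactly one space before the last three characters
    paste(outward, inward)
  } else {
    # Pad the first part to four characters so the total length is seven
    sprintf("%-4s%s", outward, inward)
  }

  # Anything that failed the check is returned as NA
  ifelse(valid, formatted, NA_character_)
}

format_postcode_sketch(c("G26QE", "g207al", "G 4 2 9 B A", "make"), "pc8")
# "G2 6QE" "G20 7AL" "G42 9BA" NA

nchar(format_postcode_sketch(c("G26QE", "KA8 9NB", "PA152TY ")))
# 7 7 7
```

    Unlike the real function, the sketch stays silent about the values it turns into NA, which is exactly the sort of difference that creeps in when everyone rolls their own.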

    Next steps

    The seven functions which comprised the 0.1.0 release of phsmethods were all written by members of TPP. However, the package is not intended to be a vanity exercise for the team. Those initial functions were designed to get the package off the ground, but now anyone in PHS can contribute. With that in mind, the 0.2.0 release contains four functions written by PHS analysts who are not part of TPP.

    Unlike many R packages, phsmethods will, at any given time, have two or three maintainers, each with an equitable say in the direction the package takes. When one maintainer moves on or gets fed up, someone else will be sourced from the pool of analysts, and the show will go on. In theory at least.

    There are moderately extensive guidelines for anyone who wishes to contribute to phsmethods to follow. Proposed contributions should be submitted as GitHub issues for the maintainers to consider and discuss. If approved, the contributor makes a branch and submits a pull request for one or more of the maintainers to review. The hope is that, by creating an issue prior to making a contribution, no one will devote time and energy to something which can’t be accepted, and that no duplication of effort will occur in the case of multiple people having the same idea.

    It’s not expected that everyone who wishes to contribute to phsmethods will understand, or have experience of, all of the nuances of package development. Many won’t have used version control; most won’t have written unit tests or used continuous integration before. That’s okay though. The only requirement for someone to contribute code to phsmethods is that they should know how to write an R function. The package maintainers are there to help them learn the rest.

    Hopefully phsmethods will provide a safe incubator for analysts who wish to develop their skills in writing, documenting and testing R functions prior to making their own open source contributions elsewhere. And hopefully the knock-on effect of more analysts developing these skills will be improved coding standards permeating throughout PHS. If not, it’ll at least look good having a package pinned on my GitHub profile.