• NHS-R Community datasets package released

    This post briefly introduces an R package created for the NHS-R Community to help us learn and teach R.

    Firstly, it is now available on CRAN, the major package repository for R, and can be installed like any other package, or directly from GitHub as follows:

    install.packages("NHSRdatasets")
    
    #or
    
    remotes::install_github("https://github.com/nhs-r-community/NHSRdatasets")


    Why?

    Several community members have mentioned the difficulty of learning and teaching R using standard teaching datasets. Stock datasets like iris, mtcars, nycflights13 etc. are all useful, but they are out of context for most NHS, Public Health and related staff. The purpose of this package is to provide example datasets related to health, or reusing real-world health data, that can be accessed easily and used to communicate examples, learn R in context, or teach R skills in a familiar setting.

    For those of us wanting to contribute to Open Source software, or practise using Git and GitHub, it also provides an opportunity to learn/practise these skills by contributing data.


    What’s in it?

    At present we have two data sets:

    Name            Contributor   Summary
    LOS_model       Chris Mainey  Simulated hospital data, originally for learning regression modelling
    ae_attendances  Tom Jemmett   NHS England’s published A&E attendances, breaches and admission data

    Both datasets have help files that explain the source, data columns, and give examples of use. The package also contains vignettes that show the datasets being put to use. You can find them in the help files, or using:

    # list available vignettes in the package:
    vignette(package="NHSRdatasets")
    
    # Load the `ae_attendances` vignette:
    vignette("ae_attendances")


    How can you contribute?

    We are aiming to build a repository of different health-related datasets. We are particularly keen to include primary care, community, mental health, learning disabilities, or public health data.

    Please head to the GitHub page, and follow the submission guidelines on the README page. If you would like to add a vignette or alter an existing one, follow the same instructions: fork the repository, commit and push your changes, then make a ‘pull request.’ We will then review your submission before merging it into the package.

    Please ensure that any data you submit is in the public domain, and that it meets all obligations under GDPR and NHS/Public Health information governance rules.

    To see the package code, or contribute, visit https://github.com/nhs-r-community/NHSRdatasets .


    Summary

    The NHSRdatasets package is a free, collaborative package of datasets that gives context for the NHS-R community when learning or teaching R. The package is available on CRAN. We actively invite you to submit more datasets and help us build this package.

    The reference material and vignettes can be found in the package help files or on the pkgdown website: https://nhs-r-community.github.io/NHSRdatasets/

    Chris Mainey
    Intelligence Analyst

    Health Informatics – University Hospitals Birmingham NHS Foundation Trust

  • Exact Matching in R

    Exact matching in R: a case study of the health and care costs associated with living alone, using linked data from the London Borough of Barking & Dagenham

    I’ve been working with a group of analysts in East London who are interested in joined-up health and social care data. They’ve created a powerful, unique dataset that shows how each resident of the London Borough of Barking & Dagenham interacts with NHS and council services. The possibilities are enormous, both in terms of understanding the way that people use services across a whole system, and for more general public health research.

    You can download a sample of the data here:

    Today we’re interested in whether social isolation is related to healthcare costs, and we’re going to use exact matching to explore this issue. Our theory is that people who live alone have higher healthcare costs because they have less support from family members.

    Here’s an extract from the data. ‘Total cost’ means the cost of health and social care in one year. The actual dataset has many more variables!

    ID  Age group  Sex     Number of long-term conditions  Lives alone  Total cost
    1   50-64      Male    0                               No           £93
    2   65-74      Female  1                               Yes          £0
    3   50-64      Male    4                               No           £1065
    4   85+        Female  5                               Yes          £7210

    For the purposes of this blog, we’ll use a sample of 5000 (out of 114,000) individuals, with some changes to the values to ensure anonymity.

    Descriptive stats

    The first thing we’ll look at is whether people who live alone are different. I find descriptive tables fiddly so I often use a function that can be re-used across a number of variables:

    # first, save the data provided above and change your working directory to the place you've saved it with setwd()
    # read data
    library(data.table) # useful for efficient reading and summarising of data
    d <- fread("sample_lbbd_dataset.csv")
     
    # describe age, sex and number of long-term conditions
    describe.by.isolation <- function(variable) {
      a <- table(unlist(d[, variable, with = F]), d$livesalone)
      p <- round(prop.table(a, 2) * 100, 0)
      matrix(paste0(a, ' (', p, ')'), ncol = 2,
             dimnames = list(row.names(a), c('Live with others', 'Lives alone')))
    }
    lapply(c('age_grp', 'sex', 'LTC_grp'), describe.by.isolation)
     
    # describe healthcare costs
    d[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
     
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(d$cost[d$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(d$cost[!d$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    People who live alone are older, more likely to be female and have more long-term health problems. Their mean healthcare costs are £2,347 higher. The difference in healthcare costs is visible on the histograms, which we have displayed on a log scale because some values are extremely high. There’s some ‘zero inflation’ in both groups – people with no healthcare costs who do not fit into the otherwise lognormal-ish distribution. So far not too surprising – but are the increased healthcare costs explained by the differences in age, sex and health?

    Regression

    One approach would be to use regression. We could fit a linear model – this is actually not a great fit for patients with higher healthcare costs, but we won’t go into that here.

    linear_model <- lm(cost ~ livesalone + age_grp + sex + LTC_grp, d)
    plot(linear_model) # diagnostics show that a linear model is not a great fit. You might have to press return to see all the plots before you can continue.
    summary(linear_model)

    The results suggest that health and care costs for people who live alone are £892 more than those who do not live alone, on average. Clearly the variables we added to the model are important confounders, as this is much lower than the ‘crude’ difference of £2,347. Two limitations of this approach are that we will have to think quite carefully about the fit of the model to the data, and that we can’t describe how living alone affects the distribution of costs.

    Exact matching

    We therefore used exact matching, in which each individual who lives alone is matched to an individual who does not live alone, based on specified variables such as age group and sex. We developed a function that does this by stratifying the data and then matching ‘cases’ and ‘controls’ randomly. Unmatched individuals are deleted from the dataset, leaving a dataset that is balanced in terms of the variables you specify. Let me know here if there’s any other functionality you want and we can try and incorporate it.

    Let’s try matching on age group, sex and the grouped number of long-term conditions:

    # data = dataset containing:
    # - treatment/exposure variable 'treat' (a string specifying variable name).
    # - matching variable 'mvar' (a string specifying variable name). If you want to match on multiple variables, concatenate them first.
    # other inputs are:
    # - ratio of cases:controls (an integer > 0)
    # - seed for fixing random selection of cases/controls (an integer; default NULL means no seed). Choice of seed is arbitrary.
    # returns data.table of matched observations, with additional variable 'id' for use in paired/grouped analyses
     
    smatch <- function (data, treat, mvar, ratio = 1, seed = NULL) {
      setnames(data, mvar, '.mvar')
      # count the cases and controls available within each stratum
      targ <- data[, .(case = sum(get(treat)), control = sum(!get(treat))), .mvar]
      # cst = number of cases we can match; cnt = number of controls needed
      targ[, cst := floor(pmin(control / ratio, case))]
      targ[, cnt := cst * ratio]
      targ <- targ[cst > 0]
      # build blocks of pair/group ids, one block per stratum
      l2 <- cumsum(targ$cst)
      ids <- mapply(':', c(0, l2[-nrow(targ)]), l2 - 1)
      names(ids) <- targ$.mvar
      case <- NULL
      control <- NULL
      # fix the random selection without disturbing the global random stream
      x <- .Random.seed
      set.seed(seed)
      on.exit({.Random.seed <- x})
      # sample cases and controls at random within each stratum and assign ids
      for (i in targ$.mvar) {
        case[[i]] <- data[get(treat) == T & .mvar == i][sample(.N, targ$cst[targ$.mvar == i])]
        case[[i]][, id := ids[[i]]]
        control[[i]] <- data[get(treat) == F & .mvar == i][sample(.N, targ$cnt[targ$.mvar == i])]
        control[[i]][, id := rep(ids[[i]], each = ratio)]
      }
      rbindlist(c(case, control))
    }
     
    # create a single variable summarising matching variables
    d$mvar <- do.call('paste0', d[, c('age_grp', 'sex', 'LTC_grp')])
     
    # create 1:1 matched dataset.
    matched_data <- smatch(d, treat = 'livesalone', mvar = 'mvar', ratio = 1, seed = 74)
     
    # check balance: same number of individuals in each group
    dcast(matched_data, age_grp + sex + LTC_grp ~ livesalone, value.var = 'id', fun.aggregate = length)

    Now we have a dataset that is balanced in terms of age, sex and the count of long-term conditions. Let’s see how healthcare costs compare:

    matched_data[, .(mean_cost = mean(cost), sd = sd(cost), med_cost = median(cost), iqr = IQR(cost)), livesalone]
     
    # histograms of cost
    par(mfrow = c(2, 1), mar = c(3, 3, 2, 0))
    hist(log(matched_data$cost[matched_data$livesalone] + 1), main = 'Lives alone', xlab = 'Log cost + 1', col = 'red', xlim = c(0, 14))
    hist(log(matched_data$cost[!matched_data$livesalone] + 1), main = 'Does not live alone', xlab = 'Log cost + 1', col = 'green', xlim = c(0, 14))

    # t-test (in reality you might want a paired test, and to check whether a t-test is appropriate)
    t.test(cost ~ livesalone, matched_data) # notice how wide the confidence intervals are for this reduced dataset

    # proportion with costs over £10000
    matched_data[, .(over10k = sum(cost > 10000) / .N), livesalone]
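
    As the code comment above says, a paired test may be more appropriate. A minimal sketch using the id variable that smatch creates, assuming a 1:1 match so that each id holds exactly one case and one control:

    # reshape so each row is a matched pair, then test the within-pair differences
    pairs <- dcast(matched_data, id ~ livesalone, value.var = 'cost')
    t.test(pairs$`TRUE`, pairs$`FALSE`, paired = TRUE)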

    The mean difference is £803. When we used the whole dataset, this value was even closer to the coefficient from linear regression. It’s now difficult to see a difference in the histograms, but you can easily create any description of the distribution that you like – e.g. the proportion of patients that have costs over £10,000.

    Who got matched and who didn’t?

    The point of matching was to create a comparison group of people who don’t live alone who were in some ways similar to the group who do. We probably had to delete lots of people who don’t live alone in a systematic way (e.g. men and younger people who do not live alone were more likely to be deleted). We might also have deleted some of the group who do live alone, which could be more problematic if we want to generalise our results to the population. Let’s see who got deleted…

     d[, matched := ifelse(PID %in% matched_data$PID, 'matched', 'unmatched')]
     
    # just looking at age group for now
    compare_matched <- dcast(d, age_grp ~ livesalone + matched, value.var = 'PID', fun.aggregate = length)
    compare_matched[, 'TRUE_total' := TRUE_matched + TRUE_unmatched]
    compare_matched[, lapply(.SD, function(x) x/sum(x) * 100), .SDcols = 2:6]

    You can see that some of the ‘lives alone’ group got deleted (‘TRUE_unmatched’), and they were all in the older age groups. The difference between everyone who lives alone (‘TRUE_total’) and the matched group (‘TRUE_matched’) is diluted, because the oldest groups are a relatively small part of the data. Nonetheless, I would say this is a fairly important difference. If you are concerned about generalisability to the population, you might want to restrict the analysis to people aged under 85. In the full dataset this was not a problem (as there were lots more potential ‘controls’), but you might encounter similar issues if you match on more detailed variables.

    Final point! We’ve focused on a technical solution to exact matching. We haven’t really thought about which variables we should be matching on. This is a matter of judgement, and needs a bit of thinking before we dive into the matching process. Just like covariates in regression, the matching variables are confounders – they’re alternative explanations for an association between living alone and healthcare costs. Age and sex are clearly important. Health status is more complex. Do you want to think about long-term conditions as something that cause people to live alone, or the product of living alone? If people are sicker because they live alone, matching on the number of long term conditions might falsely reduce the association between living alone and healthcare costs.

    This blog was written by Dan Lewer, NIHR Doctoral Research Fellow / SpR Public Health. Department of Epidemiology & Public Health, UCL

    Twitter: danlewer

  • Count of working days function

    It’s at this time of year I need to renew my season ticket and I usually get one for the year. Out of interest, I wanted to find out how much the ticket cost per day, taking into account I don’t use it on weekends or my paid holidays. I started my workings out initially in Excel but got as far as typing the formula =WORKDAYS() before I realised it was going to take some working out and perhaps I should give it a go in R as a function…

    @ChrisBeeley had recently shown me functions in R and I was surprised how familiar they were, as I’ve seen them on Stack Overflow (usually skimmed over) and they are similar to functions in SQL, which I’ve used but not written, where you feed in parameters. When I write code I try to work out how each part works and build it up, but writing a function requires running the whole thing and then checking the result: the objects created inside the function never materialise, so they are never available to check. Not having objects building up in the Environment pane is one of the benefits of using a function, that and not repeating scripts which then ALL need updating if something changes.

    Bus ticket function

    This is the final function; if you run it, you’ll see that it just creates a function.

    #Week starts on Sunday (1)
    library(lubridate) # for dmy(), %m+% and wday()
    library(timeDate)  # for holidayLONDON()

    DailyBusFare_function <- function(StartDate, EmployHoliday, Cost, wfh){
      startDate <- dmy(StartDate)
      endDate <- as.Date(startDate) %m+% months(12)
      #Now build a sequence between the dates:
      myDates <- seq(from = startDate, to = endDate, by = "days")
      #weekdays, minus bank holidays, paid holiday and days worked from home
      working_days <- sum(wday(myDates) > 1 & wday(myDates) < 7) -
        length(holidayLONDON(year = lubridate::year(startDate))) - EmployHoliday - wfh
      per_day <- Cost/working_days
      print(per_day)
    }

    Running the function you feed in parameters which don’t create their own objects:

     DailyBusFare_function("11/07/2019",27,612,1) 

    [1] 2.707965

    Going through each line:

    To make sure each part within the function works, it’s best to write it in another script, running just the bit between the curly brackets {}.

    Firstly, the startDate is self-explanatory, but within the function I’ve set the endDate to depend upon the startDate, so it is automatically 1 year later.

    Originally, when I was trying to find the “year after” a date, I found some documentation about lubridate’s dyears() function:

    #Next couple of lines needed to run the endDate line!
    library(lubridate)
    startDate <- dmy("11/07/2019")
    endDate <- startDate + dyears(1)

    but this adds a fixed duration rather than a calendar year, so it doesn’t take leap years into account. I only remember this because 2020 will be a leap year, so the end date I got was a day out!

    Instead, @ChrisBeeley suggested the following:

    endDate <- as.Date(startDate) %m+% months(12)
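
    A quick way to see the difference (a sketch; the exact output depends on your lubridate version, but either way dyears(1) lands a day short):

    library(lubridate)
    dmy("11/07/2019") + dyears(1)      # a day out, because 2020 is a leap year
    dmy("11/07/2019") %m+% months(12)  # "2020-07-11", exactly a year later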

    Next, the code builds a sequence of days. I got the idea of building up the days from blogs on getting the days between two dates, but it has also come in useful when plotting over time, for example in SPC charts where some time periods are not in the dataset but would make sense appearing as a 0 count.

    library(lubridate)

    #To run so that the sequencing works
    #using as.Date() returns incorrect date formats 0011-07-20 so use dmy from
    #lubridate to transform the date
      startDate <- dmy("11/07/2019")
      endDate <- as.Date(startDate) %m+% months(12)
    #Now build a sequence between the dates:
      myDates <- seq(from = startDate, to = endDate, by = "days")

    All of these return values, so trying to open them from the Global Environment won’t do anything. It is possible to view the first part of the values, though, by typing:

    #compactly displays the structure of object, including the format (date in this case)
    str(myDates)

    Date[1:367], format: “2019-07-11” “2019-07-12” “2019-07-13” “2019-07-14” “2019-07-15” …

    #gives a summary of the structure
    summary(myDates)
            Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
    "2019-07-11" "2019-10-10" "2020-01-10" "2020-01-10" "2020-04-10" "2020-07-11"

    To find out what a function does type ?str or ?summary in the console. The help file will then appear in the bottom right Help screen.

    Next I worked out working_days. I got the idea from a blog which said to use length and which:

    working_days <- length(which((wday(myDates)>1&wday(myDates)<7)))

    Note that the value appears as 262L which is a count of a logical vector. Typing ?logical into the Console gives this in the Help:

    Logical vectors are coerced to integer vectors in contexts where a numerical value is required, with TRUE being mapped to 1L, FALSE to 0L and NA to NA_integer_.

    I was familiar with “length”, which essentially does a count of factors or vectors (type ?length into the Console for information), but “which” wasn’t something I knew about. @ChrisBeeley explained what which does with the following example:

    #Generate a list of random logical values
    a <- sample(c(TRUE, FALSE), 10, replace = TRUE)
    #Look at list
    a

    [1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE

    #using which against the list gives the positions in the list where the value = TRUE
    which(a)

    [1] 4 5 8

    #counts how many logical arguments are in the list (should be 10)
    length(a)

    [1] 10

    #counts the number of TRUE logical arguments
    length(which(a))

    [1] 3

    Then @ChrisBeeley suggested just using “sum” instead of length(which()) which counts a logical vector:

    sum(a)

    [1] 3

    It seems this has been discussed on Stack Overflow before: https://stackoverflow.com/questions/2190756/how-to-count-true-values-in-a-logical-vector

    It’s worth noting that sum will return NA if the vector contains any NAs, so the example on Stack Overflow suggests adding:

    sum(a, na.rm = TRUE)

    [1] 3

    This won’t affect the objects created in this blog as there were no NAs in them but it’s just something that could cause a problem if used in a different context.

    working_days <- sum(wday(myDates)>1&wday(myDates)<7)
    #Just to check adding na.rm = TRUE gives the same result
      working_days <- sum(wday(myDates)>1&wday(myDates)<7, na.rm = TRUE)

    I then wanted to take into account bank/public holidays as I wouldn’t use the ticket on those days so I used the function holidayLONDON from the package timeDate:

    length(holidayLONDON(year = lubridate::year(startDate)))

    [1] 8

    I used lubridate::year because the timeDate function has a parameter called year, so the code would otherwise read year = year(startDate), which is confusing to me, let alone the function!

    Again, I counted the vector using length. This is a crude way of getting bank/public holidays, as it looks at a calendar year rather than the actual period (July to July in this case).
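
    A sketch of one way round this, assuming holidayLONDON() accepts a vector of years: pull the holidays for both calendar years, then keep only those falling inside the ticket period.

    # count only the holidays that fall within the ticket period
    hols <- as.Date(holidayLONDON(year = c(year(startDate), year(endDate))))
    sum(hols >= startDate & hols <= endDate)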

    I did look at a package called bizdays, but whilst that seemed to be good for building a calendar I couldn’t work out how to make it work, so I just stuck with the timeDate package. As I get more confident in R it might be something I look into the code of, because all packages are open source and available to view through CRAN or GitHub.

    Back to the function…

    I then added “- EmployHoliday” so I could reduce the days by my paid holidays and “- wfh” to take into account days I’ve worked from home and therefore not travelled into work.

    The next part of the code takes the entered Cost and divides it by working_days, printing the output to the screen:

    per_day <- Cost/working_days

    print(per_day)

    And so the answer to the cost per day is printed in the Console:

    DailyBusFare_function("11/07/2019",27,612,1)

    [1] 2.707965

    A conclusion… of sorts

    Whilst this isn’t really related to the NHS it’s been useful to go through the process of producing a function to solve a problem and then to explain it, line by line, for the benefit of others.

    I’d recommend doing this to further your knowledge of R at whatever level you are, and particularly if you are just learning or consider yourself a novice, as blogs don’t always detail the reasons why things were done (or why they were not done because it all went wrong!).

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @AppliedInfoNott and @Letxuga007

    The R Markdown file is available here:
    https://github.com/nhs-r-community/NHSR_blogs/tree/master/Count_of_working_days_ZTurner_201907

  • Format ONS spreadsheet

    Background

    A Public Health consultant colleague Ian Bowns (@IantheBee) created a report to monitor mortality within the Trust and he used the ONS weekly provisional data for the East Midlands to compare the pattern and trends of deaths over time. This involves downloading a file from:

    https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales

    which is updated weekly. Once a month I manually add numbers from this to another spreadsheet, which is imported into R for the overall analysis.

    Downloaded file formats

    You may be familiar with the format of ONS and other NHS data spreadsheets; if you are not, here are some of the issues:

    • data is presented in wide form, not long form (so columns are rows and vice versa)
    • the sheets are formatted for look rather than function with numerous blank rows and blank columns
    • there are multiple sheets with information about methodology usually coming first. This means a default upload to programmes like R is not possible, as they pick up the first sheet/tab
    • the file name changes each week and includes the date which means any code to pick up a file needs to be changed accordingly for each load
    • being Excel, when this is imported into R there can be problems with the date formats, which can get lost to the Excel serial number, and
    • they include a lot of information and you often only need a fraction of it

    Given these difficulties there is great temptation, as happened with this, to just copy and paste what you need. This isn’t ideal for these reasons:

    • it increases the likelihood of data entry input error
    • it takes time and
    • it is just so very very tedious

    The solution is, always, to automate. Tools like Power Pivot in Excel or SSIS could work, but as the final report is in R it makes sense to tackle this formatting in R, and this is the result…

    Import file

    For this you can either save the file manually or use the following code within R. Save it to the same place where the code is running and you should see the file in the bottom right window under the tab ‘Files’. The best way to do this is to use a project and open the script within that project.

    download.file("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2019/publishedweek082019.xls",
                  destfile = "DeathsDownload.xls",
                  method = "wininet", #use "curl" for OS X / Linux, "wininet" for Windows
                  mode = "wb") #wb means "write binary"

    Note that this file’s name and URL change each week, so the code needs changing each time it is run.

    Once the file is saved, use readxl to import it, which means the file doesn’t need its format changing from the original .xls.

    When I upload this file I get warnings which are related, I think, to the Excel serial numbers appearing where dates are expected.

    • sheet = : refers to the sheet I want. I think this has to be numeric and doesn’t use the tab’s title.
    • skip = : is the number of top rows to ignore.

    library(readxl)

    DeathsImport <- read_excel("DeathsDownload.xls", 
                               sheet = 4,
                               skip = 2)
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in C5 / R5C3: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in F5 / R5C6: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in G5 / R5C7: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in H5 / R5C8: got a date
    ## Warning in read_fun(path = enc2native(normalizePath(path)), sheet_i =
    ## sheet, : Expecting numeric in I5 / R5C9: got a date
    ## New names:
    ## * `` -> `..2`

    Formatting the data

    The next code creates a vector that is used in the later code, in a similar way to SQL’s IN but without typing out the list within the code itself, for example:

    • SQL : WHERE city IN (‘Paris’,‘London’,‘Hull’)
    • R : filter(week_number %in% LookupList)

    These lines of code are base R code and so don’t rely on any packages.

    LookupList <- c("Week ended",
                "Total deaths, all ages",
                "Total deaths: average of corresponding",
                "E12000004"
    )

    The next bit uses the dplyr package, which is loaded as part of tidyverse, as well as the janitor package. Not all packages are compatible with tidyverse, but many are, as it is often the go-to data manipulation package.

    As an aside the %>% is called a pipe and the shortcut is Shift + Ctrl + m. Worth learning as you’ll be typing a lot more if you type out those pipes each time.

    Janitor commands

    • clean_names: removes spaces in column headers and replaces them with _
    • remove_empty: gets rid of empty rows and columns – this dataset has a lot of those!

    Dplyr command

    • filter: looks just for the rows with the words from the list ‘LookupList’. These will become the column names later.

    library(tidyverse)
    library(janitor)

    DeathsImport2 <- DeathsImport %>% 
      clean_names %>% 
      remove_empty(c("rows","cols")) %>% 
      filter(week_number %in% LookupList) 

    There are great commands called gather and spread which can be used to move wide-form data to long and vice versa, but here I noticed that I just needed to turn the data on its side, so I used t(). This is also useful as it turns the data frame into a matrix: you can see this by looking in the ‘Environment’ window in the top right of RStudio, where there is no blue circle with an arrow next to t_DeathsImport.

    t_DeathsImport <- t(DeathsImport2)

    Being a matrix is useful as the next line of code makes the first row into column headers and this only works on a matrix.

    colnames(t_DeathsImport) <- t_DeathsImport[1, ]

    Dplyr gives an error on matrices:

    Code:

    t_DeathsImport %>% 
      mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))

    Result:

    Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "c('matrix', 'character')"

    As later code will need dplyr, turn the matrix into a dataframe using some base R code:

    t_DeathsImport <- as.data.frame(t_DeathsImport)

    Previous dplyr code filtered using %in% and it’s natural to want a ‘not in’, but it doesn’t exist! However, cleverer minds have worked out a function:

    '%!ni%' <- function(x,y)!('%in%'(x,y))

    The text between the quote marks can be anything, but I like ‘%!ni%’ as it’s reminiscent of Monty Python.
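
    A quick check of what the new operator returns:

    c('Paris', 'Hull') %!ni% c('Paris', 'London')

    [1] FALSE  TRUE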

    Because of the moving around of rows to columns, the data frame now has an unnecessary row of column names, as well as a row with just ‘East Midlands’ in one of the columns, so the following ‘remove’ list is a workaround to get rid of these two lines.

    remove <- c("E12000004", "East Midlands")

    The next code uses the above list, followed by a mutate, which is followed by the janitor command ‘excel_numeric_to_date’. The name tells it like it is but, as often happens, the data needs to be changed to character and then to numeric first. The date_system = "modern" argument isn’t needed for this data, but I took it from the internet and it worked, so I left it.

    A warning will appear about NAs (nulls).

    t_DeathsImport <- t_DeathsImport %>% 
      filter(E12000004 %!ni% remove) %>% 
      mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)), date_system = "modern"))
    ## Warning in excel_numeric_to_date(as.numeric(as.character(`Week ended`)), :
    ## NAs introduced by coercion

    Now to deal with this mixing of real dates with Excel serial numbers.

    Firstly, the following code uses base R to confirm real dates are real dates which conveniently wipes the serial numbers and makes them NAs.

    t_DeathsImport$`Week ended` <- as.Date(t_DeathsImport$`Week ended`, format = '%Y-%m-%d')

    This results in two columns:

    • Week ended which starts off with NAs then becomes real dates and
    • serialdate which starts off with real dates and then NAs.

    The human eye and brain can see that these two follow on from each other and just, somehow, need to be squished together; the code to do it is as follows:

    t_DeathsImport <- t_DeathsImport %>% 
      mutate(date = if_else(is.na(`Week ended`),serialdate,`Week ended`))

    To translate the mutate: this creates a new column called date which, if Week ended is null, takes the serialdate; otherwise it takes the Week ended.

    Interestingly, if ‘ifelse’ (without the underscore) is used, it converts the dates to integers, and these are not the same as the Excel serial numbers, so use ‘if_else’!
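
    A quick illustration of that behaviour (a sketch with an arbitrary date; if_else comes from dplyr, loaded above via tidyverse):

    x <- as.Date("2019-03-08")
    ifelse(is.na(x), NA, x)            # base ifelse drops the Date class and returns a bare number
    if_else(is.na(x), as.Date(NA), x)  # dplyr's if_else keeps the Date class: "2019-03-08"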

    And that’s it.

    Or is it?

    You might want to spit the data frame back out into Excel, and that’s where a different package called openxlsx can help. As with many things in R, “other packages are available”.

    library(openxlsx)

    write.xlsx(DeathsImport, 'ImportProvisionalDeaths.xlsx')

    If you haven’t used a project (which is really the best way to work), this will probably save to some obscure C: drive location, which you’ll see in the bottom left ‘Console’ just under the tabs for ‘Console’ and ‘Terminal’. Using a project means you set the pathway, so the file saves in the same place and will also appear in the bottom right panel under ‘Files’.

    Feedback

    I’m pretty early on in my journey in R and many of my colleagues still haven’t started yet so I’m throwing this out there so everyone can see it, newbies and old hands alike. If you spot anything, can explain anything further, need more explanation or can offer any alternatives to what I’ve done please please feel free to comment.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

  • NHS open source public datasets – creating realistic synthetic datasets

    This blog is an initial attempt to garner interest in a project to create NHS synthetic datasets in a range of fields and also to understand the underlying principles around creating synthetic healthcare data and its intricacies.

    Healthcare data is increasingly electronic. With the huge datasets that the NHS is in the process of collecting for patient care comes an impetus to standardise data collection between hospitals and trusts. This is good for patient care not least because it allows for standardised analysis of large datasets. This standardisation is in progress across many of the medical specialities but several have already reached a mature stage. A good example is the National Endoscopy Database which aims to automatically collect data from all endoscopic procedures in the UK according to a standardised template. This will allow for the analysis of variation in endoscopic performance, quality and outcomes amongst many other outputs.

    The analysis of these datasets will need to be validated, reproducible and, of course, progressive as the desired metrics change. The analyses will therefore require two things:

    1. Ongoing input from analysts to maintain the methodology and code base to perform the analysis.
    2. Creative ideas for the representation of the datasets.

    NHS datasets and the methods for their manipulation are likely to attract a lot of interest from diverse sources, from pharmaceutical companies to healthcare software developers and academic researchers. By restricting access to the datasets because of privacy issues we also, by necessity, restrict the speed at which solutions can be found using them, as analysis will only be carried out by a small pool of authorised analysts.

    The obvious solution here is to create NHS datasets which are constructed according to accepted and used data templates, but to populate them with synthetic data and then to allow open source access to these datasets.

    Synthetic data is not always easy to create. The vast majority of NHS electronic data is still semi-structured free text. Reports also have to make sense internally so that, for example, an endoscopy report describing a stomach ulcer has to contain text that is relevant to the ulcer finding. It gets even more complex when further reports are written that reference a report in another dataset. An example is the histopathology report from a biopsy taken from the stomach ulcer. This biopsy report will obviously have to be reporting on the stomach ulcer and the text will be about pathology findings relevant to the description of an ulcer.

    An example of an attempt at creating a synthetic medical dataset, based on the above endoscopy example, can be found at https://github.com/sebastiz/FakeEndoReports. It contains some description of how the reports are created, and from it I have tried to derive some general principles for making fake healthcare datasets.

    There are of course many other datasets that would be incredibly powerful if they were created and open sourced. One such dataset is the NHS patient administration system, on which most statistics about waiting times and patient pathways are based. Another is Hospital Episode Statistics (HES), which collects information regarding all NHS appointments (in-patient and outpatient) and is being used to create data linkage between a wide range of data repositories. (My attempt at creating synthetic HES data can be found here, but is still incomplete at the moment: https://github.com/sebastiz/HesMineR)

    R is the perfect language to create such synthetic datasets and it would be a valuable addition to the NHS-R armamentarium to have a package that contained synthetic NHS datasets so that open source solutions can be more quickly and creatively derived.
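
    As a flavour of what such a package could contain, here is a minimal sketch of generating a synthetic patient-level table in R. Every column name and category is illustrative rather than taken from a real NHS template:

    # a minimal, illustrative synthetic dataset; not based on any real NHS template
    set.seed(42)
    n <- 100
    synthetic_patients <- data.frame(
      id      = seq_len(n),
      age_grp = sample(c("0-17", "18-49", "50-64", "65-74", "75-84", "85+"), n, replace = TRUE),
      sex     = sample(c("Male", "Female"), n, replace = TRUE),
      ltc     = rpois(n, 1.5),                                 # long-term conditions, skewed low
      cost    = round(rlnorm(n, meanlog = 6, sdlog = 1.5)) *   # right-skewed costs...
                rbinom(n, 1, 0.9)                              # ...with some zero inflation
    )
    head(synthetic_patients)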