• Dygraphs

    Dygraphs for mortality surveillance

    I recently presented some of the mortality surveillance charts we use to @RLadiesLondon (a very welcoming group!) and one that got some interest was a chart of Nottinghamshire Healthcare NHS Foundation Trust deaths compared to ONS Provisionally Registered deaths. The chart looks good because it’s interactive, but this type of chart can be confusing because of its two y-axes.

    When I show this report I make it clear that the two axes units are very different and that its purpose is to show that the pattern of the deaths in the wider population matches that of the deaths recorded by the Trust. It’s well known within Public Health that the pattern of deaths is seasonal, with a peak around January in the UK. However, this Public Health knowledge is not necessarily common knowledge in Secondary Care Trusts and it was one of the great benefits of having @IantheBee both create and present this report.

    Getting ONS Provisional Data

    I wrote about getting and formatting the spreadsheets from ONS for the East Midlands Provisionally Registered deaths:

    but for the purposes of the mortality surveillance report I’ve used several years’ data. I normally keep the spreadsheets, save the merged data and then load that each time I need to run the R Markdown report.

    For the purposes of this blog I’ve amended the formatting code by adding functions so each year can be transformed and is available to plot:


    # Download ONS spreadsheets function ---------------------------------------

    GetData_function <- function(www, file){
      download.file(www,
                    destfile = file,
                    method = "wininet", # use "curl" for OS X / Linux, "wininet" for Windows
                    mode = "wb") # wb means "write binary"
    }

    # 2016
    GetData_function("https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales/2016/publishedweek522016.xls",
                     "DeathsDownload2016.xls")

    # Import correct sheets ----------------------------------------------------
    Deaths_Now  <- read_excel("DeathsDownload2019.xls",
                              sheet = 4,
                              skip = 2)
    Deaths_2018 <- read_excel("DeathsDownload2018.xls",
                              sheet = 4,
                              skip = 2)
    Deaths_2017 <- read_excel("DeathsDownload2017.xls",
                              sheet = 4,
                              skip = 2)
    Deaths_2016 <- read_excel("DeathsDownload2016.xls",
                              sheet = 4,
                              skip = 2)

    # Look-up list to remove excess rows ----------------------------------------
    LookupList <- c("Week ended",
                    "Total deaths, all ages",
                    "Total deaths: average of corresponding",
                    "E12000004")

    # Function for formatting data ----------------------------------------------
    Transform_function <- function(dataframe){

      # Format data frames
      df <- dataframe %>%
        clean_names() %>%
        remove_empty(c("rows", "cols")) %>%
        filter(week_number %in% LookupList)

      # Transpose table
      df <- t(df)

      # Whilst this is a matrix make the top row the header
      colnames(df) <- df[1, ]

      # Make this object a data.frame
      df <- as.data.frame(df)

      # Function to find 'not in'
      '%!ni%' <- function(x, y) !('%in%'(x, y))

      # List of things to remove to tidy the data.frame
      remove <- c("E12000004", "East Midlands")

      # Remove the rows and ensure dates are in the correct format
      df <- df %>%
        filter(E12000004 %!ni% remove) %>%
        mutate(serialdate = excel_numeric_to_date(as.numeric(as.character(`Week ended`)),
                                                  date_system = "modern"))

      df$`Week ended` <- as.Date(df$`Week ended`, format = '%Y-%m-%d')

      df <- df %>%
        mutate(date = if_else(is.na(`Week ended`), serialdate, `Week ended`))

      # Final transformation of data
      df %>%
        select(`Total deaths, all ages`, date) %>%
        filter(!is.na(`Total deaths, all ages`)) %>%
        mutate(`Total deaths, all ages` = as.numeric(as.character(`Total deaths, all ages`))) # To match other data.frames
    }

    # Run functions -----------------------------------------------------------
    DeathsNow <- Transform_function(Deaths_Now)
    Deaths2018 <- Transform_function(Deaths_2018)
    Deaths2017 <- Transform_function(Deaths_2017)
    Deaths2016 <- Transform_function(Deaths_2016)

    # Merge data -----------------------------------------------------
    Deaths <- bind_rows(DeathsNow,
                        Deaths2018,
                        Deaths2017,
                        Deaths2016) %>%
      mutate(date = as.Date(date),
             `Total deaths, all ages` = as.numeric(`Total deaths, all ages`))
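    As an aside, Transform_function uses dplyr’s if_else rather than base ifelse when filling the date column. That is deliberate: base ifelse drops attributes such as the Date class and returns the underlying numbers. A minimal base R sketch of the pitfall:

    ```r
    # Base ifelse() strips attributes, so dates come back as plain numbers
    d <- as.Date("2019-01-04")

    base_result <- ifelse(TRUE, d, as.Date(NA))
    class(base_result) # "numeric" - the Date class is lost

    # In base R you would have to restore the class manually;
    # dplyr::if_else is type-stable and avoids this
    restored <- as.Date(base_result, origin = "1970-01-01")
    class(restored) # "Date"
    ```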

    This code may give a few warnings saying that NAs have been introduced by coercion; this is because many cells in the spreadsheets have no data in them at all. It’s a good thing they are empty (effectively NA) rather than 0, as a 0 could confuse the analysis: it wouldn’t be clear whether it was a real 0 or missing data.
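    You can reproduce this coercion behaviour in one line: as.numeric turns anything that isn’t parseable as a number, including empty strings, into NA and raises exactly that warning:

    ```r
    # Cells read from a spreadsheet arrive as text; empty or non-numeric
    # values become NA when coerced ("NAs introduced by coercion" warning)
    x <- c("150", "", "no data")
    as.numeric(x) # 150 NA NA
    ```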

    To suppress warnings in R Markdown add warning=FALSE to the chunk header; however, I like to keep the warnings just in case.

    If you want to keep all the data after merging it together use:

    library(openxlsx) # To write to xls if required. 

    #Export complete list to excel for future
    write.xlsx(Deaths, 'ImportProvisionalDeaths.xlsx')

    If you’ve saved the combined file, to call it again in a script use the following code:

    library(readxl)

    Deaths <- read_excel("ImportProvisionalDeaths.xlsx")

    Dygraph chart

    The following data is randomly generated as an example:


    # Fix the randomly generated numbers
    set.seed(178)

    Alldeaths <- Deaths %>%
      select(date) %>%
      mutate(n = rnorm(n(), mean = 150))
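    set.seed is what makes these "random" numbers reproducible: the same seed always produces the same draws, so the example data can be regenerated exactly. A quick demonstration:

    ```r
    # The same seed always yields the same sequence of draws
    set.seed(178)
    first_run <- rnorm(3, mean = 150)

    set.seed(178)
    second_run <- rnorm(3, mean = 150)

    identical(first_run, second_run) # TRUE
    ```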

    Merge the two data frames together:

    ONSAlldeaths <- Deaths %>%
      left_join(Alldeaths, by = "date") %>%
      mutate(ds = as.POSIXct(date)) %>%
      select(ds, y2 = n, y = `Total deaths, all ages`)

    Dygraphs require dates to be in a time series format and the package xts can convert it:

    library(xts)

    ONSAlldeaths_series <- xts(ONSAlldeaths, order.by = ONSAlldeaths$ds)

    The date column is no longer needed so can be removed but this needs to be done using base R and not dplyr:

    # Remove duplicate date column
    ONSAlldeaths_series <- ONSAlldeaths_series[, -1]
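    The reason dplyr won’t do here is that an xts object is a matrix underneath, not a data frame, so verbs like select don’t apply; columns are dropped by position with base R’s negative indexing. The same idea on a plain matrix (stand-in column names for illustration):

    ```r
    # xts objects are matrices underneath, so columns are removed by
    # position with base R indexing rather than dplyr::select()
    m <- matrix(1:6, ncol = 3, dimnames = list(NULL, c("ds", "y", "y2")))

    m2 <- m[, -1] # drop the first column
    colnames(m2)  # "y" "y2"
    ```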

    Finally, the xts can be plotted:

    library(dygraphs)

    dygraph(ONSAlldeaths_series, main = "East Midlands Weekly Deaths/Randomly generated numbers") %>%
      dySeries("y", axis = "y", label = "East Midlands") %>%
      dySeries("y2", axis = "y2", label = "Random numbers") %>%
      dyAxis("x", drawGrid = FALSE) %>%
      dyAxis("y", drawGrid = FALSE, label = "East Midlands") %>%
      dyAxis("y2", independentTicks = TRUE, drawGrid = FALSE, label = "Random numbers") %>%
      dyOptions(stepPlot = TRUE) %>%
      dyRangeSelector()

    Note this chart is not interactive in this blog.

    When you’ve plotted the chart, hovering the cursor over the points shows information about them. You can also zoom in using the scrolling bar at the bottom of the chart (this comes from the dyRangeSelector() code).

    Other options are detailed here: https://rstudio.github.io/dygraphs/index.html

    ONS Provisional data

    One of the things that may stand out in the chart are the big dips around 29-31 December each year; we presume these relate to weeks where registrations may be delayed from GP practices to ONS because of the public holidays around Christmas.

    Unfortunately, only provisionally registered deaths are available by week, as confirmed deaths are only published monthly: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/monthlyfiguresondeathsregisteredbyareaofusualresidence


    The case of the mysteriously disappearing interactive graph in R Markdown html output

    I’d rendered (knit or run) the html reports with the interactive graphs and it had all worked so I emailed the report to people as promised and then the emails came back: “Some of the graphs are missing, can you resend?”. Perplexed, I opened the saved file from the server and, yes, indeed some of the charts had disappeared! Where there should be lovely interactive charts were vast swathes of blank screen. What had happened? The code ran fine, looked fine and how do you even google this mystery?

    It turns out my default browser, and I suspect this is true throughout most of the NHS because lots of NHS systems depend on it, is Microsoft Internet Explorer. Whilst I have the latest version, these reports have never opened properly in Internet Explorer.

    The solution: Chrome (or some other browser). I ask people to copy the link from the Internet Explorer address bar after opening the report from the email and simply paste it into Chrome.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust. 

    @AppliedInfoNott and @Letxuga007

  • SQL Server Database connections in R

    Getting data into R

    One of the things I found most difficult when learning R was getting data from our SQL Servers into R to analyse. It is easy to load csv files, or pull example datasets from packages, but a little more knowledge is required to connect to external databases. I think this is a common problem for my NHS colleagues when learning R and probably for others too. This post is a brief introduction to the two main ways to pull data in to R: using RODBC and using dplyr’s ‘DBI-compliant’ database connections.
    I’ll be talking about connections with Microsoft SQL Server (over a local network), but this can also extend to other types of database by using different drivers, or other network set-ups using the right authentication.

    Where is the data stored?

    If you are using SQL Server to store your data, it is written into files on the database server and accessed using SQL scripts in your chosen interface (I use SQL Server Management Studio). Whilst the data are stored on disk on the SQL Server, R stores data in memory (RAM) on your machine. This has the advantage of quick access, but you can easily run out of memory with larger datasets, or processes that build larger matrices (like some types of statistical model). If memory is an issue, you will probably get the error message: Error: cannot allocate vector of size .... If you hit this situation, I’d recommend trying the data.table package as an alternative to using data.frames. It is much faster, and has a lower memory footprint. Here’s a great blog post about it from Megan Stodel: https://www.meganstodel.com/posts/data-table/

    Two common methods

    There are two common methods of connection, both of which use Open Database Connectivity (ODBC) drivers:

    1. The RODBC package.
    2. The DBI system, using dplyr, dbplyr and odbc.

    Both of these create a connection using a ‘connection string’ with the details of server, credentials, etc.; this is used to create a connection object, from which we can pull data into R or manipulate it in the database.

    A note on security

    To prevent publishing our server or database names in this post, I’ve used an alias that goes and looks them up internally. There are a few options for doing this in your own code, but I’ve added them to my ‘.Renviron’ file, as SERVER and DATABASE variables. My code looks them up each time using the Sys.getenv function that you’ll see in the examples below.

    This has meant using the paste function to concatenate the variables together in the RODBC example. You won’t have to do this in your own work if you replace the server and database names directly.
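    As a sketch of how this fits together (with made-up server and database names; in real use SERVER and DATABASE would live in your .Renviron file rather than being set in the script):

    ```r
    # Made-up values for illustration; normally these are set once in ~/.Renviron
    Sys.setenv(SERVER = "Cliff", DATABASE = "Richard")

    # paste() with sep = "" concatenates the pieces into one connection string
    connection_string <- paste('driver={SQL Server};server=', Sys.getenv("SERVER"),
                               ';database=', Sys.getenv("DATABASE"),
                               ';trusted_connection=true', sep = "")

    connection_string
    # "driver={SQL Server};server=Cliff;database=Richard;trusted_connection=true"
    ```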

    1. RODBC

    This is the simpler of the two interfaces, and uses slightly older code. It can be used to connect to anything that uses Open Database Connectivity (ODBC). I’ll define a connection string to a database server, a database, and a table called ‘MyTable’ that has some dummy data in it. If you haven’t got any of the packages used in this post, you can install them with, for example: install.packages("RODBC").

    library(RODBC)

    # Connection string
    RODBC_connection <- odbcDriverConnect(paste('driver={SQL Server};server=', Sys.getenv("SERVER"),
                                                ';database=', Sys.getenv("DATABASE"),
                                                ';trusted_connection=true', sep = ""))

    # e.g. with a server called "Cliff" and a database called "Richard" your string would be:
    # 'driver={SQL Server};server=Cliff;database=Richard;trusted_connection=true'

    # Load a whole table
    dt1 <- sqlFetch(channel = RODBC_connection, sqtable = "MyTable")

    # Load data from an SQL query
    dt2 <- sqlQuery(channel = RODBC_connection, query = "select TOP 100 * from MyTable")

    Quite straightforward to use! In the example above, I specified trusted_connection=true. In a Windows environment, this passes your Windows credentials to the server. Since we use these for access permissions on our SQL Servers, we can use this method with no issues. You can, instead, specify a username (uid) and a password (pwd): see the help files for more details, using: ?odbcDriverConnect.

    You can also use RODBC to write back to database tables, choosing whether or not to append your results using the append and safer arguments. Not appending means you will overwrite the table:

    sqlSave(channel = RODBC_connection, dat = dt2, tablename = "Mytable_version2", append = FALSE, safer = FALSE) 

    There are lots of other functions included with RODBC to allow you to see structures etc. The package vignette is a very helpful place to go for this, along with the help files.

    Remember to disconnect at the end of your session:

    odbcClose(RODBC_connection)

    If you do this a lot, you might find Gary Hutson’s recent post, showing how to wrap some of this into a function, a useful addition. Check it out here: http://hutsons-hacks.info/rodbc-helper-function.

    2. DBI and dplyr

    The RODBC interface is simple and quick, and you may not need to consider another approach, but I prefer to use the tidyverse functions linked to dplyr. These functions have matured over the last couple of years, and have a few major advantages:

    • Work with tidyverse functions, including dplyr verbs and the pipe %>%
    • Faster than RODBC to import data
    • Can be used to work with data in the database, without importing it into R.

    The connection string is slightly different, and we require a few more packages to make this work. You need to make sure you have the following installed:

    • dplyr – to make the tbl and use it, we’ll work with dplyr syntax.
    • DBI – a common Database Interface engine for use in S and R (see: https://cran.r-project.org/web/packages/DBI/vignettes/DBI-1.html)
    • dbplyr – this add-on package allows translation from dplyr to SQL.
    • odbc – provides the ODBC drivers, but you could use the functions below with other drivers instead.

    library(DBI)
    library(odbc)
    library(dplyr)
    library(dbplyr)

    DBI_Connection <- dbConnect(odbc(),
                                driver = "SQL Server",
                                server = Sys.getenv("SERVER"),
                                database = Sys.getenv("DATABASE"))
    Now we can define a table as if it was part of our R workspace, using the connection object and the name of the table in the database. We can then interact with it directly using dplyr. glimpse is a useful function that shows you the column names, data types and top few entries:

    MyTable <- tbl(DBI_Connection, "MyTable")

    glimpse(MyTable)

    ## Observations: ??
    ## Variables: 7
    ## Database: Microsoft SQL Server
    ## $ id <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
    ## $ Org <chr> "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "...
    ## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201...
    ## $ month <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
    ## $ Category_1 <dbl> 35395, 21126, 9248, 4049, 5323, 16867, 9916, 12404,...
    ## $ Cateogry_2 <dbl> 39293, 24860, 11031, 5812, 6876, 18475, 12976, 1657...
    ## $ events <int> 1986, 429, 460, 301, 289, 1172, 446, 756, 663, 874,...

    MyTable %>%
      filter(year == 2015) %>%
      group_by(month) %>%
      summarise(AvgEvents = mean(events),
                MaxEvents = max(events),
                N = n()) %>%
      arrange(month)
    ## # Source:     lazy query [?? x 4]
    ## # Database:   Microsoft SQL Server
    ## # Ordered by: month
    ##    month AvgEvents MaxEvents     N
    ## <int> <int> <int> <int>
    ## 1 4 624 1986 25
    ## 2 5 658 1948 25
    ## 3 6 671 2068 25
    ## 4 7 669 1879 25
    ## 5 8 630 1981 25
    ## 6 9 649 2011 25

    dplyr can then be used to do fairly complex things in just a few lines. The example below is not very well thought-out, but it takes data from the database into a summary plot in just a few lines. I’m filtering the data for 2015 and passing it directly into ggplot2. I’ve set a few options for a box plot, but it’s quite minimal, and our data has remained in the database and not been imported to R.


    library(ggplot2)

    MyTable %>%
      filter(year == 2015) %>%
      ggplot(aes(y = events, x = factor(month), group = factor(month))) +
      geom_boxplot(fill = "dodgerblue2", alpha = 0.6) +
      labs(title = "Monthly Distribution of Events", x = "Month", y = "Events")

    You can, of course, write an SQL query directly using the dbSendQuery function. This executes the query on the server-side only, so if you want the results to be returned back to R, you need to use dbFetch as well. You might need this approach if you are doing fancy queries, or things that are specific to a database environment that don’t yet have translators in dplyr.

    # Nested version:
    SomeRecords <- dbFetch(dbSendQuery(DBI_Connection, "Select TOP 100 * from MyTable"))

    # Or the same thing using the pipe:
    SomeRecords <- dbSendQuery(DBI_Connection, "Select TOP 100 * from MyTable") %>%
      dbFetch()
    You may not need to write a custom query for everyday use, but you are still likely to need to pull the data from the server into memory in R sometimes. For me, this is often to build models on it, as that isn’t supported in-database. You can use the collect function for this. For example, using part of the query from earlier as an example:

    MyTable_local <- MyTable %>%
      filter(year == 2015) %>%
      group_by(month) %>%
      summarise(AvgEvents = mean(events),
                MaxEvents = max(events),
                N = n()) %>%
      arrange(month) %>%
      collect()

    MyTable_local
    ## # A tibble: 6 x 4 
    ## month AvgEvents MaxEvents N
    ## <int> <int> <int> <int>
    ## 1 4 624 1986 25
    ## 2 5 658 1948 25
    ## 3 6 671 2068 25
    ## 4 7 669 1879 25
    ## 5 8 630 1981 25
    ## 6 9 649 2011 25

    You can also write back to a database using the dbWriteTable function. The following code snippet writes a new table to my current connection, called ‘NewDatabaseTable’ using the local data.frame called MyTable_local (that we created in the last section). The append option indicates whether to add to an existing table or not, and overwrite is just what it sounds like:

    dbWriteTable(DBI_Connection,"NewDatabaseTable", MyTable_local, overwrite=TRUE)


    Database connections are common in most analysis and data science settings. R can easily connect to relational databases, reading and writing with either RODBC or dplyr/DBI packages. I prefer the DBI route as it is faster, plugs into dplyr allowing ‘piping’ between functions, and allows us to work with data whilst it is still on the database server. This post just scratches the surface, but there are many more options for executing procedures, using different schema etc. This post has been targeted at working with Microsoft SQL Server, but these processes will work well with other databases by switching to the right drivers.

    This blog was written by Chris Mainey, Intelligence Analyst from the Health Informatics team at University Hospitals Birmingham NHS Foundation Trust. The blog was originally posted here: https://mainard.co.uk/post/database-connections-in-r/