• Our first ever NHS-R webinar!

    We ran our first ever NHS-R webinar on Wednesday 19th February and it went down a storm! Chris Mainey facilitated a session on Database Connections in R which was attended by a peak audience of 72 individuals.

    The webinar began with some Mentimeter questions to get to know more about who was on the line. After an icebreaker question revealed that the majority of people’s favourite TV series is Breaking Bad, we found out that (of those who answered) approximately 25% were Analysts, 40% were Senior Analysts, 15% were Intelligence/Analytical Leads and 20% were in some other role. There was also a broad selection of organisations represented on the line, including approximately 30% from provider trusts, 20% from CSUs, 15% from local authorities and 5% from CCGs. Just over 30% were from other organisations and it would be interesting to delve more deeply into where these individuals are from.

    We also asked about people’s experience of the topic and what they were hoping to get out of the session. When asked about their current level of understanding of database connections in R, the average score of those who answered was 1.4/5, suggesting that this was a relatively new topic for most individuals. Regarding what individuals wanted to get out of the session, people wanted to gain a basic understanding of how to make database connections in R, learn how to make SQL connections and write temp tables, and pick up tips, tricks and best practice on how database connections can be made.

    Chris then began by explaining the fundamental elements of SQL (Structured Query Language) before highlighting the two common methods for creating database connections – the RODBC package and the DBI system. Both can be used to create a connection object through which data can be manipulated in, or transferred into, R.

    Chris firstly went into more detail about the RODBC package, showing code for creating connections. He then explored DBI in more detail, including: making connections, how SQL can be used with DBI, writing to databases, using tables in the database, constructing dplyr queries and using SQL and returning data into R. He ended his webinar by taking us through an example script in R, which was a great way of putting the learning in context, before giving participants the opportunity to ask more specific questions about database connections in R.
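    For a flavour of what was covered, here is a minimal sketch of the two approaches; the driver, server, database, table and column names are placeholders and will differ in your own environment (the dplyr part also needs the dbplyr package installed):

    # RODBC
    library(RODBC)
    rodbc_con <- odbcDriverConnect(
      "driver={SQL Server};server=my_server;database=my_database;trusted_connection=true")
    sqlQuery(rodbc_con, "SELECT TOP 10 * FROM my_table")  # run SQL and return a data frame
    odbcClose(rodbc_con)

    # DBI + odbc, with dplyr building the SQL for you
    library(DBI)
    library(odbc)
    library(dplyr)
    con <- dbConnect(odbc(),
                     Driver = "SQL Server",
                     Server = "my_server",
                     Database = "my_database",
                     Trusted_Connection = "Yes")
    dbGetQuery(con, "SELECT TOP 10 * FROM my_table")  # run SQL directly
    tbl(con, "my_table") %>%                          # or build a lazy dplyr query
      count(some_column) %>%
      collect()                                       # bring the result back into R
    dbDisconnect(con)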

    We obtained some fantastic feedback about the webinar. The top 3 words that participants used to describe the webinar were “useful”, “interesting” and “clear”. Moreover, the average ratings that participants gave for satisfaction with the webinar, whether they would recommend it to others, its relevance in helping them to achieve work goals, and how much it increased their understanding of database connections in R were all between 4 and 5 out of 5.

    We also wanted to understand what other webinar topics participants may be interested in, in future. Shiny, RMarkdown, ggplot2 and time series analysis were some of the most popular suggestions.

    Finally, we would like to thank Chris Mainey for doing a fantastic webinar and to our participants for tuning in, being engaged and asking some great questions! The material from the webinar is available on GitHub here and a recording of the webinar can be found on our NHS-R Community website here.

    Moreover, we are planning to run topical webinars on the third Wednesday of each month between 1-2pm. Our next webinar is taking place on Wednesday 18th March from 1-2pm on “Functional programming with purrr”, led by Tom Jemmett. Keep your eyes peeled on our Twitter page and website for the link to sign up.

    If you are interested in being part of the NHS-R community, please join our Slack channel here to keep up to date with the goings-on and the tips and tricks that people have for using R in health and care. Use the #webinars channel to suggest topics for future webinar sessions, or to put your name forward to run your own!

  • The first ever train the trainer class 2019

    The Strategy Unit, AphA (Association of Professional Healthcare Analysts) and NHS-R joined forces to deliver the first ever train the trainer (TTT) class on 11th-12th December 2019 in Birmingham. We allocated spaces for two AphA members per branch and in the end we had 18 people from across the UK NHS, including clinical and non-clinical staff working in commissioner, provider and national NHS bodies.

    As this was the first ever TTT course, we were quite unsure how things would turn out, especially because teaching R is not easy. One key issue is that trainers have to think about the background of novices to R – this can range from no programming experience at all to years of SQL experience. Two other challenges with R are that there are many ways of doing things (e.g. using base R vs tidyverse) and installation can be tricky for some.

    To address these challenges, we designed the course to place considerable emphasis on evidence-based teaching practice and plenty of learning by doing and sharing, underpinned by the use of Liberating Structures techniques to capture the voice of delegates. The aim was to give delegates the skills, confidence and capability to deliver an introduction to R for those in healthcare, based on a field-tested slide deck.

    The class was fabulous – they were engaged and enthusiastic, learned a great deal from each other and the facilitators and made great progress – https://twitter.com/NHSrCommunity/status/1205165380039299076/photo/1. Indeed it was inspiring to see how NHS staff can work effectively together, despite being from different organisations. Our first class have self-organised into an action learning set who will: train others, learn from each other, feedback their learning into the NHS-R Community, and have an annual TTT Class of 2019 reunion!

    Whilst there are areas where we would refine the TTT, we were pleased with the first iteration and feel confident that we have a committed and capable set of trainers who can deliver the Introduction to R for those in healthcare at locations near you.

    If you are interested in hosting an Introduction to R training session or being involved in future iterations of the TTT, please get in touch via nhs.rcommunity@nhs.net and we will see what can be done (no promises).

    This blog was written by Professor Mohammed A Mohammed, Principal Consultant at the Strategy Unit

  • A new kid on the NHS-R block

    My journey into data science took a baby step a few years ago when I learned SQL during my global health work overseas. A dearth of good data and data sets is every global health worker’s nightmare.

    Realising that R is the best thing since sliced bread in data science, I decided to give it a go. My initial introduction was aided by my daughter, who volunteered to teach me during her summer break from UCLA. That was a terrible idea. We didn’t speak to each other for days. (It was worse than learning to drive from your dad!) My self-esteem took a nosedive.

    I searched through various resources such as YouTube, MOOCs, and edX, which only confused me further. Then I stumbled across the NHS-R Community website and voilà!

    No more learning about irises, motorcars in the USA, or global flight statuses. The NHS-R Community provided relevant teaching materials presented in the simplest possible way. The gapminder dataset is a godsend for any global health worker. I enthusiastically attended all the R workshops I could. R was becoming an addiction.

    Last week, I attended the NHS-R Community conference in Birmingham. I am pleased to say this conference was worthy of the investment of my time, money and energy. All the ‘biggies’ in R were there to facilitate you through your anxieties (although, sadly, no Hadley Wickham!).

    Meeting absolute novices was a great boost for my previously dented self-esteem. Helping these novices made me realise how it is actually possible to learn more by teaching and by finding ways to explain concepts for others to understand.

    My biggest achievement is converting my husband (a senior consultant in the NHS) to R. He went from rolling his eyes at my NHS-R-venture in early summer to signing up to NHS-R community this week: a long stride in a short time. I have managed to corrupt his hard drive with R. Let’s see how far he is willing to go!

    I have come across quite a few reluctant R-ists like my husband in the NHS. So now the question is: how do we make a success of converting people at various stages of their careers into using a wonderfully useful open-source resource like R?

    Melanie Franklin, author of ‘Roadmap to Agile Change Management’, puts across the core principle of creating small, incremental changes to be implemented as soon as possible so that organisations can realise benefits sooner rather than later and achieve rapid return on investment. This is precisely what the NHS-R Community can do.

    An understanding of behaviours and attitudes is key for change implementation. When a change is sought in an organisation, there is a period of ‘unknown’ amongst the staff. Questions are asked. Do we learn completely new things? Do we unlearn what we had learned before? Will this change make us redundant? This anxiety can be age-related. While younger members are ready to embrace change, senior staff may resist change and become an obstacle. Encouraging people at every level to participate in taking that initial step is imperative, as is encouraging them to express their anxieties. This approach should not just be top-down or bottom-up. Making a mistake is part of the process, and creating a no-blame culture within an organisation is essential, as is building an environment that supports exploration and rewards and celebrates participation in new experiences.

    Health data science is a mélange de compétences – a blend of skills – not just a group of statisticians and data analysts. A data analyst will no doubt do a great job in analysing your data, but will they ask the right questions in the right context in an environment like the NHS? This space belongs to the health care workforce. As Mohammed rightly pointed out in Birmingham: ‘Facilitate an arranged marriage between two and hope for a happy union’. Lack of time to practise is one of the biggest challenges that many of my fellow learners cited. Unfortunately, there is no simple solution. Just keep trudging along and you will get through this steep learning process.

    Finally, I would like to propose a few suggestions to the NHS-R Community (if they are not already being thought about).

    • Streamlining the training.

    Many groups up and down the country are providing training in R, but these fantastic training opportunities are few and far between. The NHS-R Community is well placed to ensure that training programmes become more frequent and streamlined.

    • Standardising the training.

    The diverse levels of these training programmes mean varied levels of R competence. The NHS-R Community could set up levels of competence (Levels I to III, for instance), and people could advance their level in subsequent workshops, which may motivate them to find time to practise. An online competency exam for each level may be another way to build the capacity of the NHS workforce.

    • Accreditation and validation of R skills. This is an eventuality just like any skill in health care.

    Happy R-ing!!

    I am hoping to up my game to assist overseas researchers who don’t have the means and mentors to learn R. There are many synthetic datasets available for them to learn and practise with. I will continue learning, as there is no end to learning new things.

    Nighat Khan is a global ehealth worker based in Edinburgh

  • But this worked the last time I ran it!

    A common scenario: you are new to a programming language, chuffed that you’ve done a few bits and progressing well. A colleague hands you their code to run a regular report. That’s fine, you’ve got a smattering of understanding, and this is existing code, what could go wrong? But the code doesn’t run. You check through it but nothing looks obviously out of place and then your colleague adds to your utter confusion by saying it works on their machine.

    Another scenario: you are new to a programming language and have written some code. It’s been working surprisingly well and you feel very pleased with your progress but then, suddenly, it fails. You stare at the computer, stunned. It worked the last time you ran it! You’ve made no changes, and yet lots of red errors now appear, and these errors are, quite frankly, utterly baffling; even googling them turns up all manner of strange discussions which might as well be in Old English (sort of familiar but you have no idea what it’s saying).

    The solution: well, there won’t necessarily be just one solution, but here are a few things I’ve picked up in my early days using RStudio. I’m still in those early days and I probably haven’t encountered all that could possibly go wrong, so please add to these and definitely comment if I’ve made any glaring errors!

    “My colleague can run this script on this very same computer but I can’t! What’s that all about?”

    RStudio allows you to install and run your own packages (if your network allows) and that’s really useful when you just want to try something out, follow up on a recommendation or install something a training course requires. Given our strict network and IT installations this is quite a liberating experience!

    But what isn’t apparent when you are merrily installing packages is that they are installed to your own folder, so on a shared network they may not be accessible to a colleague. Step one in solving the problem is to check that the package is installed on your machine and under your profile.
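    If it helps, you can check this quickly from the console; this is just a sketch using base R functions, with "dplyr" as an example package:

    .libPaths()                                  # the library folders R is looking in
    "dplyr" %in% rownames(installed.packages())  # TRUE if dplyr is installed for you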

    You may now be familiar with:

    install.packages("dplyr") to install and then library(dplyr)

    but consider using

    if (!require(dplyr)) install.packages("dplyr")

    so R will install this package if it is missing (you may still need a library(dplyr) call afterwards to attach it on that first run). This is very useful when sharing R scripts, as they can just be run with no typing by the recipient.

    “I ran this code the other day and it was fine and now I get errors – but I haven’t done anything!”

    This happened recently to a colleague and prompted me to write this blog, because I thought: this is probably the kind of thing that happens all of the time, and if no one tells you this, how could you know? Well, there is Google, but it’s too much to type “I ran this code the other day and it now gives me an error…”.

    My colleague’s code had been working, she hadn’t made any changes, but one day it just failed. It wasn’t as if she ran it and then ran it again seconds later and it failed; this was run-it-one-day and it works, run-it-the-next-day and it fails. She asked if I could help and all I could think of was to run it myself. Not exactly expert support, I thought – a bit like calling IT with a computer problem and being asked if you’ve rebooted your machine, though a bit more advanced than “have you switched it on?”. Something strange happened when I ran it: it worked.

    Just as a plug for another blogger, she was recreating the plot from this blog:

    https://www.johnmackintosh.com/2017-12-21-flow/

    https://github.com/johnmackintosh/RowOfDots

    This was puzzling, but I had a faint recollection from other R people’s stories that you should keep your packages up to date. One course had even talked about updating these regularly and had, thankfully, shown us what to do.

    Packages are regularly being updated – a few tweaks here and there, I guess. Plus, many are built on other packages (like dplyr and ggplot2), so if those are updated then it’s like a domino effect. RStudio is nicely set up so you don’t have to go to the internet to find out individually what you need to update: just go to ‘Packages’ in the bottom right hand panel of RStudio and select ‘Update’ (the button with a green circle and an arrow), and it brings up a list of what needs updating.
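    If you prefer the console, the same check and update can be done with base R functions (a sketch; both are standard utils functions):

    old.packages()               # list installed packages that have newer versions available
    update.packages(ask = FALSE) # update them all without prompting for each one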

    If you’ve not done this for a while you may have quite a few updates!


    Eagle-eyed readers may recognise this Public Health package in the update list. If not, check it out!

    If you are like me, you may have installed some packages that you now rarely use and have no idea what they are. When updating, R may ask the following in the console (bottom left of the screen):

    Do you want to install from sources the packages which need compilation?

    This prompt appears so that the package can be updated by building it from source on your computer. I’ve got a couple of packages for which I have tried this, but each time I go to check for updates they still request an update, so I just say no and fly through the other updates.
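    If you would rather avoid building from source altogether, one option (a sketch, and only relevant on Windows and macOS where pre-built binaries are available) is to ask for binaries only:

    update.packages(ask = FALSE, type = "binary") # prefer pre-built binaries over source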

    ################################################

    Finally, a bit of a vague warning, as I don’t fully understand this part, but I once updated packages after I’d run a couple of scripts. This meant that a couple of the packages that needed updating were already loaded, and so things got a bit muddled. I’m not entirely sure if this is a problem, but I now close all projects and code and start a fresh RStudio session to do updates.

    ################################################

    This blog was written by Zoe Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Foundation Trust.

  • Text Mining – Term Frequency analysis and Word Cloud creation

    Text Mining – Term Frequency analysis and Word Cloud creation in R

    Analysing the pre-conference workshop sentiments

    Loading in the required packages:

    install_or_load_pack <- function(pack){
      create.pkg <- pack[!(pack %in% installed.packages()[, "Package"])]
      if (length(create.pkg))
        install.packages(create.pkg, dependencies = TRUE)
      sapply(pack, require, character.only = TRUE)
    }

    packages <- c("ggplot2", "tidyverse", "data.table", "wordcloud", "tm", "wordcloud2",
                  "scales", "tidytext", "devtools", "twitteR", "caret", "magrittr",
                  "RColorBrewer", "ggdendro", "tidyr", "topicmodels", "SnowballC", "gtools")

    install_or_load_pack(packages)

    This function was covered in a previous blog post: https://momsite.co.uk/blog/a-simple-function-to-install-and-load-packages-in-r/.

    Here I specify that I want to load the main packages for dealing with sentiment and discourse analysis in R. Libraries such as tm, wordcloud and wordcloud2 are loaded for working with this type of data.

    Choosing the file to import

    The file we have to import is a prepared CSV file and, instead of hard-coding the path to load the file from, I simply use:

    path <- choose_files() # opens a file chooser dialog and returns the selected path

    This is a special function which allows you to open a file selection dialog from R.

    From this dialog I select the CSV file I want to import. Once I have selected the CSV and hit Open, the path variable will be filled with the location of the file to work with.

    Creating the R Data Frame

    To create the data frame I can now pass the variable path to the read_csv command:

    workshop_sentiment <- read_csv(path, col_names = TRUE)

    This will read the textual data from the workshops into a data frame with two columns. The first relates to what the attendees enjoyed about the workshop and the second relates to improvements that could be made.

    Separate the master data frame

    The master data frame now needs to be separated into two data frames, as the text analysis requires a single column with one row per sentence. Here I use magrittr to divide it into two new data frames:

    ws_highlights <- workshop_sentiment %>%
      .[, 1]

    # Copy for improvements
    ws_improvements <- workshop_sentiment %>%
      .[, 2]

    The ws_highlights data frame uses the first column and  the ws_improvements data frame uses the second.

    Function to create textual corpus

    As I want to replicate this for both highlights and improvements, I have created a function that could be reused in any text analysis to create what is known as a text corpus (see: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf). This creates a series of documents – in our case, sentences.

    corpus_tm <- function(x){
      library(tm)
      corpus_tm <- Corpus(VectorSource(x))
    }

    This function allows you to pass in any data frame (via the x parameter) and creates a corpus from it. The VectorSource() function treats each element of the vector passed in as a separate document in the corpus.

    Create Corpus for Highlights and Improvements data frame

    Now the function has been created, I can simply pass the two separate data frames I created before to create two corpora:

    corpus_positive <- corpus_tm(ws_highlights$Highlights)
    corpus_improvements <- corpus_tm(ws_improvements$Improvements)

    The code block above creates a corpus for the positive (highlights) data frame and an improvements corpus. These will appear as corpus objects in your environment.

    Function to clean data in the corpus

    The most common cleaning tasks when working with text data are removing things like punctuation, common English words (stop words) and so on. This is something I have to repeat multiple times when dealing with discourse analysis:

    clean_corpus <- function(corpus_to_use){
      library(magrittr)
      library(tm)
      corpus_to_use %>%
        tm_map(removePunctuation) %>%
        tm_map(stripWhitespace) %>%
        tm_map(content_transformer(function(x) iconv(x, to = 'UTF-8', sub = 'byte'))) %>%
        tm_map(removeNumbers) %>%
        tm_map(removeWords, stopwords("en")) %>%
        tm_map(content_transformer(tolower)) %>%
        tm_map(removeWords, c("etc", "ie", "eg", stopwords("english")))
    }

    The function takes the corpus object created previously as its parameter and pipes it through a series of steps to:

    • Remove punctuation
    • Strip out extra whitespace between words
    • Convert the underlying encoding of the text to UTF-8
    • Remove numbers
    • Remove common English words (stop words)
    • Convert the text to lower case
    • Remove a custom vector of words to deal with things like "etc", "ie" and "eg"

    To clean the corpus objects I simply pass the original corpus objects back through this function to perform cleaning:

    corpus_positive <- clean_corpus(corpus_positive)
    corpus_improvements <- clean_corpus(corpus_improvements)

    Inspecting one of the corpora confirms that it has been successfully cleaned.

    Create a TermDocumentMatrix to obtain frequent terms

    The term document matrix (explained well here: https://www.youtube.com/watch?v=dE10fBCDWQc) can be built from the corpus and used to identify frequent terms. However, more code is needed to do this:

    find_freq_terms_fun <- function(corpus_in){
      doc_term_mat <- TermDocumentMatrix(corpus_in)
      freq_terms <- findFreqTerms(doc_term_mat)[1:max(doc_term_mat$nrow)]
      terms_grouped <-
        doc_term_mat[freq_terms, ] %>%
        as.matrix() %>%
        rowSums() %>%
        data.frame(Term = freq_terms, Frequency = .) %>%
        arrange(desc(Frequency)) %>%
        mutate(prop_term_to_total_terms = Frequency / nrow(.))
      return(data.frame(terms_grouped))
    }

    This function needs some explanation. Its single parameter is the corpus you pass in; the doc_term_mat variable is then created using tm’s TermDocumentMatrix() function.

    Next, I use the findFreqTerms function and take the terms from the first entry up to the maximum number of rows in the matrix. These functions are the powerhouses of the routine, as they identify how many times a word has been used across all the rows of text.

    The terms_grouped variable then slices the term matrix by the frequent terms; this is converted to a matrix and the row sums are calculated, i.e. the number of times each word appears. Then a data frame of the terms is created with the headings Term and Frequency.

    Next, we use the power of dplyr to arrange the rows by descending frequency and to add a mutated column to the data frame that calculates the proportion of each specific term over all terms. The return(data.frame(terms_grouped)) then forces R to return the results of the function.

    I then pass my data frames (highlights and improvements) to the function I have just created to see if this method works:

    positive_freq_terms <- data.frame(find_freq_terms_fun(corpus_positive))
    improvement_freq_terms <- data.frame(find_freq_terms_fun(corpus_improvements))

    These will be built as data frames and can be viewed in RStudio’s Environment pane.

    This has worked just as expected. You could now use ggplot2 to produce a bar chart / pareto chart of the terms.
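    As a quick sketch of that idea (assuming the positive_freq_terms data frame has the Term and Frequency columns created above):

    library(ggplot2)
    top_terms <- head(positive_freq_terms, 10)   # ten most frequent terms
    ggplot(top_terms, aes(x = reorder(Term, Frequency), y = Frequency)) +
      geom_col() +
      coord_flip() +
      labs(x = "Term", y = "Frequency", title = "Most frequent terms")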

     Create a Word Cloud with the wordcloud2 package

    R has a wordcloud package that produces relatively nice looking word clouds, but wordcloud2 surpasses it in terms of visualisation. Using the function is easy now that I have the frequent terms data frames – for the highlights data frame it can be implemented with the syntax below:

    wordcloud2(positive_freq_terms[, 1:2],
               shape = "pentagon",
               color = "random-dark")

    To use the function I pass the data frame, using only the term and frequency fields for the visualisation. There are a number of options, and these can be viewed with help("wordcloud2"). Here I use the shape and color parameters to set the display of the word cloud.

    This can be exported from the Viewer window by using the Export function.

    This word cloud relates to the pre-conference workshop. I personally thought the NHS-R conference was amazing and I was honoured to have a spot to speak amongst so many other brilliant R users.

    R is so versatile – every day is like a school day when you are learning it, but what a journey.

  • Diverging Bar Charts – Plotting Variance with ggplot2

    Diverging Bar Charts

    The aim here is to create a diverging bar chart that shows variance above and below an average line. In this example I will use Z Scores to calculate the variance, in terms of standard deviations, as a diverging bar. This example will use the mtcars stock dataset, as most of the data I deal with day-to-day is patient sensitive.

    Data preparation

    The code below sets up the plotting libraries, attaches the data and sets a theme:

    library(ggplot2)
    theme_set(theme_classic())
    data("mtcars") # load data

    Next, we will change some of the columns in the data frame and perform some calculations on the data frame:

    mtcars$CarBrand <- rownames(mtcars) # Create new column for car brands and names

    As the comment says, this line uses dollar sign notation on the existing mtcars data frame (i.e. to add, or refer to, a column) to create a column called CarBrand, and assigns (<-) the row names of the data frame to it. This is obviously predicated on the data frame having row names; otherwise you would have to set them first using rownames().

    Adding a Z Score calculation

    A z score takes an observation x, subtracts the mean and divides by the standard deviation: z = (x − mean) / sd. The link shows the mathematics behind this, for anyone who is interested.

    The following code shows how we would implement this score:

    mtcars$mpg_z_score <- round((mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg), digits = 2)

    The statistics behind the calculation have already been explained, but I have also used the round() function to round the results to 2 decimal places.

    Creating a cut off (above/below mean)

    The next step is to use conditional (Boolean) logic – named after one of my heroes, George Boole – to check whether the Z score I have just created is greater or less than 0:

    mtcars$mpg_type <- ifelse(mtcars$mpg_z_score < 0, "below", "above")

    The ifelse() call checks whether the Z score is below 0; if so, the row is tagged as "below" (average), otherwise as "above".

    The next two steps are to convert the Car Brand into a unique factor and to sort by the Z Score calculations:

    mtcars <- mtcars[order(mtcars$mpg_z_score), ] # Ascending sort on Z Score
    mtcars$CarBrand <- factor(mtcars$CarBrand, levels = mtcars$CarBrand)

    Now, I have everything I need to start to compute the plot. Great stuff, so let’s get plotting.

    Creating the plot

    First, I will start with creating the base plot:

    ggplot(mtcars, aes(x = CarBrand, y = mpg_z_score, label = mpg_z_score))

    Here, I pass in the mtcars data frame and set the aesthetics layer (aes) of the x axis to the brand of car (CarBrand). The y axis is the Z score I created for miles per gallon (mpg) and the label is also set to the z score.

    Next, I will add on the geom_bar geometry:

    ggplot(mtcars, aes(x = CarBrand, y = mpg_z_score, label = mpg_z_score)) +
      geom_bar(stat = "identity", aes(fill = mpg_type), width = 0.5)

    This tells geom_bar to use the mpg_z_score values directly by setting the stat = "identity" option. If this were not added, it would simply count the number of times each Car Brand appears as a frequency count (not what I want!). Then I set the fill of the bar according to whether the value falls above or below 0 – remember we created a field called mpg_type in the data preparation stage to store this. The last parameter, width, controls the width of the bars.

    Next:

    ggplot(mtcars, aes(x = CarBrand, y = mpg_z_score, label = mpg_z_score)) +
      geom_bar(stat = "identity", aes(fill = mpg_type), width = 0.5) +
      scale_fill_manual(name = "Mileage (deviation)",
                        labels = c("Above Average", "Below Average"),
                        values = c("above" = "#00ba38", "below" = "#0b8fd3"))

    I use the scale_fill_manual() ggplot option to add a name for the legend, specify the label names using the combine (c()) function, and map the "above" and "below" values to different hex colour codes. I have chosen blue and green rather than the usual red and green, with colour-vision accessibility in mind. We are nearly there; the final step is:

    ggplot(mtcars, aes(x = CarBrand, y = mpg_z_score, label = mpg_z_score)) +
      geom_bar(stat = "identity", aes(fill = mpg_type), width = 0.5) +
      scale_fill_manual(name = "Mileage (deviation)",
                        labels = c("Above Average", "Below Average"),
                        values = c("above" = "#00ba38", "below" = "#0b8fd3")) +
      labs(subtitle = "Z score (normalised) mileage for 'mtcars'",
           title = "Diverging Bar Plot (ggplot2)", caption = "Produced by Gary Hutson") +
      coord_flip()

    Here, I have added the labs layer on to the plot. This is a way to label your plots to show more meaningful values than would be included by default. So, within labs I use subtitle, title and caption to add labels to the chart. Finally, the important command is to add the coord_flip() command to the chart – without this you would have vertical bars instead of horizontal. I think this type of chart looks better horizontal, thus the reason for the inclusion of the command.

    The final chart shows each car brand as a horizontal diverging bar, coloured by whether its mileage is above or below the average.

    This blog was written by Gary Hutson, Principal Analyst, Activity & Access Team, Information & Insight at Nottingham University Hospitals NHS Trust, and was originally posted here.

  • Why Government needs sustainable software too

    Unlike most of the 2017/2018 cohort, when I applied to become a fellow of the Software Sustainability Institute, I was a civil servant rather than an academic. In this blog post I want to talk about why Government needs sustainable software, the work being done to deliver it, and the lessons we learnt after the first year.

    But Government already has sustainable software…

    There’s quite a bit of disambiguation that needs to be done to the statement ‘Government needs sustainable software’. In fact, Government already has sustainable software, and lots of it. One need only look at alphagov, the GitHub organisation for the Government Digital Service. Sustainable, often open source, software is alive and well here, written by professional software developers, and in many other places in central and local Government alike. But this isn’t the whole story.

    There are other parts of Government that write software, but like many in academia, you may have a hard time convincing them of this fact. In central Government (this is where my experience lies, so I will focus largely upon it) there are literally thousands of statisticians, operational researchers, social researchers, economists, scientists, and engineers. Any one of these may be writing code in a variety of languages in the course of their daily work, but don’t identify as software developers. It’s among these professions that there are tasks that will look most familiar to the academic researcher. Government statisticians in particular are tasked with producing periodic publications which incorporate data, visualisations, and analyses, much like academic outputs.

    So in this blog post, I’m really talking about bespoke software that is used to create Government statistical publications.

    Government produces a lot of statistics

    A quick browse of GOV.UK and we can see the wide range of statistics produced: there are monthly statistics on cattle and pig meat production from the Department for Environment, Food & Rural Affairs (DEFRA); search and rescue helicopter statistics from the Department for Transport and the Maritime and Coastguard Agency; and combat aircraft statistics in Afghanistan from the Ministry of Defence.

    In fact, at time of writing, there are over 16,000 statistical publications published on GOV.UK, and very likely more published elsewhere. If you clicked on the links above, you will notice that there is a lot of variety among the publications that reflects the diversity of the organisations that produce them. Different departments have differing technology, different levels of technical expertise, and different aims. A publication can be produced by an automated pipeline incorporating version control and continuous integration/deployment; but much more likely it is produced with manual processes that are error prone, time consuming, and difficult to document. However it is done, these publications are increasingly being produced with code in one language or another, be it Stata, SPSS, SAS, R, or Python.

    Why sustainability is so important for Government

    The reasons for sustainability in academic publications have been well documented by the Software Sustainability Institute, but I would argue that it is even more important that Government writes reproducible and sustainable software for its statistical publications. Here’s why:

    The outputs really matter

    I don’t want to downplay the importance of research outputs: publishing accurate science is critical to advancing human knowledge. What is different about research is that there is rarely a single source of truth. If a research group publishes a groundbreaking finding, we all take notice; but we don’t trust the findings until they have been replicated, preferably by several other groups.

    It’s not like that in Government. If a Government department publishes a statistic, in many cases that is the single source of truth, so it is critical that the statistics are both timely and accurate.

    Publications are often produced by multiple people

    The second way that Government statistical publications differ from academic scientific publications is that they are often produced by a team of people that is regularly changing. This means that even at the point that it is being produced it needs to be easy for another member of the team to pick up the work and run with it. If someone goes on holiday, or is sick at the critical moment, their colleagues need to be able to pick up from where they left off immediately, and understand all the idiosyncrasies perfectly. The knowledge simply cannot rest in one person’s head.

    More than that, since publications are often periodic (e.g. monthly, or annual) and analysts typically change role once a year, the work will very likely need to be handed off to someone new on a regular basis. It is essential therefore that these processes are well documented, and that the code that is being handed over works as expected.

    The taxpayer pays for it

    Obviously, the longer it takes a team of statisticians to produce a statistical report in Government, the more it costs the taxpayer, and all Government departments have an interest in being efficient and reducing unnecessary waste.

    Additionally, since Government statistical publications are paid for by the public, where possible Government should be open and publish its workings. Coding in the open is already an important part of the culture among digital professions, adopting sustainable software practices allows statistical publications to be produced with the same openness.

    Working towards sustainability

    I started working in Government as a Data Scientist after doing a PhD and post-doc in environmental science. I’d attended two Software Carpentry workshops during this time, and wrote my PhD in LaTeX and R. On joining Government it was clear that we could apply some of these lessons to improve the reporting workflow in Government.

    Working with the Department for Digital, Culture, Media, and Sport (DCMS), we had a first attempt at implementing a reproducible workflow for a statistical publication that was being produced with manual processes using a number of spreadsheets, a statistical computing package, and a word processor. We used RMarkdown to rewrite the publication, and abstracted the logic into an R package freely available on GitHub, complete with continuous integration from Travis and AppVeyor.

    In March of 2017 we published this work in a blog post, and worked hard to publicise this work with a large number of presentations and demonstrations to other Government departments. The prototype generated lots of interest; in particular an initial estimate that it could save 75% of the time taken to produce the same publication using the old methods.

    By November we blogged again about successful trials of this approach in two further departments: the Ministry of Justice (MoJ) and the Department for Education (DfE). We also produced a GitBook describing the various steps in more detail. Most of this is sensible software development practice, but it’s something that many Government analysts have not done before.

    By the end of the year, the ideas had gained enough traction in the Government statistical community, that the Director General for the Office of Statistics Regulation (the body responsible for ensuring quality among official statistics) reported that this work was his favourite innovation of the year, although he wasn’t so keen on the name!

    Work continues to bring these techniques to a wider audience. There’s now a free online course built by one of my former colleagues to help civil servants get started, and a number of departments, particularly the MoJ are making great strides to incorporate these techniques into their workflows.

    Lessons learnt

    A year or so after we set out with the intention of bringing sustainable software to Government statisticians, here are some of the lessons that I would like to share.

    Reproducibility is technical, sustainability is social

    We called the first prototype a ‘Reproducible Analytical Pipeline’ and the acronym ‘RAP’ has stuck. On reflection this is not a very good name, because it belies the main difficulty in transitioning from manual workflows into something more automated: making it sustainable. It’s all very well creating beautiful, abstracted, replicable data workflows, but they are completely useless if no one knows how to use them, or how to update them. That situation is more dangerous than the manual workflows that exist in many places at present, because at least the barrier to entry for tortuous manual processes is lower: you don’t need to know how to program to interpret a complicated spreadsheet, you just need a lot of patience.

    What this move from manual to automated implies is a recognition of the need for specialists; organisations will need to recruit specialists, make use of the ones they already have, and upskill other staff. This is a challenge that all organisations will need to rise to if they are to make these new methods stick.

    This is likely to be less of a problem for academia, where within certain fields there is already an expectation that researchers will be able to use particular tools, and there may be more time to develop expertise away from operational pressures. However, there also exists a powerful disincentive: because journal articles are closer to ‘one off’ than a periodic report, it is less critical that researchers leave the code behind a paper in a good state, as they may never need to come back to it again.

    Senior buy-in is critical

    In just over a year, we went from seeing an opportunity to scaling the idea across a number of Government departments, traditionally very conservative organisations. Getting the buy-in of senior officials was absolutely critical in our ability to get the idea accepted.

    It’s important to realise early that senior managers are often interested in very different things to the users of the software, so messages need to be targeted to gain traction with the right audience. For instance, an incentive for managers in academia might be mitigating the risk of errors that could lead to retraction, rather than the expectation of cost savings.

    Time is money

    One of the reasons that we managed to make a big impact quickly is because Government departments are always keen to reduce costs. If a publication takes a team of four people a few weeks to produce, the cost quickly adds up. This is a feature of Government (and indeed industry) which is not shared by academia. Yes, it matters that work is delivered on time, but in my experience researcher time is a much more elastic resource. I was much more likely to work all evening or over the weekend as a PhD student or post doctoral researcher than I was as a civil servant; it was almost an expectation. For this reason, the financial imperative seems to be a much less powerful incentive in academia.

    It’s not all about the code

    Notwithstanding my comments about sustainability, it is important to note that reproducibility does not stop with reproducible code. We also need to worry about the data, and the environment. The former is particularly difficult in a Government setting, as one department often relies on another to provide the data, meaning that there is a less clear route to source than many academics enjoy. There are important initiatives underway in Government, such as GOV.UK Registers, which oversees the development of canonical lists of important information critical to the running of the country. Not all data can be treated in this way, and whilst taking snapshots of data may be a blunt instrument, it works when you don’t have control of where it comes from.

    Call to arms

    Almost all the projects I have referred to in this blog post are open source, and available on GitHub, so follow the links above if you are interested. There’s also two presentations on the topic available as slides (Earl conference 2017 and Government Statistical Service conference 2017) which give more technical details on the projects.

    This blog is written by Matthew Upson, Data Scientist at Juro and was originally posted on the Software Sustainability website.