• NHS Number Validation

    NHS Numbers – the backbone of the NHS

    NHS Numbers are the best. They follow you through life, from birth to death, changing only in certain situations, and their validity can be checked from the numbers themselves.

    What? You can check to see if a number is a genuine NHS Number without any other information?!

    Now, I’ve worked in the NHS for more years than I care to count and I never realised there was an algorithm to these numbers. I stumbled across this fact through Twitter of all places. Yes, Twitter, the place you’ll find NHS-R enthusiasts tweeting things like this:

    https://twitter.com/sellorm/status/1171858506057682944 which I saw retweeted by @HighlandR (John MacKintosh).

    I’ve only just started doing things on GitHub, and @Sellorm wasn’t quite right to suggest there may be millions of packages: in fact there are only 2 for R and, surprisingly, not a single one in SQL.

    I installed the package, had a quick play around and looked at the code on GitHub. The downside of this package, for me, was that you need to feed in your data as vectors and I like using data tables. Plus, I didn’t necessarily want a list of outputs like TRUE, TRUE, FALSE, TRUE; I wanted to see the NHS numbers that aren’t valid. Still, I wouldn’t have got so far, so quickly, without @Sellorm’s code, excellent notes and thoroughly written Readme file, so the moral of the story is: even if a package doesn’t work for your task or problem, you may still be able to use parts of its code.

    The Readme on https://github.com/sellorm/nhsnumber, which you can see when you scroll down past all of the file-directory-looking things, included a link to the Wikipedia page on NHS Numbers and their check algorithm: https://en.wikipedia.org/wiki/NHS_number. Again, I had no idea this existed!

    Wikipedia, like R, is open source, and one of the links was out of date. I’d found a really useful document on NHS Numbers a while back (https://www.closer.ac.uk/wp-content/uploads/CLOSER-NHS-ID-Resource-Report-Apr2018.pdf) so, in the spirit of open source, I updated the Wikipedia page. Proud moment!

    The algorithm is quite simple and another package on GitHub https://github.com/samellisq/nhsnumbergenerator generates numbers using it but I didn’t spend too long on these packages as I decided to do my own script, nothing fancy, no loops, functions or packages…

    library(tidyverse)
    
    Validity <- df %>% 
      mutate(length = nchar(NHSNumber),
             A = as.numeric(substring(NHSNumber,1,1)) *10,
             B = as.numeric(substring(NHSNumber,2,2)) *9,
             C = as.numeric(substring(NHSNumber,3,3)) *8,
             D = as.numeric(substring(NHSNumber,4,4)) *7,
             E = as.numeric(substring(NHSNumber,5,5)) *6,
             G = as.numeric(substring(NHSNumber,6,6)) *5,
             H = as.numeric(substring(NHSNumber,7,7)) *4,
             I = as.numeric(substring(NHSNumber,8,8)) *3,
             J = as.numeric(substring(NHSNumber,9,9)) *2,
             End = as.numeric(substring(NHSNumber,10,10)),
             Total = A+B+C+D+E+G+H+I+J,
             Remainder = Total %% 11,
             Digit = 11- Remainder,
             Summary = case_when(Digit == 10 ~ 999,
                                 Digit == 11 ~ 0,
                                 TRUE ~ Digit),
             Valid = case_when(Summary == End & length == 10 ~ TRUE,
                               TRUE ~ FALSE
                               )) %>% 
      filter(Valid == FALSE)
    

    Importing data

    Data is imported into R via Excel or, in my case, a SQL connection, which worked better as a couple of our systems hold many hundreds of thousands of NHS Numbers and I wanted to check them all. Excel might have got a bit of indigestion from that. It also meant I wasn’t storing NHS Numbers anywhere just to run them through R. My systems are secure, but it’s always worrying having such a large amount of sensitive data in one place, outside an even more secure system like a clinical database.

    Data doesn’t always import into R the same way, and for mine I needed to remove NULLs and convert the NHSNumber column, which was a factor, to character and then to numeric format:

    df <- data %>% 
      filter(!is.na(NHSNumber)) %>% 
      mutate(NHSNumber = as.numeric(as.character(NHSNumber)))
    

    Factors

    Factors are a new thing for me but, as I understand it, they put data into groups in the background. For example, if you had male, female and other, these would be 3 factor levels, and if you join that data to another data set which doesn’t happen to have “other” as a category, the level would still linger around in the data, appearing in a count as 0 because the table still holds the factor information even though there is no data in that group.
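    A minimal sketch of that lingering-level behaviour (my own toy example, not from the NHS data):

```r
# A factor keeps all of its defined levels, even levels with no data
sex <- factor(c("male", "female", "female"),
              levels = c("male", "female", "other"))

table(sex)
# "other" still appears in the count, with 0 entries
```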

    To manipulate factor data I’ve seen others change the format to character, so that’s what I did. I’ve done similar things in SQL with integers and dates; sometimes you have to change the data to an ‘intermediate’ format.

    Validity code – explanation

    This code uses dplyr from tidyverse to add columns:

    Validity <- df %>% 
      mutate(length = nchar(NHSNumber), 
    

    An NHS number must be 10 characters and nchar() reminds me of LEN() in SQL which is what I would use to check the length of data.

    One thing I didn’t code, but I guess people may want to check, is that all characters are numeric and no letters have been introduced erroneously. That’s something to consider.
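    That check could be done with a regular expression; a possible sketch (my own addition, not part of the original script):

```r
# grepl() flags values that are exactly 10 digits and nothing else
candidates <- c("9434765919", "943476591A", "94347659")

all_digits <- grepl("^[0-9]{10}$", candidates)
all_digits
# only the first value passes: letters or wrong lengths fail the pattern
```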

         A = as.numeric(substring(NHSNumber,1,1)) *10,
         B = as.numeric(substring(NHSNumber,2,2)) *9,
         C = as.numeric(substring(NHSNumber,3,3)) *8,
         D = as.numeric(substring(NHSNumber,4,4)) *7,
         E = as.numeric(substring(NHSNumber,5,5)) *6,
    

    I didn’t use F as it’s shorthand for FALSE, which changed the meaning and the colour of the letter in the editor. It can be used as a column name, but I left it out for the aesthetics of the script!

         G = as.numeric(substring(NHSNumber,6,6)) *5,
         H = as.numeric(substring(NHSNumber,7,7)) *4,
         I = as.numeric(substring(NHSNumber,8,8)) *3,
         J = as.numeric(substring(NHSNumber,9,9)) *2,
    

    There is possibly a much smarter way of writing this, perhaps a loop. Samuel Ellis’ package nhsnumbergenerator creates a table using sequences:

    checkdigit_weights = data.frame(digit.position=seq(from=1, to = 9, by =1),
                                    weight=seq(from=10, to = 2, by =-1)
    )
    

    Note that he uses = rather than <-.

    I like this code. It’s very concise, but it was just easier writing out the multiplications for each part of the NHS Number: the first number is multiplied by 10, the second by 9, the third by 8 and so on.
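    For anyone who does want a vectorised version, here is a sketch of the same algorithm in one small function (my own hypothetical helper, not from either package):

```r
# Multiply digits 1-9 by weights 10 down to 2, then derive the check digit
nhs_check_digit <- function(nhs) {
  digits    <- as.numeric(strsplit(nhs, "")[[1]])
  remainder <- sum(digits[1:9] * (10:2)) %% 11
  digit     <- 11 - remainder
  if (digit == 11) digit <- 0
  digit  # a result of 10 means the number can never be valid
}

nhs_check_digit("9434765919")  # 9, which matches the 10th digit
```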

         End = as.numeric(substring(NHSNumber,10,10)),
    

    I put this in because the final number in the sequence is the check digit (also called the checksum) that the calculated Digit is compared against.

         Total = A+B+C+D+E+G+H+I+J,
    

    Just adding all the multiplied columns together.

         Remainder = Total %% 11,
    

    This gives the remainder from a division by 11.

         Digit = 11- Remainder,
    

    11 minus the remainder gives the expected check digit.

         Summary = case_when(Digit == 10 ~ 999,
                             Digit == 11 ~ 0,
                             TRUE ~ Digit),
    

    Lots of people use if/else in R but I like case_when() because it’s like SQL. The lines run in logical order:

    • when the digit is 10 it’s invalid, so I put in 999 as that’s so high it can never match the check digit (which is a single digit),
    • if it’s 11 then change that to 0, following the methodology,
    • else just use the digit number.

    Actually, thinking about it, I probably don’t need the ‘10 becomes 999’ step, as 10 could never match a single-digit number. Perhaps that’s redundant code.

         Valid = case_when(Summary == End & length == 10 ~ TRUE,
                           TRUE ~ FALSE
                           )) %>% 
    

    case_when() again, but this time to get the TRUE/FALSE validity. If the number generated is the same as the last digit of the NHS Number AND the length of the NHS Number is 10, then TRUE, else FALSE.

    I liked this line of code: it felt a bit strange saying TRUE then FALSE, but it’s logical!

      filter(Valid == FALSE)
    

    Just bring back what isn’t valid.

    Did it work?

    I think it did. If I got anything wrong I’d love to get feedback, but I ran many hundreds of thousands of NHS Numbers through it and found…..

    No invalid numbers

    I possibly should have checked beforehand, but I suspect our clinical systems don’t allow any incorrect NHS Numbers to be entered at source. Still, it was fun, and it could be applied to manually entered data kept in spreadsheets, for example.

    An interesting blog

    As there were no SQL scripts for NHS Number validation on GitHub, I did a quick search on the internet and found this: https://healthanalyst.wordpress.com/2011/08/21/nhs-number-validation/ which is a blog by @HealthAnalystUK from 2011. The reason I’m referring to it here is that HealthAnalystUK not only shared SQL code that looked very similar to the code above, but also R code written as a function:

    NHSvalidation <- function(NHSnumber){
    
      NHSlength <- length(NHSnumber)
    
      A <- as.numeric(substr(NHSnumber, 1, 1))
      B <- as.numeric(substr(NHSnumber, 2, 2))
      C <- as.numeric(substr(NHSnumber, 3, 3))
      D <- as.numeric(substr(NHSnumber, 4, 4))
      E <- as.numeric(substr(NHSnumber, 5, 5))
      F <- as.numeric(substr(NHSnumber, 6, 6))
      G <- as.numeric(substr(NHSnumber, 7, 7))
      H <- as.numeric(substr(NHSnumber, 8, 8))
      I <- as.numeric(substr(NHSnumber, 9, 9))
      J <- as.numeric(substr(NHSnumber, 10, 10))
    
      if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))
      {UniformNumberCheck <- 1}
      else
      {UniformNumberCheck <- 0}
    
      Modulus <- ((A*10)+(B*9)+(C*8)+(D*7)+(E*6)+(F*5)+(G*4)+(H*3)+(I*2))
      Modulus <- (11 - (Modulus %% 11))
    
      if (((Modulus==J) & (UniformNumberCheck!=1) & (NHSlength==10)) |
          ((Modulus==11) & (J==0) & (UniformNumberCheck!=1) & (NHSlength==10)))
      {ReturnValue <- 1}
      else
      {ReturnValue <- 0}
    
      return(ReturnValue)
    }
    

    I hadn’t coded the check for repeating numbers:

    if ((A==B)&(B==C)&(C==D)&(D==E)&(E==F)&(F==G)&(G==H)&(H==I)&(I==J))

    and I couldn’t find any reference to this in the Wikipedia page or the document from the University of Bristol, so I’m unsure if this is part of the methodology. If it is, then I’ve seen at least one NHS Number that would fail this test.
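    For completeness, that repeating-number test can be written more compactly; a sketch with a hypothetical helper name, doing the same job as the long if statement above:

```r
# TRUE when every character in the string is the same digit
is_uniform <- function(nhs) {
  length(unique(strsplit(nhs, "")[[1]])) == 1
}

is_uniform("4444444444")  # TRUE
is_uniform("9434765919")  # FALSE
```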

    A conclusion

    If anyone has created a package or script for NHS number checks and wants to share please feel free to write a blog. NHS-R Community also has a GitHub repository at https://github.com/nhs-r-community where code like this blog can go (I wrote this in Rmarkdown).

    Blogs can be emailed to nhs.rcommunity@nhs.net and get checked by a group of enthusiast volunteers for publishing.

    This blog was written by Zoë Turner, Senior Information Analyst at Nottinghamshire Healthcare NHS Trust.

    @Letxuga007 and @AppliedInfoNott

  • SQL Server Database connections in R

    Getting data into R

    One of the things I found most difficult when learning R was getting data from our SQL Servers into R to analyse. It is easy to load csv files, or pull example datasets from packages, but a little more knowledge is required to connect to external databases. I think this is a common problem for my NHS colleagues when learning R, and probably for others too. This post is a brief introduction to the two main ways to pull data into R: using RODBC, and using dplyr’s ‘DBI-compliant’ database connections.
    I’ll be talking about connections with Microsoft SQL Server (over a local network), but this can also extend to other types of database by using different drivers, or other network set-ups using the right authentication.

    Where is the data stored?

    If you are using SQL Server to store your data, it is written into files on the database server and accessed using SQL scripts in your chosen interface (I use SQL Server Management Studio). Whilst the data are stored on disk on the SQL Server, R stores data in memory (RAM) on your machine. This has the advantage of quick access, but you can easily run out of memory with larger datasets, or processes that build larger matrices (like some types of statistical model). If memory is an issue, you will probably get the error message: Error: cannot allocate vector of size .... If you hit this situation, I’d recommend trying the data.table package as an alternative to data.frames. It is much faster, and has a lower memory footprint. Here’s a great blog post about it from Megan Stodel: https://www.meganstodel.com/posts/data-table/

    Two common methods

    There are two common methods of connection, both of which use Open Database Connectivity (ODBC) drivers:

    1. The RODBC package.
    2. The DBI system, using dplyr, dbplyr and odbc.

    Both of these create a connection using a ‘connection string’ with the details of the server, credentials etc. This string is used to create a connection object, from which we can pull data into R or manipulate it in the database.

    A note on security

    To prevent publishing our server or database names in this post, I’ve used an alias that goes and looks them up internally. There are a few options for doing this in your own code, but I’ve added them to my ‘.Renviron’ file, as SERVER and DATABASE variables. My code looks them up each time using the Sys.getenv function that you’ll see in the examples below.
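    For illustration, the variables sit in ‘.Renviron’ as plain KEY=value lines; in the sketch below Sys.setenv() is used only so the example runs standalone, and the names are placeholders:

```r
# In ~/.Renviron you would have lines like (restart R after editing):
#   SERVER=my_server_name
#   DATABASE=my_database_name
# Set in-session here purely to demonstrate the lookup:
Sys.setenv(SERVER = "my_server_name", DATABASE = "my_database_name")

Sys.getenv("SERVER")  # "my_server_name"
```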

    This has meant using the paste function to concatenate the variables together in the RODBC example. You won’t have to do this in your own work if you replace the server and database names directly.

    1. RODBC

    This is the simpler of the two interfaces, and uses slightly older code. It can be used to connect to anything that uses Open Database Connectivity (ODBC). I’ll define a connection string to a database server, a database, and a table called ‘MyTable’ that has some dummy data in it. If you haven’t got any of the packages used in this post, you can install them with: install.packages("RODBC") for example.

    library("RODBC")
    
    # Connection string
    RODBC_connection <- odbcDriverConnect(paste('driver={SQL Server};server=',
                                                Sys.getenv("SERVER"),
                                                ';database=',
                                                Sys.getenv("DATABASE"),
                                                ';trusted_connection=true',
                                                sep = ""))
    
    # e.g. with a server called "Cliff" and a database called "Richard" your string would be:
    # driver={SQL Server};server=Cliff;database=Richard;trusted_connection=true
    
    # Load a whole table
    dt1 <- sqlFetch(channel = RODBC_connection, sqtable = "MyTable")
    
    # Load data from SQL query
    dt2 <- sqlQuery(channel = RODBC_connection, query = "select TOP 100 * from MyTable")

    Quite straightforward to use! In the example above, I specified trusted_connection=true. In a windows environment, this passes your windows credentials to the server. Since we use these for access permissions on our SQL Servers, we can use this method with no issues. You can, instead, specify a username (uid) and a password (pwd): see the help files for more details, using: ?odbcDriverConnect.
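    As a sketch of the credential-based version (uid and pwd below are placeholders; see ?odbcDriverConnect for the full set of options):

```r
# Build the same connection string but with explicit credentials;
# the connect call is left commented out as it needs a live server
conn_str <- paste0("driver={SQL Server};server=", Sys.getenv("SERVER"),
                   ";database=", Sys.getenv("DATABASE"),
                   ";uid=my_user;pwd=my_password")

# RODBC_connection <- RODBC::odbcDriverConnect(conn_str)
```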

    You can also use RODBC to write back to database tables, choosing whether or not to append your results using the append and safer arguments. Not appending means you will overwrite the table:

    sqlSave(channel = RODBC_connection, dat = dt2, tablename = "Mytable_version2", append = FALSE, safer = FALSE) 

    There are lots of other functions included with RODBC to allow you to see structures etc. The package vignette is a very helpful place to go for this, along with the help files.

    Remember to disconnect at the end of your session:

    odbcClose(RODBC_connection) 

    If you do this a lot, you might find Gary Hutson’s recent post, showing how to wrap some of this into a function, a useful addition. Check it out here: http://hutsons-hacks.info/rodbc-helper-function.

    2. DBI and dplyr

    The RODBC interface is simple and quick, and you may not need to consider another approach, but I prefer to use the tidyverse functions linked to dplyr. These functions have matured over the last couple of years, and have a few major advantages:

    • Work with tidyverse functions, including dplyr verbs and the pipe %>%
    • Faster than RODBC to import data
    • Can be used to work with data in the database, without importing it into R.

    The connection string is slightly different, and we require a few more packages to make this work. You need to make sure you have the following installed:

    • dplyr – to make the tbl and use it; we’ll work with dplyr syntax.
    • DBI – a common database interface engine for use in S and R (see: https://cran.r-project.org/web/packages/DBI/vignettes/DBI-1.html)
    • dbplyr – this add-on package allows translation from dplyr to SQL.
    • odbc – provides the ODBC drivers, but you could use the functions below with other drivers instead.
    
    library(dplyr)
    library(dbplyr)
    library(odbc)
    library(DBI)

    DBI_Connection <- dbConnect(odbc(),
                                driver = "SQL Server",
                                server = Sys.getenv("SERVER"),
                                database = Sys.getenv("DATABASE"))

    Now we can define a table as if it was part of our R workspace, using the connection object and the name of the table in the database. We can then interact with it directly using dplyr. glimpse() is a useful function that shows you the column names, data types and top few entries:

    MyTable <- tbl(DBI_Connection, "MyTable")

    glimpse(MyTable)
    ## Observations: ??
    ## Variables: 7
    ## Database: Microsoft SQL Server
    ## $ id <int> 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
    ## $ Org <chr> "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "...
    ## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201...
    ## $ month <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, ...
    ## $ Category_1 <dbl> 35395, 21126, 9248, 4049, 5323, 16867, 9916, 12404,...
    ## $ Cateogry_2 <dbl> 39293, 24860, 11031, 5812, 6876, 18475, 12976, 1657...
    ## $ events <int> 1986, 429, 460, 301, 289, 1172, 446, 756, 663, 874,...

    MyTable %>%
      filter(year == 2015) %>%
      group_by(month) %>%
      summarise(AvgEvents = mean(events),
                MaxEvents = max(events),
                N = n()) %>%
      arrange(month)
    ## # Source:   lazy query [?? x 4]
    ## # Database: Microsoft SQL Server
    ## # Ordered by: month
    ##   month AvgEvents MaxEvents     N
    ##   <int>     <int>     <int> <int>
    ## 1     4       624      1986    25
    ## 2     5       658      1948    25
    ## 3     6       671      2068    25
    ## 4     7       669      1879    25
    ## 5     8       630      1981    25
    ## 6     9       649      2011    25
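    You can see the SQL that dbplyr generates for a pipeline like this with show_query(). The sketch below uses dbplyr's simulated SQL Server backend so it runs without a live connection (assuming dbplyr is installed; the tiny data frame is made up):

```r
library(dplyr)
library(dbplyr)

# A lazy table that pretends to live on SQL Server
sim <- tbl_lazy(data.frame(year = 2015L, month = 4L, events = 10),
                con = simulate_mssql())

sim %>%
  filter(year == 2015) %>%
  group_by(month) %>%
  summarise(AvgEvents = mean(events, na.rm = TRUE)) %>%
  show_query()  # prints the translated SELECT ... GROUP BY statement
```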

    dplyr can then be used to do fairly complex things in just a few lines. The example below is not very well thought-out, but it takes data from the database into a summary plot in just a few lines. I’m filtering the data for 2015 and passing it directly into ggplot2. I’ve set a few options for a box plot, but it’s quite minimal, and our data has remained in the database and not been imported to R.

    library(ggplot2) 

    MyTable %>%
      filter(year == 2015) %>%
      ggplot(aes(y = events, x = factor(month), group = factor(month))) +
      geom_boxplot(fill = "dodgerblue2", alpha = 0.6) +
      labs(title = "Monthly Distribution of Events", x = "Month", y = "Events")

    You can, of course, write an SQL query directly using the dbSendQuery function. This executes the query on the server-side only, so if you want the results to be returned back to R, you need to use dbFetch as well. You might need this approach if you are doing fancy queries, or things that are specific to a database environment that don’t yet have translators in dplyr.

    SomeRecords <- dbFetch(dbSendQuery(DBI_Connection, "Select TOP 100 * from MyTable")) 

    #or

    SomeRecords <- dbSendQuery(DBI_Connection, "Select TOP 100 * from MyTable") %>%
      dbFetch()
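    One detail worth knowing: dbSendQuery() leaves an open result set on the connection, and the DBI documentation recommends clearing it with dbClearResult() once you have fetched. A sketch of the pattern, illustrated with an in-memory SQLite database (assuming RSQLite is installed; the pattern is identical for the SQL Server connection above):

```r
library(DBI)

# Stand-in for DBI_Connection: an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "MyTable", data.frame(id = 1:3))

res <- dbSendQuery(con, "SELECT * FROM MyTable")
SomeRecords <- dbFetch(res)
dbClearResult(res)  # free the open result set when done
dbDisconnect(con)
```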

    You may not need to write a custom query for everyday use, but you are still likely to need to pull the data from the server into memory in R sometimes. For me, this is often to build models on it, as that isn’t supported in-database. You can use the collect function for this. For example, using part of the query from earlier as an example:

    MyTable_local <- MyTable %>%
      filter(year == 2015) %>%
      group_by(month) %>%
      summarise(AvgEvents = mean(events),
                MaxEvents = max(events),
                N = n()) %>%
      arrange(month) %>%
      collect()

    print(MyTable_local)
    ## # A tibble: 6 x 4
    ##   month AvgEvents MaxEvents     N
    ##   <int>     <int>     <int> <int>
    ## 1     4       624      1986    25
    ## 2     5       658      1948    25
    ## 3     6       671      2068    25
    ## 4     7       669      1879    25
    ## 5     8       630      1981    25
    ## 6     9       649      2011    25

    You can also write back to a database using the dbWriteTable function. The following code snippet writes a new table to my current connection, called ‘NewDatabaseTable’ using the local data.frame called MyTable_local (that we created in the last section). The append option indicates whether to add to an existing table or not, and overwrite is just what it sounds like:

    dbWriteTable(DBI_Connection, "NewDatabaseTable", MyTable_local, overwrite = TRUE)
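    And, as with RODBC, it's good practice to close the DBI connection at the end of your session:

```r
# Disconnect when finished (uses the DBI_Connection object from above)
dbDisconnect(DBI_Connection)
```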

    Summary

    Database connections are common in most analysis and data science settings. R can easily connect to relational databases, reading and writing with either RODBC or dplyr/DBI packages. I prefer the DBI route as it is faster, plugs into dplyr allowing ‘piping’ between functions, and allows us to work with data whilst it is still on the database server. This post just scratches the surface, but there are many more options for executing procedures, using different schema etc. This post has been targeted at working with Microsoft SQL Server, but these processes will work well with other databases by switching to the right drivers.

    This blog was written by Chris Mainey, Intelligence Analyst from the Health Informatics team at University Hospitals Birmingham NHS Foundation Trust. The blog was originally posted here: https://mainard.co.uk/post/database-connections-in-r/