Using R to Analyse Linked Data

We’ll be doing a series of guides looking at how to work with linked data. We’ll start off pretty simply, with some basic concepts, examples and tools, and progressively get more technical over the coming months.

We’ll be mostly using our own platform, PublishMyData, as the linked data store in the guides, but we’ll be using it in conjunction with other open source tools to try to demonstrate the power of linked data, as well as showing how simple it can be to use.

INTRODUCTION

This, then, is the first in that series. We’re going to start by looking at how we can use a simple SPARQL query to get data out of the data store. We’re then going to take that data into the open source statistics package, R, and draw a really simple chart.

SPARQL

The first thing we’re going to do is get some data, using a SPARQL endpoint. SPARQL is a language for querying data in RDF format, which is how linked data is stored. It is similar to SQL, but the differences are big enough that it can still be confusing without some basic background knowledge of linked data. To find out more about SPARQL, the lessons from Cambridge Semantics are useful, and Learning SPARQL by Bob DuCharme is a great read. For an introduction to RDF, the Cambridge Semantics lessons are again good, as is the RDF primer from the W3C.

The SPARQL endpoint is a way of interacting with data in a linked data store. We’re going to use the Scottish Government’s Statistics site: http://statistics.gov.scot/ in this example. We can either click Tools > Sparql query, or go to http://statistics.gov.scot/sparql to get to the window we can use to query data in the store. This is what that query window looks like:

This is the default query, and will print a table of the first 100 triples in the Scottish Statistics datastore. We aren’t going to worry about that, though. We want to get some proper data out of the site. If we click Explore > All datasets, we can see a list of all the datasets in the store. Let’s say we want to start looking at data about the first dataset in the list — Alcohol-related Hospital Discharges. In the query window, we can swap the existing query for this:

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sdmx: <http://purl.org/linked-data/sdmx/2009/concept#>
PREFIX data: <http://statistics.gov.scot/data/>
PREFIX sdmxd: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX mp: <http://statistics.gov.scot/def/measure-properties/>
PREFIX stat: <http://statistics.data.gov.uk/def/statistical-entity#>
SELECT ?areaname ?nratio ?yearname ?areatypename WHERE {
?indicator qb:dataSet data:alcohol-related-discharge ;
sdmxd:refArea ?area ;
sdmxd:refPeriod ?year ;
mp:ratio ?nratio .
?year rdfs:label ?yearname .

?area stat:code ?areatype ;
rdfs:label ?areaname .
?areatype rdfs:label ?areatypename .
}

This query says: give me four columns of data, called ‘areaname’, ‘nratio’, ‘yearname’ and ‘areatypename’. The data should come from the dataset Alcohol-related discharge, and we want all observations that have a year and an area associated with them. We want the query to show the area name (not its code in the datastore), the ratio (the number of hospital discharges per 100,000 people), the year, and the type of area. While the SPARQL resources linked to earlier should help here, it can be useful to think of the query as a pattern-matching exercise: the WHERE clause defines a pattern, and the query returns everything in the datastore that matches it. The variables ‘?areaname’, ‘?nratio’ and so on hold the parts of the matching triples that we’ve asked for but not fixed in advance.

The PREFIX declarations at the top of the query are shorthand, and make our query easier to read. Where we use, for example, ‘qb:’ in the WHERE clause, it expands to <http://purl.org/linked-data/cube#>.
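To make the shorthand concrete, here is a minimal sketch (the SELECT pattern is purely illustrative) of the same triple pattern written with and without a prefix; both forms mean exactly the same thing to the endpoint:

```r
# Two equivalent ways of writing the same triple pattern in a query
# string: with a PREFIX declaration, and with the full IRI spelled out.
with_prefix <- 'PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT ?s WHERE { ?s qb:dataSet ?ds . }'

without_prefix <- 'SELECT ?s WHERE {
  ?s <http://purl.org/linked-data/cube#dataSet> ?ds .
}'
```

Either string could be sent to the endpoint and would return the same results; the prefixed form just saves typing once a query uses the same vocabulary more than once or twice.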

I built this query up sequentially, previewing the data each time, until I ended up with these results:

(You can see this query in action on the Scottish Government Statistics site here)

This is perfect now for doing some light analysis. Whilst we could stop and download the data from here to make some charts in Excel, we’re going to go a bit further, and take it into R.

R

So we’ve got the text of that SPARQL query saved, in Notepad++, Sublime, or some other text tool. The next step is to fire up R. I use RStudio, which is more user-friendly than the standard R console. You can get RStudio here: https://www.rstudio.com/products/rstudio/. When you first launch RStudio, this is what you get:

View of R-Studio on Mac OSX

There are a few different sections here. The console pane on the left is where we type our commands. The top right pane shows the datasets, variables and so on in the project, and also keeps track of every command you run (this is super-useful; more on that in a bit). The bottom right pane shows the files and packages (like plugins) in the project, as well as help pages and any charts you make.

To get started, we’re going to want to create a new project. Go File > New Project and then follow the instructions to create a new directory for your project.

Once done, we need to install the package that supports interacting with SPARQL endpoints. In the bottom right pane, click Packages > Install, and install the package called SPARQL. After doing this, the SPARQL package will be available to be called into each of your projects in the future.

The next step is to pull our data about alcohol-related discharge into R, so that we can do some analysis. Although we’ve installed the SPARQL package, we still need to bring the library into our project. To do this, type into the console (the left pane):

library(SPARQL)

This means we now have additional commands available to us in the console. To get some data, we first need to tell R where the endpoint is:

endpoint <- 'http://statistics.gov.scot/sparql'

This creates a variable called ‘endpoint’, and gives it the value ‘http://statistics.gov.scot/sparql’ which is the location of our endpoint.

Next, we need to create another variable, called ‘query’ with the SPARQL query text, from before:

query <- 'PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sdmx: <http://purl.org/linked-data/sdmx/2009/concept#>
PREFIX data: <http://statistics.gov.scot/data/>
PREFIX sdmxd: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX mp: <http://statistics.gov.scot/def/measure-properties/>
PREFIX stat: <http://statistics.data.gov.uk/def/statistical-entity#>
SELECT ?areaname ?nratio ?yearname ?areatypename WHERE {
?indicator qb:dataSet data:alcohol-related-discharge ;
sdmxd:refArea ?area ;
sdmxd:refPeriod ?year ;
mp:ratio ?nratio .
?year rdfs:label ?yearname .

?area stat:code ?areatype ;
rdfs:label ?areaname .
?areatype rdfs:label ?areatypename .
}'

So we now have two variables, one with the address of the endpoint, and one with the query text. We now need to use a command from the SPARQL library to get the data. Here we create a variable called qd and fill it with the results of the function to get the data we want:

qd <- SPARQL(endpoint, query)

We can preview the contents of this result by typing into the console:

head(qd)

This will return similar data to our last preview in the actual SPARQL endpoint from before. We still need to tidy it up, by keeping only the results element of the list that the SPARQL() function returned (and discarding the namespaces element):

df <- qd$results

Previewing this dataframe now gives us a much cleaner set of data:

head(df)
areaname nratio yearname areatypename
1 Stirling 551.6 2008-2009 Council Areas
2 Borders 627.9 2008-2009 Health Board Areas
3 Scotland 828.4 2008-2009 Country
4 Orkney 881.5 2008-2009 Health Board Areas
5 Glasgow City 1601.2 2008-2009 Council Areas
6 S Lanarkshire 674.0 2008-2009 Council Areas

To get the data we want, we need to do some filtering — we only want to look at Council Areas for the purpose of this analysis, and we only really want data for the period 2012–2013. There are two ways that we could do this — the first way is to do the filtering in the SPARQL query, the second is to do it here in R. From a data transfer point of view, it is more efficient to do the filtering in the SPARQL, especially for complex queries, or those where there are lots of results. As this is not a particularly big dataset, and we may want to look at other years’ data, we’ll filter in R.

To do this, we create another variable, in this case called df2013, and load into it only the results that we want:

df2013 <- df[(df$areatypename == 'Council Areas' & df$yearname == '2012-2013'), ]

So we now have a dataframe which only contains values from 2012/13, for council areas in Scotland.
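For comparison, here is a sketch of what the endpoint-side filtering mentioned above could look like. The FILTER clause and the exact label strings ('Council Areas', '2012-2013') are assumptions based on the preview of the data, so you may need to adjust them:

```r
# Sketch: pushing the filtering into the SPARQL query itself, so the
# endpoint only returns the rows we want. The label values in the
# FILTER are assumptions taken from the earlier preview.
query2013 <- 'PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX data: <http://statistics.gov.scot/data/>
PREFIX sdmxd: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX mp: <http://statistics.gov.scot/def/measure-properties/>
PREFIX stat: <http://statistics.data.gov.uk/def/statistical-entity#>
SELECT ?areaname ?nratio WHERE {
?indicator qb:dataSet data:alcohol-related-discharge ;
sdmxd:refArea ?area ;
sdmxd:refPeriod ?year ;
mp:ratio ?nratio .
?year rdfs:label ?yearname .
?area stat:code ?areatype ;
rdfs:label ?areaname .
?areatype rdfs:label ?areatypename .
FILTER(STR(?yearname) = "2012-2013" && STR(?areatypename) = "Council Areas")
}'
```

Running this through SPARQL(endpoint, query2013) would then transfer only the filtered rows, which matters more as queries and result sets grow.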

If we want to make a chart of this data, there are several ways we could do it. For this, I’ve decided to use ggplot2 — a library available within R to draw charts. First, we need to bring the library into our project:

library(ggplot2)

Next, we need to draw the chart. Using the most basic settings, we create the variable ‘c’, and put a chart in it:

c <- ggplot(data = df2013, aes(x=areaname, y=nratio)) + geom_bar(stat='identity')

This creates a chart object, sets the datasource to be our filtered dataframe ‘df2013’, sets the x-axis to be our council areas, the y-axis to be the ratio, and makes it a bar chart. By typing ‘c’ into the console, and pressing enter, you should now see in the Plots tab in the bottom right pane this monstrosity:

There are plenty of resources available on the internet to show the various settings for formatting charts in R. In our case, after a few tweaks, I cleaned it up with this:

c <- ggplot(data = df2013, aes(x=reorder(areaname, -nratio), y=nratio, fill=areaname)) + theme_bw() + geom_bar(stat='identity') + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + ggtitle('Alcohol-related Hospital Discharges 2012–2013 (Rate per 100,000 people)') + labs(x='Council Area', y='Rate per 100,000 people') + theme(legend.position='none')

While this is a much longer line of code, all we are doing is asking the chart to order the areas by the value of the measure, in this case alcohol-related hospital discharges. We are also filling each bar with a different colour, applying a black and white theme to remove the grey background, rotating the x-axis labels to make them easier to read, changing the axis and chart titles, and getting rid of the legend. This is what we end up with:

The result is a much cleaner, easier-to-read chart, and it clearly shows that in 2012/13 the Orkney Islands had the highest rate of alcohol-related hospital discharges. From a policy point of view, this is interesting. As an outside observer, I would have expected the cities to be highest (Glasgow, Edinburgh and so on), certainly not the Orkney Islands, and this probably merits a further look to see whether other years give the same result.
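One quick way to do that check in base R is to pick out, for each year, the council area with the highest rate. The snippet below is a sketch: it uses a small stand-in dataframe with the same columns and made-up values, since the real figures live in the ‘df’ we pulled from the endpoint; in your session you would skip the stand-in and use that df directly.

```r
# Stand-in dataframe with the same columns as the query results.
# The area names and values here are illustrative only.
df <- data.frame(
  areaname = c('Area A', 'Area B', 'Area A', 'Area B'),
  nratio = c(100, 200, 300, 250),
  yearname = c('2011-2012', '2011-2012', '2012-2013', '2012-2013'),
  areatypename = 'Council Areas',
  stringsAsFactors = FALSE
)

# Keep only council areas, then, for each year, take the row with
# the highest rate.
ca <- df[df$areatypename == 'Council Areas', ]
top_by_year <- do.call(rbind, lapply(
  split(ca, ca$yearname),
  function(d) d[which.max(d$nratio), c('yearname', 'areaname', 'nratio')]
))
top_by_year
```

If the same area name appears in every row of the output, the 2012/13 result is a pattern rather than a blip.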

HISTORY

Earlier on, I mentioned that the History pane is pretty useful. Every command you execute in the console is recorded there. You can select any line of code in this tab and send it back to the console, where you can execute it as-is, or modify it first. You can also select multiple lines in one go and send them across. This ability to redo commands is really powerful, and if you keep your history clean by deleting any commands that didn’t work as you hoped, you will have a reproducible set of commands that you can save into a text file. You can then share this with other RStudio users, who can execute your code and replicate your results.
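As a rough sketch of that workflow (the filename is just an example, and savehistory() only works in an interactive session):

```r
# In an interactive session, write everything you've typed to a file.
# The filename here is just an example.
if (interactive()) {
  savehistory('alcohol-analysis.R')
}

# Anyone with R can then replay the saved commands, echoing each one
# as it runs:
# source('alcohol-analysis.R', echo = TRUE)
```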

This is exactly what I have done here — created a portable block of text that anyone with RStudio can use to form the basis for a query from the Scottish Statistics site:


library(SPARQL)
library(ggplot2)
endpoint <- 'http://statistics.gov.scot/sparql'
query <- 'PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sdmx: <http://purl.org/linked-data/sdmx/2009/concept#>
PREFIX data: <http://statistics.gov.scot/data/>
PREFIX sdmxd: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX mp: <http://statistics.gov.scot/def/measure-properties/>
PREFIX stat: <http://statistics.data.gov.uk/def/statistical-entity#>
SELECT ?areaname ?nratio ?yearname ?areatypename WHERE {
?indicator qb:dataSet data:alcohol-related-discharge ;
sdmxd:refArea ?area ;
sdmxd:refPeriod ?year ;
mp:ratio ?nratio .
?year rdfs:label ?yearname .

?area stat:code ?areatype ;
rdfs:label ?areaname .
?areatype rdfs:label ?areatypename .
}'
qd <- SPARQL(endpoint,query)
df <- qd$results
df2013 <- df[(df$areatypename == 'Council Areas' & df$yearname == '2012-2013'), ]
c <- ggplot(data = df2013, aes(x=reorder(areaname, -nratio), y=nratio, fill=areaname)) + theme_bw() + geom_bar(stat='identity') + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) + ggtitle('Alcohol-related Hospital Discharges 2012–2013 (Rate per 100,000 people)') + labs(x='Council Area', y='Rate per 100,000 people') + theme(legend.position='none')
c

Feel free to paste this into your own installation of RStudio, and change some of the parameters, or play around with the design of the chart. For example — you could see whether the Orkney Islands have the highest alcohol-related discharge rates in other years, or whether 2012/13 is just a blip.

The next instalment will look at bringing two datasets in from the datastore, and testing the correlation between them, again in R.