Plotting thematic maps with linked-data-frames

This article is by Robin Gower, of Swirrl and Infonomics

Sarah Roberts
Swirrl’s Blog


We’ve been working on an R package for downloading and working with linked data: Linked Data Frames.

Image by Robin Gower

This tutorial shows you how you can use this package to download statistics from PublishMyData sites, along with rich descriptions of reference data for things like geographies and time intervals.

The library takes care of making SPARQL requests and interpreting the results, so you can get on with analysis and visualisation.

We’ve not submitted the package to R’s CRAN repository yet, so you’ll need to install the development version using devtools:

install.packages("devtools")
devtools::install_github("Swirrl/linked-data-frames")
library(ldf)

You then need to find a dataset from a PublishMyData (PMD) site, for example statistics.gov.scot.

We’re going to look at the Dependency Ratio dataset. This describes the proportion of the population who are dependent on those in work, i.e. the proportion of children (aged 0 to 15) and older people (aged 65 and over) expressed as a percentage of people aged 16 to 64.

In order to download the data we’ll need an identifier for this data cube. We can find this at the top on the API tab for this dataset under the heading URI: http://statistics.gov.scot/data/population-estimates-dependency.

To download the cube and its reference data we use the get_cube function:

cube <- get_cube("http://statistics.gov.scot/data/population-estimates-dependency",
                 endpoint = "https://statistics.gov.scot/sparql",
                 include_geometry = TRUE)

This returns a tibble with a row per observation and a column for each of their properties — things like location (reference_area) and date (reference_period). The values of the dependency ratio statistics are in the ratio column.

cube
## # A tibble: 846 x 5
##    reference_area      reference_period measure_type ratio unit_of_measure
##    <ldf_rsrc>          <ldf_ntrv>       <ldf_rsrc>   <dbl> <ldf_rsrc>
##  1 West Lothian        2006             Ratio         49.6 People
##  2 West Lothian        2007             Ratio         49.7 People
##  3 West Lothian        2004             Ratio         49.6 People
##  4 West Lothian        2005             Ratio         49.6 People
##  5 West Lothian        2009             Ratio         50.4 People
##  6 West Lothian        2010             Ratio         50.8 People
##  7 West Lothian        2008             Ratio         50.1 People
##  8 West Dunbartonshire 2016             Ratio         55.2 People
##  9 West Dunbartonshire 2017             Ratio         55.9 People
## 10 West Dunbartonshire 2015             Ratio         54.5 People
## # … with 836 more rows

As you can see, the other “reference data” columns have special types: LDF Resource (ldf_rsrc) and Interval (ldf_ntrv). The values in these columns are themselves resources with orthogonal attributes - i.e. descriptions that wouldn’t fit into this table.

The reference areas, for example, are described in another table with columns for their label, notation, and geometry etc.

description(cube$reference_area)
## # A tibble: 47 x 5
##    uri               label     notation boundary               parent
##    <chr>             <chr>     <chr>    <chr>                  <chr>
##  1 http://statistic… West Dun… S120000… POLYGON ((-4.60987426… http://statistic…
##  2 http://statistic… West Lot… S120000… POLYGON ((-3.74402118… http://statistic…
##  3 http://statistic… Scotland  S920000… MULTIPOLYGON (((-5.10… <NA>
##  4 http://statistic… Clackman… S120000… POLYGON ((-3.62835330… http://statistic…
##  5 http://statistic… Dumfries… S120000… MULTIPOLYGON (((-3.54… http://statistic…
##  6 http://statistic… East Ayr… S120000… POLYGON ((-4.24690760… http://statistic…
##  7 http://statistic… East Lot… S120000… MULTIPOLYGON (((-2.36… http://statistic…
##  8 http://statistic… East Ren… S120000… POLYGON ((-4.52992477… http://statistic…
##  9 http://statistic… Na h-Eil… S120000… MULTIPOLYGON (((-6.26… http://statistic…
## 10 http://statistic… Falkirk   S120000… POLYGON ((-4.02012977… http://statistic…
## # … with 37 more rows

If we were to record these reference data attributes on the main observation table (e.g. by adding columns like reference_area_geometry), we’d end up with columns full of duplicates, because every area appears multiple times (once for each year of data). For boundaries in particular this would quickly use up a lot of memory redundantly. It would also mean that we’d need to coordinate updates on multiple rows at once (e.g. if a geometry changed).

This approach — having separate but related tables — is called database normalisation and it’s the foundation of relational database systems and indeed the tidy data approach of R’s tidyverse.
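To make the idea concrete, here’s a minimal sketch in base R. Everything in it is made up for illustration: the example.org URIs, the labels, the toy polygons and the ratio values don’t come from statistics.gov.scot, and the column names simply mirror the ones used in this tutorial. It keeps the observations and the area descriptions in separate tables, then joins them on the URI when a single table is needed.

```r
# Hypothetical observation table: one row per area and year.
observations <- data.frame(
  reference_area_uri = c("http://example.org/area/a",
                         "http://example.org/area/a",
                         "http://example.org/area/b"),
  year  = c(2016, 2017, 2017),
  ratio = c(55.2, 55.9, 50.1),
  stringsAsFactors = FALSE
)

# Hypothetical description table: one row per area, keyed by URI.
areas <- data.frame(
  uri      = c("http://example.org/area/a", "http://example.org/area/b"),
  label    = c("Area A", "Area B"),
  boundary = c("POLYGON ((0 0, 1 0, 1 1, 0 0))",
               "POLYGON ((1 1, 2 1, 2 2, 1 1))"),
  stringsAsFactors = FALSE
)

# Denormalise: copy each area's description onto every matching observation.
denormalised <- merge(observations, areas,
                      by.x = "reference_area_uri", by.y = "uri")

# The normalised table stores each boundary once; the joined table
# repeats it for every year in which the area appears.
nrow(areas)
nrow(denormalised)
sum(duplicated(denormalised$boundary))
```

With real boundaries running to thousands of coordinates, that repetition is where the memory cost comes from.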

Here we’re relating the observation and reference data tables using URIs. This is the basic principle of linked-data (explained in our introduction to RDF).

If you’re interested in learning more about how to work with these resource descriptions then you might like to read the introduction to LDF in the package documentation.

We can use these descriptions to support analysis and visualisation. First we’ll use the geometries to plot the data on a thematic map.

The modern approach to spatial data in R is to create a simple features object — this is a data frame with a particular column chosen to be the active geometry. We’ll need to “denormalise” the observation table and area descriptions into a single data frame for this purpose.

Since we only want to present one value per area, we’ll filter to a slice for the latest year. We are also going to lift the area’s URI out of the description into the table so that we can use it for the join.

library(dplyr)

latest_year <- cube %>%
  filter(reference_period == max(reference_period)) %>%
  mutate(reference_area_uri = uri(reference_area))

Then we’ll create an sf data frame with the geometries, parsing the well-known-text (i.e. WKT, pronounced “wicket”) boundaries as WGS84 longitude/latitude coordinates, which is the coordinate reference system that the code below specifies. We join this to the slice of observations to create the data we need for the map.

library(sf)

geometries <- st_as_sf(description(cube$reference_area), wkt="boundary", crs="WGS84")
map_data <- left_join(geometries, latest_year, by=c("uri"="reference_area_uri"))
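If you’d like to see the WKT parsing step in isolation, the sketch below runs it on a tiny hand-made table. The URIs and triangle coordinates are invented for illustration, and EPSG code 4326 is used as the usual identifier for WGS84 longitude/latitude. It also shows one optional extra that isn’t part of the tutorial’s pipeline: reprojecting to the British National Grid (EPSG:27700) with st_transform, in case you want map units in metres.

```r
library(sf)

# Hypothetical area descriptions with WKT boundaries in longitude/latitude.
areas <- data.frame(
  uri = c("http://example.org/area/a", "http://example.org/area/b"),
  boundary = c(
    "POLYGON ((-4.6 55.9, -4.5 55.9, -4.5 56.0, -4.6 55.9))",
    "POLYGON ((-3.7 55.8, -3.6 55.8, -3.6 55.9, -3.7 55.8))"
  ),
  stringsAsFactors = FALSE
)

# Parse the WKT column into a geometry column, declaring the CRS as WGS84.
areas_wgs84 <- st_as_sf(areas, wkt = "boundary", crs = 4326)

# Optional: reproject to the British National Grid.
areas_bng <- st_transform(areas_wgs84, 27700)

st_crs(areas_wgs84)$epsg
st_crs(areas_bng)$epsg
```

Note that the crs argument to st_as_sf only declares which system the coordinates are already in; it’s st_transform that actually changes them.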

Then we can use ggplot to render a map:

library(ggplot2)

ggplot(map_data) +
  geom_sf(aes(fill=ratio), colour="white") +
  scale_fill_viridis_c("Dependency Ratio") +
  labs(title="Dependency Ratio in Scotland",
       subtitle="Proportion of the population who are dependent on those in work") +
  theme_minimal()

It’s easy to create visualisations like this with the LDF package because a lot of hard work has already gone in behind the scenes to get the data right. For a start, all the datasets have been transformed into the RDF data cube standard which means we can extract and tabulate them using this one process.

We’re also able to leverage standard ways to describe reference data. Since the dataset identifies the areas involved using ONS Geography codes we can use these to find rich descriptions with boundary geometries for plotting.

Likewise, the dataset describes periods using the reference.data.gov.uk interval URI patterns, which means we can get datetime data without needing to parse strings (as we would with CSV). This makes it easier to make time series charts, for example:

ggplot(cube, aes(int_end(reference_period), ratio, group=uri(reference_area))) +
  geom_line(colour="grey") +
  geom_line(colour="black", data=filter(cube, label(reference_area)=="Glasgow City")) +
  labs(title="Dependency Ratio in Glasgow",
       subtitle="Dependency on those in work has fallen in Glasgow while it rose elsewhere",
       x="Year", y="Dependency Ratio") +
  theme_minimal()

By adopting standards we can build tools that are reusable across many datasets.

You can find a more thorough explanation of the package and its features on the linked-data-frames site itself.

We look forward to hearing how you get on!
