Joining the dots between Britain’s historical railways using Wikidata – Part One

A bit of background

The evening before Code The City 18, our one-day mini-hack event, I started to think about which fun project to spend the day on. After reading Ian Watt’s blogpost about Wikidata and spending 10 minutes or so playing around with it, I decided a topic for further experimentation was required.

At the time of writing, I’m just over a third of the way through my very interesting part-time online MA in Railway Studies at the University of York. Looking at Britain’s railways from their very beginning, there are many railway companies from 1821 onwards. Some of these companies merged, some were taken over, others just disappeared, whilst others were replaced by new companies. All these amalgamations eventually led to the “Big Four” groupings in 1923 and then on to British Railways with the railway nationalisation of 1948. British Railways rebranded as British Rail in 1965 and then splintered into numerous companies as a result of the denationalisation of the 1990s.

With the railway companies appearing in some form or another in Wikipedia, I thought it would be useful to be able to pick any railway company and view the chain of companies that led to it and those that followed. The ultimate goal would be to be able to bring up the data for British Rail and then see the whole past unfold to the left and the future unravel to the right. In theory at least, Wikidata should allow me to do that. 

No software coding skills are required to see the results of my experimentation: by clicking on the links provided (usually directly after the code) it is possible to run the queries and see what happens. However, using the code provided as a start, it is possible to build on the examples to find out things for yourself.

Understanding Wikidata and SPARQL

SPARQL is the query language used to retrieve various data sets from Wikidata via the Wikidata Query Service.

As is always the case with anything software related, the examples and tutorials never seem to handle those edge cases that you hit within the first five minutes. Maybe I hit these cases so soon because I jumped straight from the “hello world” of requesting all the railway companies formed in the UK to trying to build the more complex web of railway companies, rather than working my way through all the simpler steps. However, my belief is to fail quickly, leaving plenty of time to fail some more before succeeding. After all, you never see a young child plan out a strategy when learning to get the different shaped blocks through the correct holes.

At the time of writing…

Comments about the state of certain items of data were accurate at the time I wrote this article. As one of the big features of Wikidata is that it is constantly being updated, expanded and corrected, the data referenced may have changed by the time you read this. Some of the changes are ones I’ve made in reaction to my discoveries, but I have left some out there for others to fix.

A simple list

First off, I created a simple SPARQL query to request all the railway companies that were formed in the UK.

SELECT ?company ?companyLabel
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

The output of this query can be seen by running it yourself here and clicking on the white-on-blue arrow displayed on the Wikidata Query Service console. It is safe to modify the query in the console without messing up my query, as any change causes a new bookmarked query to be created. So please experiment, as that’s the only way to learn.

Now what does the query mean and where do all those magic numbers come from?

  • wdt:P31 wd:Q249556 asks for all items whose instance of (P31) property has the value railway company (Q249556). The wdt: prefix selects Wikidata’s direct (“truthy”) statements.
  • wdt:P17 wd:Q145 narrows those results to items whose country (P17) property is set to United Kingdom (Q145).

Where did I get those numbers from? First, I went to Wikipedia and searched for a railway company, LMS Railway, and got to the page for London, Midland and Scottish Railway. From here I went to the Wikidata item for the page.

Wikipedia page for LMSR that shows how to get to the Wikidata

From here I hovered my pointer over instance of, railway company, country and United Kingdom to find out those magic numbers.

Wikidata page for LMSR
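Hovering works, but you can also ask the query service to look up an item by name. Here is a sketch using the query service’s MWAPI EntitySearch extension to find the item ID for the LMS; the search string is just an illustration, and any label you are hunting for can be substituted:

SELECT ?item ?itemLabel
WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "EntitySearch";
                    mwapi:search "London, Midland and Scottish Railway";
                    mwapi:language "en".
    ?item wikibase:apiOutputItem mwapi:item.
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}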

Some unexpected results

Some unexpected companies turned up in the results list because my query was not specific enough. For example, the Algeciras Gibraltar Railway Company operated in Gibraltar but had its headquarters registered in the UK, so its data has country set to United Kingdom. To filter my results down to just those located in the UK, I tried searching for those that had located in the administrative territorial entity (P131) set to any of the following values:

  • England (Q21) 
  • Northern Ireland (Q26)
  • Scotland (Q22)
  • Wales (Q25)
  • Ireland (Q57695350) (covering 1801 – 1922)

using this query:

SELECT ?company ?companyLabel ?countryLabel
WHERE {
  VALUES ?country { wd:Q21 wd:Q26 wd:Q22 wd:Q25 wd:Q57695350 }
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145; wdt:P131 ?country.
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

However, that dropped my result set from 228 to 25, because not all the companies have that property set.
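One way to keep all 228 results while still showing the location for those companies that record it is to make the location lookup optional. This is only a sketch, reusing the P131 property from above:

SELECT ?company ?companyLabel ?locationLabel
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P131 ?location. }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))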

Note: When trying to find out what values to use, it is often quick and easy to run a simple query to ask Wikidata itself. To find the values for all the UK countries, I wrote the following, which asks for everything whose instance of value is country within the United Kingdom (Q3336843):

SELECT ?country ?countryLabel
WHERE {
  ?country wdt:P31 wd:Q3336843 .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}

Run the query

Dates

In order to see what other information could easily be displayed for the companies, I looked at the list of properties on the London, Midland and Scottish Railway. I saw several dates listed so decided that would be my next area of investigation. There is an inception (P571) date that shows when something came into being, so I tried a query with that:

SELECT ?company ?companyLabel ?inception
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  ?company wdt:P571 ?inception .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))
ORDER BY (lcase(?companyLabel))

Run the query.

This demonstrated two big issues with the data. Firstly, the result set dropped from 228 to 106, indicating that not all the company entries have the inception property set. Secondly, only one company, Scottish North Eastern Railway, had a full date (29th July 1856) specified; the rest had only a year, which the query service displays as 1st January of that year. Adding an OPTIONAL clause around the inception request returns the full data set, with blanks where no inception date is specified.

SELECT ?company ?companyLabel ?inception
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

Railway companies are not a straightforward case when it comes to a start date, as there is no single obvious one. Each railway company required an Act of Parliament to officially enable it to be formed and grant permission to build its railway line(s). This raises the question: should the start date be the date the Act was passed, the date the company was actually formed, or the date the company commenced operating its service? Here is a revised query that gets both the start time (P580) and end time (P582) of the company, if they have been set:

SELECT ?company ?companyLabel ?inception ?startTime ?endTime
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  OPTIONAL { ?company wdt:P580 ?startTime. }
  OPTIONAL { ?company wdt:P582 ?endTime. }
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

Unfortunately, of the 228 results only one, London, Midland and Scottish Railway, has both a startTime and an endTime, and London and North Eastern Railway is the only other with an endTime. Based on these results, it looks like startTime and endTime are not generally used for railway companies. Looking through the data for Scottish North Eastern Railway did turn up a new source of end dates in the form of the dissolved, abolished or demolished (P576) property. Adding a search for this resulted in 9 companies with dissolved dates.

SELECT ?company ?companyLabel ?inception ?startTime ?endTime ?dissolved
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  OPTIONAL { ?company wdt:P580 ?startTime. }
  OPTIONAL { ?company wdt:P582 ?endTime. }
  OPTIONAL { ?company wdt:P576 ?dissolved. }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

There is no obvious pattern to which companies have this property: they range from Scottish North Eastern Railway, dissolved on 10th August 1866, to several that ended with the formation of British Railways, the more recent British Rail ending on 1st January 2001, and the short-lived National Express East Coast (1st January 2007 – 1st January 2009). However, once again the dates are at times misleading: in the case of National Express East Coast, only the year, rather than the full date, is recorded in the inception and dissolved, abolished or demolished properties.
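It is possible to check whether a date is a full date or just a year by asking for its stated precision: in Wikidata’s data model, each date carries a timePrecision value, where 11 means a full day and 9 means only a year. A sketch that exposes the precision of each inception date (p:P571 selects the full statement and psv:P571 its value node):

SELECT ?company ?companyLabel ?inception ?precision
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  ?company p:P571/psv:P571 ?inceptionValue .
  ?inceptionValue wikibase:timeValue ?inception ;
                  wikibase:timePrecision ?precision .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))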

Some of the railway companies, such as Underground Electric Railways Company of London, have another source of dates: qualifiers attached to the railway company value of their instance of statement. It is possible to extract the start and end dates, if they are present, by making use of nested conditional queries. In the line:

OPTIONAL {?company p:P31 [ pq:P580 ?companyStart]. }

the start time (P580) qualifier is extracted from the instance of statement if it exists. Here the p: prefix selects the full statement rather than just its value, and pq: selects a qualifier attached to that statement.

SELECT ?company ?companyLabel ?inception ?startTime ?endTime ?dissolved ?companyStart ?companyEnd
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  OPTIONAL { ?company wdt:P580 ?startTime. }
  OPTIONAL { ?company wdt:P582 ?endTime. }
  OPTIONAL { ?company wdt:P576 ?dissolved. }
  OPTIONAL { ?company p:P31 [ pq:P580 ?companyStart] . }
  OPTIONAL { ?company p:P31 [ pq:P582 ?companyEnd] . }
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

Another set of dates that can be used to work out the start and end of the companies hangs off the values of a very useful pair of properties: replaced by (P1366) and replaces (P1365). This conveniently connects into the next part of my exploration, which will follow in Part Two. Although, as with many railway-related things, the exact time of arrival of Part Two cannot be confirmed.
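To give a taste of what is to come, here is a sketch that lists each company alongside the companies it replaces (P1365) and is replaced by (P1366), where those links exist:

SELECT ?company ?companyLabel ?replacesLabel ?replacedByLabel
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P1365 ?replaces. }
  OPTIONAL { ?company wdt:P1366 ?replacedBy. }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))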

[Header photograph taken by Andrew Sage]

So, how did CTC6 – The History Jam go?

Intro

On 19th and 20th March we found ourselves back at Aberdeen Uni with 35 or so eager hackers looking to bring to life a 3D Virtual Reality historic model of Aberdeen city centre using new open data. So how did it go?

This time we were more prescriptive than at any previous Code The City event. In the run up to the weekend we’d identified several sub-team roles.

  • Locating, identifying and curating historic content
  • Transcribing, formatting and creating valid open data
  • Building the 3D model, fixing and importing images
  • Integrating and visualising the new data in the model

Andrew Gives us an Open Data Briefing

After some breakfast, an intro and a quick tutorial on Open Data, delivered by Andrew Sage, we got stuck in to the work in teams.

Old Books into Open Data

We were lucky to have a bunch (or should that be a shelf-ful?) of city librarians, an archivist and a gaggle of other volunteers working on sourcing and transcribing data into some templates we’d set up in Google Sheets.

Since we’d been given scanned photos of all the shop frontages of Union Street, starting in 1937 (of which more below), we settled on that as the main period to work from.

The Transcribers

The librarians and helpers quickly got stuck into transcribing the records they’d identified – particularly the 1937-38 Post Office Directory of Aberdeen. If my arithmetic is correct, they completely captured the details of 1,100+ businesses in the area around Union Street.

At present these are sitting in a Google Spreadsheet – and we will be working out with the librarians how to present this as well-structured, licensed Open Data. It is also a work in progress, so there are decisions to be made: do we complete the transcription of the whole of Aberdeen, or do we move on to another year, e.g. 1953, which is when we have the next set of shopfront photos?

We have a plan

Music, pictures and sound

At the same time as this transcription was ongoing, we had someone sourcing and capturing music such as might have been around in 1937, and sounds that you might have heard on the street – including various tram sounds – which could be imported into the model.

Sounds of the city

And three of us did some work on beginning an open list of gigs for Aberdeen since the city had both the Capitol Theatre (Queen, AC/DC, Hawkwind) and the Music Hall (Led Zeppelin, David Bowie, Elton John) on Union Street. This currently stands at 735 gigs and growing. Again, we need to figure out when to make it live and how.

The 3D Model

At CTC5 back in November 2015, Andrew Sage had started to build a 3D model of the city centre in Unity. That relied heavily on manually creating the buildings. Andrew’s idea for CTC6 was to use OpenStreetMap data as a base for the model, and to use some scripting to pull the buildings’ footprints into the model.

Oculus Rift Headset and a 1937 Post Office Directory

This proved to be more challenging than expected. Steven Milne has written a great post on his site. I suggest that you read that then come back to this article.

As you’ve hopefully just read, Steve has identified the challenge of using OpenStreetMap data for a project such as this: the data just isn’t complete or accurate enough to be the sole source for the model.

While we could update data – and push it back to OSM, that isn’t necessarily the best use of time at a workshop such as this.

An alternative

There is an alternative to some of that. All 32 local authorities in Scotland maintain a gazetteer of all properties in their area. These are highly accurate, constantly updated, and have Unique Property Reference Numbers (UPRNs) and co-ordinates for all buildings. This data (if it were open) would make projects such as this much easier. While we would still need building shapes to be created in the 3D model, we would have accurate geo-location of all addresses, and so could tie the transcribed data to the 3D map very easily.

By using UPRNs as the master data across each transcribed year’s data, we could match the change in use of individual buildings through time much more easily. There is a real need to get this data released by authorities as open data, or at least under a licence allowing generous re-use. ODI Aberdeen are exploring this with Aberdeen City Council and the Scottish Government.

Fixing photos

The city’s Planning Service gave us scans of photos of the shopfronts of Union Street taken in a number of years: 1937, 1953 and on to the present. Generally the photos are very good, but there are issues: seams between photos run down the centre of buildings, binding tape shows through, and so on.

A split building on Castle Street.

These issues are not very difficult to fix – but they do need someone with competence in Photoshop, some standard guidance, and a workflow to follow.

We started fixing some photos so that they could provide the textures for the buildings of Union Street in the model. But given the problems we were having with the model, and a lack of dedicated Photoshop resource, we parked this for now.

Next steps

Taking this project forward, while still posing some challenges, is far from impossible. We’ve shown that the data for the entire city centre for any year can be crowd-transcribed in just 36 hours. But there are some decisions to be made.

Picking up on the points above, these can be broken down as follows.

Historical Data

  • Licensing model to be agreed
  • Publishing platform to be identified
  • Do we widen geographically (across the city as a whole) or temporally (same area, different years)?
  • Creating volunteer transcribing teams, with guidance, supervision and perhaps a physical space to carry out the work.
  • Identify new data sources (e.g. the Archives were able to offer valuation roll data for the same period – would these add extra data for buildings, addresses, businesses?)
  • Set up a means for the general public to get involved – gamifying the transcription process, perhaps?

Photos

  • Similar to the data above.
  • We need clear CC licences to be generated for the pictures
  • Crowdsource the fixing of the photos
  • Create workflow, identify places for the pictures to be stored
  • Look at how we gamify or induce skilled Photoshop users to get involved
  • Set up a repository of republished, fixed pictures, licensed for reuse, with a proper addressing system and naming – so that individual pictures can be tied to the map and data sources

The 3D Model

  • Build the model
  • Extend the coverage (geographically and through time)
  • Establish how best to display the transcribed data – and to allow someone in the 3D environment to move forward and back in time.
  • Look at how we can import other data such as a forthcoming 3D scan of the city centre to shortcut some development work
  • Look at how we can reuse the data in other formats and platforms (such as Minecraft) with minimum rework.
  • Speed up the 3D modelling by identifying funding streams that could be used to progress this more quickly. If you have suggestions please let us know as a comment below.

Taking all of this forward is quite an undertaking, but it is also achievable if we break the work down into streams and work on those. Some aspects would benefit from CTC’s involvement – but some could be done without us. So, libraries could use the experience gained here to set up transcribing teams of volunteers – and be creating proper open data with real re-use value. That data could then easily be used by anyone who wants to reuse it – e.g. to create a city centre mobile app which allows you to see any premises on Union Street, call up photos from different periods, find out which businesses operated there, etc.

As the model takes shape and we experiment with how we present the data we can hopefully get more attention and interest (and funding?) to support its development. It would be good to get some students on placements working on some aspects of this too.

Aberdeen City Council is working with the Scottish Cities Alliance to replace and improve the Open Data platforms for all seven Scottish cities later this year – and that will provide a robust means of presenting and storing all this open data once in place, but in the meantime we will need to find some temporary alternatives (perhaps on GitHub) until we are ready.

We welcome your input on this – how could you or your organisation help, what is your interest, how could you assist with taking this forward? Please leave comments below.

Code The City 6 – The History Jam was funded by Aberdeen City Council’s Libraries service and generously supported by Eventifier, who provided us with free use of their Social Media platform and its LiveWall for the sixth consecutive time!

History Jam – #CTC6

The History Jam (or Code The City #6 if you are counting) will take place on 19-20 March 2016 at Aberdeen University. You can get one of the remaining tickets here.

As a participant, you’ll be bringing history to life, creating a 3D virtual reality map of a square mile of Aberdeen’s city centre. You’ll be gathering data from a variety of historical sources, transcribing it and creating new open data. You’ll import that into the 3D model.

And there will also be the opportunity to re-use that data in imaginative new ways. So, if you are a Minecraft fan, why not use the data to start building Minecraft Aberdeen?

This is not one of our usual hacks, whatever that is! This time around, instead of you proposing problems to be worked on, we’ve set the agenda, we’ll help form the teams, and we’ll provide you with more guidance and support.

If you come along you’ll learn open data skills. And you’ll get a year’s free membership of the Open Data Institute!

Saturday’s Running Order

09:00 Arrive in time for fruit juices, coffee, pastries, or a rowie.

09:30 Introduction to the day
09:45 Briefing of teams and, if you are new to Open Data, a quick training session

10:15 Split into three streams:

  • Sourcing and curation of data, and structuring capture mechanisms
  • Transcribing, cleaning, and publishing open data
  • Creating the 3D map, importing and visualising the data


Throughout the day we’ll have feedback sessions, presenting back to the room on progress. We’ll write blog posts, create videos, photograph progress.

13:00 Lunch (the best sandwiches in Aberdeen)

More workstream sessions with feedback and questions.

17:30 (or so) Pizza and a drink

We’ll wind up about 8pm or so, if you can stay until then.

Sunday’s Agenda

09:30 arrive for breakfast

10:00 kick off

Morning sessions

12:30 Lunch

Afternoon sessions

16:00 Show and Tell sessions – demonstrate to the room, and a wider audience, and preserve for posterity what you’ve produced in less than 36 hours. You’ll be amazed!

Codethecity 6 – Save the date

Codethecity #6 – History Jam.

March 19-20.

Save the date, or grab yourself a ticket.

Last October, we ran Code The City 5, a hack weekend focussing on history and heritage. One of the projects that we worked on was “Histerical Time Machine”. Over the course of 36 hours, Andrew Sage built a 3D virtual reality model of a small section of the centre of Aberdeen as it was in the 1870s. Alongside this, a group of volunteers looked at a variety of books (remember those!) and extracted some key dates to attach to the model.

CTC6

Normally at Code The City events, participants are encouraged to come up with the ideas and are free to work on what they want. We are going to run Code The City 6, which will be supported by Aberdeen City Libraries, slightly differently. We are all going to work on sub-projects that will contribute to the further development of the 3D historical model and its data needs.

What we’ll be doing

We will have teams working on tasks such as:

  • expanding the geographic coverage of the model. There will be tutorials for developers on how to do that in Unity.
  • transcribing a variety of pre-identified sources of data and putting those into structured formats that will drive not only this project but also be a permanent free source of open data.
  • visualising data within the 3D model and in other ways.
  • writing about the weekend, taking photos, blogging and using social media to tell the world what we are doing.

Who we need

We need a broad range of skills. These include:

  • Transcribers, Note-Takers, Genealogists, Historians, Librarians
  • Developers, Designers, UX Experts, Data Wranglers, Data Scientists
  • Bloggers, Photographers, Videographers

If you come with enthusiasm we’ll find a role for you – and we guarantee you will leave having picked up new skills, and with a sense of having achieved something useful with lasting value.

Bring a laptop to work on if you have one. An HDMI cable is useful for connecting to external monitors. If you want to take pictures or videos please take your own equipment if you can. We’ll make sure there is wifi for you to use. Don’t forget your power cable, and any chargers you might need too!

What does the weekend look like?

We’ll open up on Saturday morning at 9am, then have breakfast and a briefing session at 9.30am. We’ll then break into groups, each with a distinct set of tasks.

During the day, each group will work on their projects and periodically report back to the room on progress.

We’ll have lunch then continue in the same fashion for the afternoon. We’ll have pizza and drinks for tea and those who want to work on into the evening.

On Sunday we’ll start at 10am with breakfast and a quick update. We’ll follow the same pattern as Saturday, with lunch. We’ll wind up with a grand show and tell session in the late afternoon.

Sound good?

Then save the date, or grab yourself a ticket.