Joining the dots between Britain’s historical railways using Wikidata – Part One

A bit of background

The evening before Code The City 18 I started to think about what fun project to spend the day doing at our one day mini-hack event. After reading Ian Watt’s blogpost about Wikidata and spending 10 minutes or so playing around with it, I decided a topic for further experimentation was required.

At the time of writing, I’m just over a third of the way through my very interesting part-time online MA Railway Studies at University of York. Looking at Britain’s railways from their very beginning, there are many railway companies from 1821 onwards. Some of these companies merged, some were taken over, others just disappeared whilst others were replaced by new companies. All these amalgamations eventually led to the “Big Four” groupings in 1923 and then on to British Railways in 1948’s railway nationalisation. British Railways rebranded as British Rail in 1965 and then splintered into numerous companies as a result of the denationalisation of the 1990s.

With the railway companies appearing in some form or another in Wikipedia, I thought it would be useful to be able to pick any railway company and view the chain of companies that led to it and those that followed. The ultimate goal would be to be able to bring up the data for British Rail and then see the whole past unfold to the left and the future unravel to the right. In theory at least, Wikidata should allow me to do that. 

No software coding skills are required to see the results of my experimentation: by clicking on the links provided (usually directly after the code) it is possible to run the queries and see what happens. However, using the code provided as a start, it is possible to build on the examples to find out things for yourself.

Understanding Wikidata and SPARQL

SPARQL is the query language used to retrieve various data sets from Wikidata via the Wikidata Query Service.

As is always the case with anything software related, the examples and tutorials never seem to handle those edge cases that you seem to hit within the first 5 minutes. Maybe I hit these cases so soon due to jumping straight from the “hello world” of requesting all the railway companies formed in the UK to trying to build the more complex web of railway companies rather than working my way through all the simpler steps? However, my belief is to fail quickly, leaving plenty of time left to fail some more before succeeding, after all you never see a young child plan out a strategy when they are learning to get the different shaped blocks through the correct holes.

At the time of writing…

Comments about the state of certain items of data were relevant at the time I wrote this article. As one of the big features of Wikidata is it constantly being updated, expanded and corrected, the data referenced may have changed by the time you read this. Some of the changes are those I’ve made in reaction to my discoveries, but I have left some out there for others to fix.

A simple list

First off, I created a simple SPARQL query to request all the railway companies that were formed in the UK.

SELECT ?company ?companyLabel
WHERE {
?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
SERVICE wikibase:label {
bd:serviceParam wikibase:language “en”.
}
}
ORDER BY (lcase(?companyLabel))

Run the query

The output of this query can be seen by running it yourself here by clicking on the white-on-blue arrow displayed on the Wikidata Query Service console. It is safe to modify the query in the console without messing up my query as any changes cause a new bookmarked query to be created. So please experiment as that’s the only way to learn.

Now what does the query mean and where do all those magic numbers come from?

  • wdt:P31 means get me all Wikidata triples (wdt) that have the property instance of  (P31) that is has a value of railway company (Q249556).
  • wdt:P17 means get me all of the results so far that have the property country (P17) set to United Kingdom (Q145).

Where did I get those numbers from? First, I went to Wikipedia and searched for a railway company, LMS Railway, and got to the page for London, Midland and Scottish Railway. From here I went to the Wikidata item for the page.

Screen grab of Wikipedia page for LMSR that shows how to get to the Wikidata
Wikipedia page for LMSR that shows how to get to the Wikidata

From here I hovered my pointer over instance ofrailway companycountry and United Kingdom to find out those magic numbers.

Screen grab of the Wikidata page for LMSR
Wikidata page for LMSR

Some unexpected results

Some unexpected companies turned up in the results list due to my query not being specific enough. For example, Algeciras Gibraltar Railway Company, located in Gibraltar but with headquarters registered in the UK the data has its country as United Kingdom. To filter my results down to just those that are located in the UK I tried searching for those that had the located in the administrative territorial entity (P131) with any of the following values:

  • England (Q21) 
  • Northern Ireland (Q26)
  • Scotland (Q22)
  • Wales (Q25)
  • Ireland (Q57695350) (covering 1801 – 1922)

using this query:

SELECT ?company ?companyLabel ?countryLabel
WHERE {
  VALUES ?country { wd:Q21 wd:Q26 wd:Q22 wd:Q25 wd:Q57695350 }
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145; wdt:P131 ?country.
  SERVICE wikibase:label {
bd:serviceParam wikibase:language "en".
}
}

ORDER BY (lcase(?companyLabel))

Run the query

However, that dropped my result set from 228 to 25 due to not all the companies having that property set.

Note: When trying to find out what values to use it is often quick and easy to run a simple query to ask Wikidata itself. To find out what all the values were for UK countries I wrote the following that asked for all countries that had an instance of value of country within the United Kingdom (Q3336843):

select ?country ?countryLabel
WHERE {
  ?country wdt:P31 wd:Q3336843 .
  SERVICE wikibase:label {
 bd:serviceParam wikibase:language "en".
  }
}

Run the query

Dates

In order to see what other information could easily be displayed for the companies, I looked at the list of properties on the London, Midland and Scottish Railway. I saw several dates listed so decided that would be my next area of investigation. There is an inception (P571) date that shows when something came into being, so I tried a query with that:

SELECT ?company ?companyLabel ?inception
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  ?company wdt:P571 ?inception 
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query.

This demonstrated two big issues with data. Firstly, the result set had dropped from 228 to 106 indicating that not all the company entries have the inception property set. The second was that only one, Scottish North Eastern Railway, had a full date (29th July 1856) specified, the rest only had a year and that was being displayed as 1st January for the year. Adding the OPTIONAL clause to the inception request returns the full data set with blanks where there is no inception date specified.

SELECT ?company ?companyLabel ?inception
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

Railway companies are not a straightforward case when it comes to a start date due to there being no one single start date. Each railway company required an Act of Parliament to officially enable it to be formed and grant permission to build the railway line(s). This raises the question: is it the date that Act was passed, the date the company was actually formed or the date that the company commenced operating their service that should be used for the start date? Here is a revised query that gets both the start time (P580) and end time(P582) of the company if they have been set:

SELECT ?company ?companyLabel ?inception ?startTime ?endTime
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  OPTIONAL { ?company wdt:P580 ?startTime. }
  OPTIONAL { ?company wdt:P582 ?endTime. }
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

Unfortunately, of the 228 results only one, London, Midland and Scottish Railway, has a startTime and endTime, and London and North Eastern Railway is the only with endTime. Based on these results it looks like that startTime and endTime are not generally used for railway companies. Looking through the data for Scottish North Eastern Railway did turn up a new source of end dates in the form of the dissolved, abolished or demolished (P576) property. Adding a search for this resulted in 9 companies with dissolved dates. 

SELECT ?company ?companyLabel ?inception ?startTime ?endTime ?dissolved
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  OPTIONAL { ?company wdt:P580 ?startTime. }
  OPTIONAL { ?company wdt:P582 ?endTime. }
  OPTIONAL { ?company wdt:P576 ?dissolved. }
  SERVICE wikibase:label {
bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

There is no logic in which companies have this property: they range from Scottish North Eastern Railway dissolving on 10th August 1866 to several that ended due to the formation of British Railways, the more recent British Rail ending on 1st January 2001, and the short lived National Express East Coast (1st January 2007 – 1st January 2009). However, once again, the dates are at times misleading as, in the case of National Express East Coast, it is only the year rather than full date in the inception and dissolved, abolished or demolishedproperties.

Some of the railway companies, such as Underground Electric Railways Company of London, have another source of dates and that is as part of the railway company value for their instance of. It is possible to extract the start and end dates if they are present by making use of nested conditional queries. In the line:

OPTIONAL {?company p:P31 [ pq:P580 ?companyStart]. }

the startTime property is extracted from the instance of property if it exists.

SELECT ?company ?companyLabel ?inception ?startTime ?endTime ?dissolved ?companyStart ?companyEnd
WHERE {
  ?company wdt:P31 wd:Q249556; wdt:P17 wd:Q145 .
  OPTIONAL { ?company wdt:P571 ?inception. }
  OPTIONAL { ?company wdt:P580 ?startTime. }
  OPTIONAL { ?company wdt:P582 ?endTime. }
  OPTIONAL { ?company wdt:P576 ?dissolved. }
  OPTIONAL { ?company p:P31 [ pq:P580 ?companyStart] . }
  OPTIONAL { ?company p:P31 [ pq:P582 ?companyEnd] . }
  SERVICE wikibase:label {
   bd:serviceParam wikibase:language "en".
  }
}
ORDER BY (lcase(?companyLabel))

Run the query

Another date that can be used to work out the start and end of the companies can be found hanging off the values of very useful pair of properties: replaced by (P1366) and replaces (P1365). This conveniently connects into the next part of my exploration that will follow in Part Two. Although, as with many railway related things, the exact time of arrival of part two cannot be confirmed.

[Header photograph taken by Andrew Sage]

Scraping Goes Off The Rails

This post was originally published on 10ml.com by Ian Watt

The art of scraping websites is one beset by difficulties, as I was reminded this week when re-testing a scraper that I built recently.

Schienenbruch

 

Railway performance

As part of my participation in 100 Days of Code I’ve been working on a few projects.

The first one that I tackled was a scraper to gather data from the PDF performance reports which are published on a four-weekly cycle Scotrail’s website. On the face of it this is a straightforward things to do.

  1. Find the link to the latest PDF on the performance page using the label “Download Monthly Performance Results”.
  2. Grab that PDF to archive it. (Scotrail don’t do that – they vanish each one and replace it with a new one every four weeks, so there is no archive).
  3. Use a service such as PDFTables which has an API, uploading the PDF and getting a CSV file in return (XSLX and XML versions are also available but less useful in this project).
  4. Parse the CSV file and extract a number of values, including headline figures, and four monthly measures for each of the 73 stations in Scotland.
  5. Store those values somewhere. I decided on clean monthly CSV output files as a failsafe, and a relational SQLite database as an additional, better solution.

Creating the scraper

So, I built the bones of the scraper in a few hours over the first couple of days of the year. I tested it on the then current PDF which was for period nine of 2016-17. That worked, first creating the clean CSV, then later adding the DB-write routines.

Boom – number 1

I then remembered that I had downloaded the previous period’s PDF. So I modified the code (to omit the downloading routine) and ran it to test the scraping routine on it – and it blew up my code. The format of the table structure in the PDF had changed with an extra blank link to the right of the first list of station names.

After creating a new version and publishing that, I sat back and waited for the publication of period 10 data. That was published in the middle of this week.

Boom – number 2

I re-ran the scraper to add that new PDF to my database – and guess what? It blew up the scraper again. What had happened? Scotrail had changed the structure of the filename of the PDF – from using dashes (as in ‘performance-display-p1617-09.pdf’) to underscores (‘performance_display_p1617_10.pdf’)

That change meant that my routine for sicking out the year and period, which is used to identify database records, broke. So I had to rewrite it. Not a major hassle – but it means that each new publication has necessitated a tweaking of the code. Hopefully in time the code will be flexible enough to accommodate minor deviations from what is expected without manual changes. We’ll see.

We’re ‘doing the wrong thing righter’ – Drucker

Of course, none of this should be necessary.

In a perfect world Scotrail would publish well structured, machine-readable open data for performance. I did email them on 26th November 2016, long before I started the scraper, both asking for past periods’ data and asking if they wanted assistance in creating Open Data. I got a customer service reply on 7th December saying that a manager would be in touch. To date (15 Jan 2017) I’ve had no further response.

The right thing

Abelio operates the Scotrail franchise under contract to the Scottish Government.

Should the terms of such contracts not put an obligation on the companies not only to put the monthly data into the public domain, but also that it be made available as good open data – and follow the Scottish Government’s on strategy for Open Data ? Extending the government’s open data obligation to those performing contracts for governments would be a welcome step forward for Scotland.