Earlier this year Code The City held an Editathon with Wikimedia UK. The subject was the history of Aberdeen Cinemas. We ended up with 16 people all working together to create new articles, update existing ones, capture new images for Wiki Commons, and generate or enhance WikiData items. This was a follow up to previous sessions that Dr Sara Thomas of WikiMedia UK led for us in the city, mainly for information professionals.
This has led to significant interest from cultural bodies in the city in using the suite of WikiMedia platforms and tools to improve access to their collections in Aberdeen. We expect to do quite a bit more of this with them in 2020.
Two weeks ago I attended a Train the Trainer 3-day workshop in Glasgow for Wikimedia UK to become a trainer for them in Scotland. That will see me training professionals and volunteers in how to use Wikipedia, Wiki Commons and Wikidata in particular.
In this blog post I explain why you might want to use some of the fancy features of WikiData query service, show you how to do that, using on my adaptation of others’ shared examples, and encourage you to experiment for yourself.
Wikidata uses a Linked Open Data format to store data. While I have added quite a number of items to Wikidata I’ve not had a chance to really study how to use SPARQL (the query language behind the scenes) to to execute queries against the data. This is done in the Wikidata Query service. This is a key skill to using some of the more advanced features. Without the means to extract data there is little point in stuffing data into it. In fact WikiData allows us to do some very fancy things with the data which we retrieve.
So, I decided this week to start working on that. This describes the first steps I have been doing. It should also provide a simple introduction to any else wanting to dip their toe in the SPARQL waters.
Where to start?
This 16-minute tutorial on Youtube is a great place to begin; it is where I started. It describes how to create a simple query and build it up to something more powerful. I copied what it did then adapted that to build a query that I wanted. I suggest that you watch it first to understand what each line of SPARQL is doing.
Here are the steps, mainly frown from and adapted from that tutorial.
In the query above we use the Educated at statement (P69) and the identifier for Aberdeen University (Q270532 ) in combination with the Sex or gender statement (P21) with the Female identifier (Q6581072).
You can run this for yourself here using the white-on-blue arrow. I’ve used one of the great things you can do with Wikidata which is to share this query using the link symbol on the left of the page just above the arrow:
Changing the parameters of the query means that we can check males (Q6581097) against females (Q6581072). Or you can compare different universities. To do this go to the Wikidata homepage and search for the name of the institution. The query will return a page with the Q code in the title. Thus we can compare various universities by amending the Q code in the query above: University of Aberdeen (Q270532) with University of Glasgow (Q192775) or Edinburgh University (Q160302).
Running these queries we can see that the number of both male and female graduates with entries on WikiData of Aberdeen University is significantly smaller than from either Glasgow or Edinburgh, and we can see that the proportion of females of all graduates for each university is smallest for Aberdeen.
|University||Male Grads||Female Grads||% Female|
The results of these queries should themselves cause us to reflect on the relatively smaller number of results of either gender from Aberdeen compared to the other universities; and also the smaller proportion of women. It suggests that there is some work to do to ensure that we get better representation of both genders in Wikidata.
Enhancing our query
Now that we have a basic query we can retrieve additional bits of data for the subjects of the query including place of birth, date of birth and images.
These are represented by P19 (birth place), P560 (date of birth) and P18 (image). As we see in the example below, when we query these we follow them with a name we assign to the item returned (e.g. ?person wdt:P19 ?birthPlace ) and we add the name we give it, in this case ?birthPlace to the Select statement on the first line of the query, ensuring that it will feature in the data returned in the table or other format output.
You will note that the above example now uses the ?birthPlace to create a new query to get the co-ordinates (P625) of that place which we assign to coordinates:
> ?birthPlace wdt:P625 ?coordinates
and we include coordinates in the first line of things we will display.
Advantages of extra data elements
By having birthplace coordinates we can plot the results in a map which is easily done using the tools built into the wikidata query service.
Run the query (white arrow on blue on the left menu) and observe the table that was returned. You can see that the first line of the Select statement formed the columns of the table.
Note that instead of 125 results as we had in the simple query, we only get 20 results. My understanding of this is that we are specifying records which must have a place of birth, an image etc. Where these do not exist then they records for that person are not returned. This in itself shows that there is a piece of work to do to identify where records in the batch of 125 lack these elements and fix them.
In fact you could say that there is a whole cycle of adding data, querying it, spotting anomalies, fixing those and re-querying which leads to substantial enrichment of the data.
Now click on the dropdown by the eye symbol, on the left immediately above the results, and choose the map option. The tool will generate a map with a pin in the location of each place of birth. You can pan and zoom to the UK and click on each pin. Try it. To get back to the query, click on the arrow, top-right.
Now click on the eye symbol to show other options, and choose Timeline.
As we can see below, the Wikidata query service will construct a rudimentary timeline with relatively little effort. This is one of its great features. So far we have the same 20 complete records – and the cards or tiles are titled by the place of birth but we can change that.
Enhancing the timeline on Histropedia
To improve on our timeline we can construct a better query using the Wikidata Query Service then paste it into the Histropedia service to run it. Our first version which makes small improvements on our previous timeline produces the results below. This labels by the person’s name, and colour codes the individual records by place of birth label. To see the code, click the gear wheel at the top right of the screen. Note we still only retrieve 20 results.
We can substantially enhance this query as we have done on the following version. This makes certain items optional, gets the country of birth and colour-codes by that, and ranks the records by prominence (with the most prominent at the front). If I understand it correctly by using optional elements it also retrieves 76 records, much more than previously.
I would encourage you to watch the tutorial video at the start of this post, then try to hack some of the queries to which I provided links. For example how many female graduates of the Robert Gordon University would each query generate? How would you find the Q code of that institution? Have fun with it!