2 – 3 October SODU 2021 The return of Scottish Open Data Unconference
December (dateTBC) #CTC24
Note – all of our planned events are currently being run fully online. Everything we do is free to attend. While tickets for events have a suggested donation of £5 to help with charity running costs, this should not be a barrier to anyone attending.
To get advanced notice of our events, and make sure of a place, why not sign-up for our bi-monthly, spam-free, mailing list?
CTC23 – The OD Bods
Blog post coming soon. Just fixing the URL for now.
At Code The City 22 we started Meet Your Next MSP, a project to list hustings for the Scottish Parliamentary election. The team comprised of James Baster and Johnny Mckenzie.
James Baster had prior experience working on a similar project for the UK general election in 2015, where they listed over 1000 events in a project that was cited by many charities and campaigns. This showed him that there was interest in such a project. It also showed that many people don’t even know what a hustings areis, so the project deliberately tries to be accessible in order to introduce others to these type of events.
Thanks to Johnny who wrangled data from National Records of Scotland to make a dataset that mapped postcodes to areas; vital for powering the postcode lookup box on the home page of the site.
Storing data in a git repository is an interesting approach; it has some drawbacks but some advantages (moderation by pull requests and a full history for free). Crucially, it’s not a new idea and is something many people already do so it will be interesting to learn more about this approach.
Since the hackathon, the website has been tweaked, the Google form replaced with a better custom form and the website is now live!
We will run this over the next month and see how this goes.
And after the general election, the lessons won’t be lost. What we are essentially building are tools that let a community of people list events of interest together, with the data stored in a git repository. We think this tool could be applicable to many different situations.
The public have access to two free, easy accessible waste recycling and disposal methods. The first is “kerbside collection” where a bin lorry will drive close to almost every abode in the UK and crews will (in a variety of different ways) empty the various bins, receptacles, boxes and bags. The second is access to recycling centres, officially named Household Waste Recycling Centres (HWRCs) but more commonly known as the tip or the dump. These HWRCs are owned by councils or local authorities and the information about these is available on local government websites.
However, knowledge about this second option: the tips, the dumps, the HWRCs, is limited. One of the reasons for that is poor standardisation. Council A will label, map, or describe a centre one way; Council B will do it in a different way. There is a lot of perceived knowledge – “well everybody just looks at their council’s website, and everybody knows you can only use your council’s centres”. This is why at CTC22 we wanted to get all the data about HWRCs into a standard set format, and release it into the open for communities to keep it present and up to date. Then we’d use that data to produce a modern UI so that residents can actually get the information they require:
Which tips they can use?
When these dumps are open?
What can they take to these HWRCs?
“I have item x – where can I dispose of it?”
There were six main tasks to complete:
Get together a list of all the HWRCs in the UK
Build an open data community page to be the centre point
Bulk upload the HWRCs’ data to WikiData
Manually enter the HWRCs into OpenStreetMap
Create a website to show all the data
Create a connection with OpenStreetMap so that users could use the website to update OSM.
What we built / did
All HWRCs are regulated by a nation’s environmental regulator:
For Scotland it is SEPA
For Northern Ireland it is NIEA
For Wales it is NRW
For England it is EA
A list of over 1,000 centres was collated from these four agencies. The data was of variable quality and inconsistent.
This information was added to a wiki page on Open Street Map – Household waste in the United Kingdom, along with some definitions to help the community navigate the overly complex nature of the waste industry.
From that the lists for Scotland, Wales and England were bulk uploaded to WikiData. The was achieved by processing the data in Jupiter Notebooks, from which formatted data was exported to be bulk uploaded via the Quick Statements tool. The NIEA dataset did not include geolocation information so future investigation will need to be done to add these before these too can be uploaded. A Wikidata query has been created to show progress on a map. At the time of writing 922 HWRCs are now in Wikidata.
Then the never-ending task of locating, updating, and committing the changes of each of the OSM locations was started.
To represent this data the team built a front-end UI with .NET Core and Leaflet.js that used Overpass Turbo to query OSM. Local Authority geolocation polygons were added to highlight the sites that a member of the public could access. By further querying the accepted waste streams the website is able to indicate which of those centres they can visit can accept the items they are wanting to recycle.
However, the tool is only as good as the data so to close the loop we added a “suggest a change” button that allowed users to post a note on that location on OpenStreetMap so the wider community can update that data.
We named the website OpenWasteMap and released it into the wild.
The github repo from CTC22 is open and available to access.
What we will do next (or would do with more time/ funding etc)
The next task is to get all the data up-to-date and to keep it up to date; we are confident that we can do this because of the wonderful open data community. It would also be great if we could improve the current interface on the frontend for users to edit existing waste sites. Adding a single note to a map when suggesting a change could be replaced with an edit form with a list of fields we would like to see populated for HWRCs. Existing examples of excellent editing interfaces in the wild include healthsites.io which provides an element of gamification and completionism with a progress bar with how much data is populated for a particular location.
While working through the council websites it has become an issue that there is no standard set of terms for household items, and the list is not machine friendly. For example, a household fridge can be called:
Large Domestic Electrical Appliance
A “fun” next task would be to come up with a taxonomy of terms that allows easier classification and understanding for both the user and the machine. Part of this would include matching “human readable” names to relevant OpenStreetMap tags. For example “glass” as an OSM tag would be “recycling:glass”
There are other waste sites that the public can used called Bring Banks / Recycling Points that are not run by Local Authorities that are more informal locations for recycling – these too should be added but there needs to be some consideration on how this information is maintained as their number could be tenfold that of HWRCs.
As we look into the future we must also anticipate the volume of data we may be able to get out of sources like OpenStreetMap and WikiData once well populated by the community. Starting out with a response time of mere milliseconds when querying a dozen points you created in a hackathon is a great start; but as a project grows the data size can spiral into megabytes and response times into seconds. With around 1,000 recycling centres in the UK and thousands more of the aforementioned Bring Banks this could be a lot of data to handle and serve up to the public in a presentable manner.
Saturday 6th March, 2021 was World Open Data Day. To mark this international event CTC ran a Wikidata Taster session. The objectives were to introduce attendees to Wikidata and how it works, and give them a few hours to familiarise themselves with how to add items, link items, and add images.
The theme of the session (to give it some structure and focus) was the Industrial Heritage of Aberdeen. More specifically the bygone industries of Aberdeen and, more specific still, the many Iron Foundries that once existed. I chose the specific topic as it is still relatively easy to spot the products of the industry on streets and pavements as we walk around the city, photograph those and add them to Wiki Commons, as I have been doing.
We had thirteen people book and eight turn up. After I gave a short presentation on how Wikidata operates we divided ourselves into three groups in breakout rooms. This was all on Zoom, of course, while we were still under lockdown.
The teams of attendees chose a foundry each: Barry, Henry & Cook Limited; Blaikie Brothers, and William McKinnon & Company Ltd. I’d already created an entry for John Duffus and Company in preparation for the event and to use as a model.
I’d also created a Google Sheet with a tab for each of the other thirteen foundries I’d identified (including those selected by the groups). I’d also spent quite a while trying to figure out how to access and search the old business and Post Office Directories for the city which had been digitised for 1824 to 1941. I eventually I built myself a tool, which I shared with the teams, which generated an URL for a specific search term for a certain directory. They used this, as well as other sources, to identify key dates, addresses and name changes of businesses.
By the end of the session our teams had created items for
They had also created items for foundry buildings – linked to Canmore etc, as well as founders. We enhanced these with places of their burial, portraits and images of gravestones. I took further photos which I uploaded to Commons and linked the following Monday. I created two Wikidata queries to show the businesses added, and the founders who created the businesses.
The statistics for the 3 hour session (although some worked into the afternoon and even the next day) are impressive. You can see more detail on the event dashboard.
We received positive feedback from the attendees who have been able to take their first steps towards using Wikidata as a public linked open data for heritage items.
I hope that the attendees will keep working on the iron founders until we have all of these represented on Wikidata. Next we can tackle shipbuilders and the granite industry!
Once data was found, the next stage was finding out the licensing rights and whether or not the data could be downloaded and legitimately reused. The data found on Canmore’s website indicated that it was provided under an Open Government Licence hence could be uploaded to Wikidata. This is the data source which was then used on day two of the project.
A training session on how to use Wikidata was also required on day one to allow the team to understand how to upload the data to Wikidata and how the identifiers etc worked.
Day two – cleaning and uploaded the data to Wikidata.
Deciding on the identifiers to use in Wikidata was the starting point, then the data had to be cleaned and manipulated. This involved translating Easting and Northings coordinates to latitude and longitude, matching the ship types between the Canmore file and Wikidata, extracting the reference to the ship from Canmore’s URL and general overall common sense review of the data. To aid with this work a Python script was created. It produced a tab separated file with the necessary statements to upload to Wikidata via Quickstatements.
The team members were new to Wikidata and were unable to create batch uploads as they didn’t have 4 days since creating their accounts and 50 manual edits to their credit – a safeguard to stop new accounts creating scripts to do damage.
We asked Ian from Code The City to assist, as he has a long editing history. He continues this blog post.
I downloaded the output.txt file and checked if it could be uploaded straight to Quickstatements. It looked like there were minor problems with the text encoding of strings. So I imported the file into Google Docs. There, I ensured that the Label, Description and Canmore links were surrounded in double quotation marks. A quick find and replace did this.
I tested an upload of five or six entries and these all ran smoothly. I then did several hundred. That turned up some errors. I spotted loads of ships with the label “unknown” and every wreck had the same description. I returned to the Python script and tweaked it to concatenate the word “Unknown” with a Canmore ID. This fixed the problem. I also had to create a checking method of seeing if our ship had already been uploaded. I did this by downloading all the matching Canmore IDs for successfully uploaded ships. I then filtered these out before re-creating the output.txt file.
I then generated the bulk of the 24,185 to be uploaded. I noticed a fairly high error rate. This was due to a similar issue to the Unknown-named ships. The output.txt script was trying to upload multiple ships with the same names (e.g. over 50 ships with the name Hope). I solved this in the same manner as with Unknown-named wrecks, concatenating ship names with “Canmore nnnnnn.”
I prepared this even as the bulk upload was running. Filtering out the recently uploaded ships and re-running the creation of the Output.txt file meant that within a few minutes I was able to have the corrective upload ready. Running this a final time resulted in all shipwrecks being added to WIkidata, albeit with some issues to fix. This had taken about a day to run, refine and rerun.
The following day I set out to refine the quality of the data. The names of shipwrecks had been left in sentence case: an initial capital and everything else in lower case. I downloaded a CSV of records we’d created, and changed the Labels to Proper Case. I also took the opportunity to amend the descriptions to reflect the provenance of the records from Canmore in the description of each. I set one browser the task of changing Labels, and another the change to descriptions. This was 24,185 changes each – and took many hours to run. I noticed several hundred failed updates – which appear to just be “The save has failed” messages. I checked those and reran them. Having no means of exporting errors from Quickstatements (that I know of) makes fixing errors more difficult than it should be.
Finally I noticed by chance that a good number of records (estimated at 400) are not shipwrecks at all but wrecks of aircraft. Most, if not all, are prefixed “A/C’ in the label.
I created a batch to remove statements for ships and shipwrecks and to add statements saying that these are instances of crash sites. I also scripted the change to descriptions identifying these as aircraft wrecks rather than ship wrecks.
I’ve noted the following things that the team could do to enhanced and refine the data further:
Check what other data is available by download or scraping from Canmore (such as date of sinking, depth, dimensions) and add that to the wikidata records
Attempt to reconcile data uploaded from Aberdeen built ships at CTC19 with these wrecks – there may be quite a few to be merged
Finally, in the process of working on the cleaning of this uploaded data I noticed the the data model on Wikidata to support this is not well structured.
This was what I sketched out as I attempted to understand it.
Before I changed the aircraft wrecks to “crash site” I merged the two items which works with the queries above. But this needs more work.
Should the remains of a crashed aircraft be something other than a crash site? The latter could be cleared of debris and still be the crash site. The term Shipwreck more clearly describes where a wreck is whether buried, on land, or beneath the sea.
Why is a shipwreck a facet of a ship, but a crash site is a subclass of aircraft.
And Disaster Remains seems like the wrong term for what might be a non-disastrous event (say if a ship from the middle ages gently settled into mud over the centuries and was forgotten about – and certainly isn’t a subclass of Conservation Status, anyway.
I’d be happy to work with anyone else on better working out an ontology for this.