Mapping Memorials to Women in Aberdeen

This project, which was part of CTC20, grew from a WMUK / Archaeology Scotland joint project carried out by Scottish Graduate School of Arts & Humanities intern Roberta Leotta during lockdown 2020. More details about the background to the project can be found here.

It’s often said that there are some cities in Scotland (coughEdinburghcough) with more statues of animals than of women. In my own work transferring OpenPlaques data to Wikidata, I’ve observed that there are more entries for Charles Rennie Mackintosh than there are for women in Glasgow. So in that light, it’s somewhat refreshing to work on a project that celebrates all kinds of memorials to women in Scotland.

The Women of Scotland: Mapping Memorials project began in 2010 as a joint project between Glasgow Women’s Library and Women’s History Scotland. It’s similar in many ways to OpenPlaques, but using Wikidata could add an extra dimension – let’s increase the coverage of women’s history and culture on the Wikimedia projects by getting these memorials and the women they celebrate into Wikidata, use that to identify gaps in knowledge, and then work to fill those gaps.

Over the two days, here’s what we did:

Data collection

We scraped the initial list of data from the Mapping Memorials website manually, and created a shared worksheet based on a model that’s been used previously for other cities. (The manual process is slow and a bit fiddly, and is the one thing that I wouldn’t do again. We’re in contact with the site’s admin, so I’m hopeful we won’t need to repeat this step in future.)

Once we had this list, we could create a more automated process to gather the other pieces of information we needed to create new, good-quality Wikidata items, although some fields (description, for example) still needed a human eye.

Wikidata identifiers

We were using two main identifiers on Wikidata – P8048 (Women of Scotland memorial ID) and P8050 (Women of Scotland subject ID): the former for the entries for the memorials themselves, and the latter for the women they celebrate. Where the women didn’t have entries, we could create those, and then link them to the entries for the memorials.

Both identifiers use the last part of the URL for each entry on the Mapping Memorials site, so extracting them was fairly easy to do in Google Sheets. Once we had that info, it’s an easy enough step to bulk-create items using either Quickstatements or Wikibase CLI.
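
For anyone doing this outside a spreadsheet, here’s a minimal Python sketch of the same extraction – the example URL is invented to show the general shape, not a real entry:

    from urllib.parse import urlparse

    def wos_id(url):
        """Return the last path segment of a Mapping Memorials URL,
        which is the value P8048/P8050 expect as the identifier."""
        return urlparse(url).path.rstrip("/").split("/")[-1]

    # Hypothetical URL, just to show the shape of the extraction
    print(wos_id("https://womenofscotland.org.uk/memorials/example-memorial"))
    # -> "example-memorial"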

Creating items & avoiding duplicates

There’s a plug-in for Google Sheets called Wikipedia and Wikidata Tools which has some useful features for projects like this – WikidataQID, for looking up whether something already exists on Wikidata, and WikidataFacts, which tells you what that item is. The former only works well if you have an exact match; the latter is really useful for flagging anything which might lead to a disambiguation page, for example.

Ultimately we did end up with a few duplicates that needed to be merged, but this was pretty easily managed, and it really showed how useful it is to have local knowledge involved in local projects – there were a couple of sets of coordinates that were obviously wrong, but also some errors that wouldn’t have been spotted by someone unfamiliar with the area.
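
If you’d rather script that existence check than work in Sheets, Wikidata’s public wbsearchentities API does much the same job as WikidataQID. A minimal sketch (a human still needs to review the matches – near-identical labels are exactly where duplicates creep in):

    import requests

    API = "https://www.wikidata.org/w/api.php"

    def search_wikidata(label):
        """Search Wikidata for items matching a label.
        Returns (QID, description) pairs for human review."""
        params = {
            "action": "wbsearchentities",
            "search": label,
            "language": "en",
            "format": "json",
        }
        data = requests.get(API, params=params).json()
        return [(hit["id"], hit.get("description", ""))
                for hit in data.get("search", [])]

    for qid, desc in search_wikidata("Aberdeen Town House"):
        print(qid, desc)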

Coordinates and dates

I really like Quickstatements, but there are a few areas in which it’s fiddly, including coordinates and dates. I’m really interested in looking further into Wikibase CLI, as its process for dates (documented here) looks to be substantially easier in terms of data prep than Quickstatements. Many thanks to Tony for that work, as his expertise saved us a lot of time! He also used that tool to create items for those commemorated women who were missing from Wikidata, documented here.

As with dates, coordinates are entered into Quickstatements in a different format than the one you’d use manually inside Wikidata itself, hence the formatting you’ll see in column Q on the Data collection tab. Most of this we had to grab from Google Maps, which again is a bit fiddly.
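
To make that concrete, here’s a small Python sketch that produces the strings Quickstatements V1 expects – coordinates as @LAT/LON, and dates as a timestamp with a trailing precision flag (/9 for year, /10 for month, /11 for day). The sample values are invented:

    def qs_coordinate(lat, lon):
        """Format coordinates for Quickstatements V1, e.g. @57.14965/-2.0987."""
        return f"@{lat}/{lon}"

    def qs_date(year, month=0, day=0):
        """Format a date for Quickstatements V1. Precision is 9 (year),
        10 (month) or 11 (day); unknown parts stay as zeroes."""
        precision = 11 if day else 10 if month else 9
        return f"+{year:04d}-{month:02d}-{day:02d}T00:00:00Z/{precision}"

    print(qs_coordinate(57.14965, -2.0987))  # @57.14965/-2.0987
    print(qs_date(1856))                     # +1856-00-00T00:00:00Z/9
    print(qs_date(1856, 3, 14))              # +1856-03-14T00:00:00Z/11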

Quickstatements

Once we had a master list of QIDs for the memorials we were working with, we could use Quickstatements to bulk upload sets of statements to those items.

For example, matching the memorials to the women commemorated, using this format:

Screenshot of a spreadsheet showing QID for memorials and the women they commemorate

The Q numbers on the left are those of the memorials, P547 is “commemorates”, and the Q numbers on the right are those of the women celebrated. We were also able to add P8050 (Women of Scotland subject ID) to some women who already had entries on Wikidata, but no WoS ID.

Screenshot of a spreadsheet showing each memorial QID and its type

The Q number on the left again is the memorial, P31 is “instance of”, and the Q number on the right corresponds to a type of thing – a commemorative plaque, a garden, or a road, for example.
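
Put together, a fragment of the tab-separated V1 input behind those two screenshots would look roughly like this (all Q numbers here are invented for illustration; the last one stands in for the QID of a type such as “commemorative plaque”):

    Q90000001	P547	Q90000002
    Q90000001	P31	Q90000003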

Once you’ve got the info in this format, it’s just a case of copying & pasting into QS, clicking import, and then run. (Note – you do need to be an autoconfirmed user to use QS, which means that your account must be at least 4 days old and have more than 50 edits.) It’s relatively easy, and I was pleased that one of our relatively-new-to-Wikidata participants had the chance to make her first bulk uploads (description & commons category) using the tool over the weekend.

Photos

This project grew out of a desire to increase the coverage of Scottish heritage on Wikimedia Commons, so it was great to take some time on this. Mapping Memorials does have some images, but they’re not openly licensed, and others are missing altogether. After Wikimedia Commons, our next port of call was Geograph, where many images have been released under Wiki-compatible Creative Commons licenses. Using Geograph2Commons, images can easily be transferred over to Wikimedia Commons so that they can be used in any Wikimedia project. Geograph links to this feature from their site – click on “Find out how to reuse this image”, scroll down to “Wikipedia template for image page”, then click on the “geograph2commons” link. Really simple. Our group did some detective work for images, added them to Commons, and linked them manually to the Wikidata items.

This gave us a list of missing images… which is fine, but wouldn’t it be better to see them on a map?

Visualisation and filling the gaps

Thanks to Ian’s tutorial on how to create a custom WikiShootMe map, we were able to create a custom map showing which of the memorials we were working on had images, which didn’t, and where they were. That map is here, and it was great to see it slowly turn more green than red over the weekend as we found more images, or as volunteers headed out across Aberdeen between the two days to take the missing pictures.

A screenshot of a clickable map where people can upload photos of monuments

One of the small, but very satisfying, things you can do with these kinds of images is to integrate them into relevant Wikipedia articles. I added images from the project to the articles for Aberdeen Town House, Caroline Phillips, and Katherine Grainger. At the time of writing, around 2500 people have viewed those articles since I added the images.

Next steps

Over the course of the weekend we added 77 new memorials and 26 new women to Wikidata, as well as a whole host of new photos. These entries all have quite rich data – as complete as we could make them.

We were surprised to see some of the individuals who didn’t have a Wikipedia article – and of course, we can use the Wikidata query service to identify those gaps. The queries below could give us a great starting point for an editathon, or indeed, for any Wikipedia editor interested in writing women’s biographies.

  • Wikidata query for women with a Women of Scotland subject ID, a memorial in Aberdeen, but no enwiki article: https://w.wiki/YVH
  • Wikidata query for women with a Women of Scotland subject ID, but no enwiki article: https://w.wiki/YVG

Huge thanks to the team, and to Code the City for another great hack weekend!

Dr Sara Thomas
Scotland Programme Coordinator, Wikimedia UK

——————————————————————————

Header image: The Grave of Jessie Seymour Irvine by Ian Watt on Wikimedia Commons (CC BY-SA)

What can I do at Codethecity 11?

We’ve had loads of positive comments about CTC returning on 25-26 Nov with a Christmas-themed hack weekend.

Other than at our very first CTC weekend, we’ve tended to have a theme – health, culture, sport etc. Those themes have put some shape around the weekends’ activities and helped people to identify challenges to work on, projects to tackle and solutions to develop. Which is great.

Who should attend?

Anyone – despite our name, coding is only a small part of what we do. We’d be delighted to see you if:

  • You are a service user, or service provider, who has a problem with service delivery, or sees an opportunity to do things better,
  • You have an interest in service design,
  • You are someone who wants to do more with data but isn’t sure where to start,
  • You are a student (of any discipline),
  • You are someone who wants to improve their local area, or use digital skills to improve local services,
  • You are from the third sector or local government,
  • You are curious about learning new techniques and skills to use in your day job, and finally
  • Of course, if you are a developer, designer, UX expert, data wrangler, coder, or service designer.

What to expect

Many of our serial attendees know what to expect and how we work, so they are happy to go with the flow. If you haven’t been to a CTC event before, having a look at the links above will give you an idea of how things go.

Some others have asked ‘why no theme this time?’ – perhaps expecting a more traditional service-type theme.

Well… there is a theme: FUN.

If you come along, you can interpret that pretty much as you like!

Tell me more

Here are some examples of suggestions that we’ve heard. How much fun you consider them is a personal matter!

  • Bruce has been posting about how he would like to crowd-source a guide to fun things to do in Aberdeen.
  • Andrew has let it be known that he wants to work on his arduino-powered mini-theatre – and you would be welcome to work on it too.
  • Steve is threatening to bring along a robot-artist device for you to program to sketch rude pictures.
  • Ian has suggested that his classmates from RGU get stuck into a resurrected project to scrape FOI data.

There has also been mention of building a Raspberry Pi-powered Hadoop cluster, citizen-science-style home data-scraping kits, and a whole bunch more.

Maybe you want to create some software to write (and tweet?) its own cracker-style jokes. Or a platform to lobby councillors to provide better open data, or anything else [We said fun – Ed].

It is entirely up to you. But as you think up some ideas, you might want to consider two things:

  • Will you manage to pitch the idea at the opening session to others so that they will work with you on it? Building a project team is much more productive than working solo on something. To do the latter you could stay at home.
  • Would it impress Santa? If you want the funny old fellow to reward your efforts you need to impress him, spread a little happiness, or make something that improves lives.

And, finally…

If you don’t have an idea to bring, that is cool too. We’ll start the Saturday with some pitch sessions, so listen to others and join a team whose project inspires you.

Don’t delay, get a ticket now! Otherwise, you might be disappointed!

See you there.

Ian, Steve, Andrew, Bruce

CTC8 – Chatbots and AI – final presentations

After two days of intense activity and a whole heap of learning for all of us, Code The City #8, our Chatbots and AI weekend, came to an end at teatime on Sunday.

It couldn’t have happened without the generous support of our two sponsors, The Health Alliance and Fifth Ring, for which we are very grateful.

The weekend rounded off with presentations of each project, four of which we’ve captured on video (see below).

Each of the projects has its own Github repo. Links are included at the end of each project description. And, two days later, the projects are still being worked on!

Team: ALISS

Team ALISS worked on providing a chatbot interface onto healthcare and social data provided via the ALISS system.

ALISS bot project : Code the City 8 from Andrew Sage on Vimeo.

You can find Project ALISS’s code here on Github.

You can also watch this video of Douglas Maxwell from the Alliance being interviewed about the weekend (although at the time of writing the video is offline due to an AWS problem).

Team: City-consult

This team aimed to improve the quality of consultations by using intelligent chatbot interfaces to guide users through the process – and to provide challenge by prompting citizens to comment on previous consultees’ input.

City-Consult bot project : Code the City 8 from Andrew Sage on Vimeo.

You can find the code for City-Consult at this Github repo.

Team: NoBot

The concept for NoBot came from an initial idea for a bot that would make scheduling meetings easier. That spawned a new idea – what if the bot’s purpose was to make you have fewer meetings by challenging you at every turn? In the process, the bot’s personality as a sarcastic gatekeeper was born.

NoBot project : Code the City 8 from Andrew Sage on Vimeo.

The code for NoBot lives here on Github.

Team: Seymour

Sadly, there is no video of the wind-up talk for Seymour. In short, Seymour’s purpose is to help you keep your houseplants alive. (More details to come.)

You can find the code for Seymour at this repo on Github.

Team: Stuff Happens

We started this project with the aim of helping citizens find out what is happening among the myriad of local events which we each often seem to miss. Many local authorities have a What’s On calendar, sometimes with an RSS feed. None that we found had an API, unfortunately.

We identified that by pulling multiple RSS feeds into a single database and putting a bot in front of it – either scripted or with some AI applied – it should be possible to put potential audiences in touch with what is happening.
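
A minimal sketch of that aggregation step, assuming the feedparser library and invented feed URLs:

    import sqlite3
    import feedparser  # pip install feedparser

    # Hypothetical What's On feeds - each council's URL would differ
    FEEDS = [
        "https://example-council.gov.uk/whats-on/rss",
        "https://another-council.gov.uk/events.rss",
    ]

    conn = sqlite3.connect("events.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS events (
        link TEXT PRIMARY KEY, title TEXT, published TEXT, source TEXT)""")

    for url in FEEDS:
        for entry in feedparser.parse(url).entries:
            # INSERT OR IGNORE deduplicates on the link when feeds are re-polled
            conn.execute("INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
                         (entry.get("link"), entry.get("title"),
                          entry.get("published", ""), url))
    conn.commit()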

Further, by enriching the collected data – either manually or by applying machine logic – we could make it more easily navigable and intelligible.

Expect a full write-up of the challenges of this project, and what progress was made, on Ian’s blog.

There is no video, but you can find the project code here on Github.

Team: W[oa]nder

This project set out to solve the problem of checking whether a shop or business was still open for the day, through a Facebook bot interface – for use as you wander around, wondering about just that question, as it were.

W[oa]nder bot project : Code the City 8 from Andrew Sage on Vimeo.

You can find their code here.

And finally, we were joined on day two by Rory, who set out to assist team Stuff Happens by developing some of the AI around terminologies or categories. That became the:

Word Association Scorer

This is now on Github – not a bot, but a set of Python functions that score a given text against a set of categories.
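
As a rough illustration of the idea (this is not the code in the repo, and the category word lists here are invented), such a scorer can be as simple as counting keyword overlap:

    import re

    # Invented example categories - the real project derived richer word lists
    CATEGORIES = {
        "music":  {"gig", "band", "concert", "orchestra", "dj"},
        "sport":  {"match", "race", "league", "pitch", "marathon"},
        "family": {"kids", "children", "storytime", "crafts"},
    }

    def score_text(text):
        """Score a text against each category by the fraction of its
        words appearing in that category's keyword set."""
        words = re.findall(r"[a-z']+", text.lower())
        if not words:
            return {name: 0.0 for name in CATEGORIES}
        return {name: sum(w in keywords for w in words) / len(words)
                for name, keywords in CATEGORIES.items()}

    print(score_text("Brass band concert with crafts for the kids"))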

And Finally

We had loads of positive feedback from those who attended the weekend (both old hands and newbies) and from those who watched from afar, following progress on Twitter.

We’ve published the dates for CTC9 and subsequent workshops on our front page. We hope you can join us for more creative fun.

Ian, Andrew, Steve and Bruce
@codethecity

Scraping Goes Off The Rails

This post was originally published on 10ml.com by Ian Watt

The art of scraping websites is one beset by difficulties, as I was reminded this week when re-testing a scraper that I built recently.

Schienenbruch – German for “broken rail” (image)

Railway performance

As part of my participation in 100 Days of Code I’ve been working on a few projects.

The first one that I tackled was a scraper to gather data from the PDF performance reports which are published on a four-weekly cycle on Scotrail’s website. On the face of it this is a straightforward thing to do.

  1. Find the link to the latest PDF on the performance page using the label “Download Monthly Performance Results”.
  2. Grab that PDF to archive it. (Scotrail don’t do that – each report vanishes, replaced by a new one every four weeks, so there is no archive.)
  3. Use a service such as PDFTables, which has an API: upload the PDF and get a CSV file in return (XLSX and XML versions are also available, but less useful for this project).
  4. Parse the CSV file and extract a number of values, including headline figures, and four monthly measures for each of the 73 stations in Scotland.
  5. Store those values somewhere. I decided on clean monthly CSV output files as a failsafe, and a relational SQLite database as an additional, better solution.
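
Condensed into a sketch, that pipeline looks roughly like this – the page URL, link handling and PDFTables endpoint details here are illustrative assumptions, not the exact production code:

    import csv
    import io
    import sqlite3

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical page URL and API key - illustrative only
    PERF_PAGE = "https://www.scotrail.co.uk/performance"
    PDFTABLES_KEY = "YOUR_API_KEY"

    # 1. Find the latest PDF link via its label on the performance page
    soup = BeautifulSoup(requests.get(PERF_PAGE).text, "html.parser")
    pdf_url = soup.find("a", string="Download Monthly Performance Results")["href"]

    # 2. Download the PDF and keep a copy, since Scotrail keeps no archive
    #    (assumes the href is an absolute URL)
    pdf_bytes = requests.get(pdf_url).content
    with open(pdf_url.rsplit("/", 1)[-1], "wb") as f:
        f.write(pdf_bytes)

    # 3. Convert it to CSV via the PDFTables API (format=csv)
    resp = requests.post("https://pdftables.com/api",
                         params={"key": PDFTABLES_KEY, "format": "csv"},
                         files={"f": ("report.pdf", pdf_bytes)})
    resp.raise_for_status()

    # 4. Parse the CSV - the real code picks out the headline figures and
    #    four monthly measures for each of the 73 stations
    rows = list(csv.reader(io.StringIO(resp.text)))

    # 5. Store the rows; the real schema is relational and cleaner
    conn = sqlite3.connect("performance.db")
    conn.execute("CREATE TABLE IF NOT EXISTS raw (line TEXT)")
    conn.executemany("INSERT INTO raw VALUES (?)",
                     [(",".join(r),) for r in rows])
    conn.commit()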

Creating the scraper

So, I built the bones of the scraper in a few hours over the first couple of days of the year. I tested it on the then-current PDF, which was for period nine of 2016-17. That worked, first just creating the clean CSV, with the DB-write routines added later.

Boom – number 1

I then remembered that I had downloaded the previous period’s PDF. So I modified the code (to omit the downloading routine) and ran it to test the scraping routine on that file – and it blew up my code. The format of the table structure in the PDF had changed, with an extra blank column to the right of the first list of station names.

After creating a new version and publishing that, I sat back and waited for the publication of period 10 data. That was published in the middle of this week.

Boom – number 2

I re-ran the scraper to add that new PDF to my database – and guess what? It blew up the scraper again. What had happened? Scotrail had changed the structure of the PDF’s filename – from using dashes (as in ‘performance-display-p1617-09.pdf’) to underscores (‘performance_display_p1617_10.pdf’).

That change meant that my routine for picking out the year and period, which is used to identify database records, broke. So I had to rewrite it. Not a major hassle – but it means that each new publication has necessitated a tweaking of the code. Hopefully in time the code will be flexible enough to accommodate minor deviations from what is expected without manual changes. We’ll see.
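
One way to build in that flexibility is to accept dashes and underscores interchangeably when extracting the year and period. A sketch, using the two filename patterns Scotrail has used so far:

    import re

    def year_and_period(filename):
        """Extract (year, period) from a Scotrail performance PDF name,
        accepting either dashes or underscores as separators."""
        m = re.search(r"p(\d{2})(\d{2})[-_](\d{2})", filename)
        if not m:
            raise ValueError(f"Unrecognised filename: {filename}")
        return f"20{m.group(1)}-{m.group(2)}", int(m.group(3))

    print(year_and_period("performance-display-p1617-09.pdf"))  # ('2016-17', 9)
    print(year_and_period("performance_display_p1617_10.pdf"))  # ('2016-17', 10)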

We’re ‘doing the wrong thing righter’ – Drucker

Of course, none of this should be necessary.

In a perfect world Scotrail would publish well-structured, machine-readable open data for performance. I did email them on 26th November 2016, long before I started the scraper, both asking for past periods’ data and asking if they wanted assistance in creating Open Data. I got a customer service reply on 7th December saying that a manager would be in touch. To date (15 Jan 2017) I’ve had no further response.

The right thing

Abellio operates the Scotrail franchise under contract to the Scottish Government.

Should the terms of such contracts not put an obligation on the companies not only to put the monthly data into the public domain, but also to make it available as good open data – following the Scottish Government’s own strategy for Open Data? Extending the government’s open data obligation to those performing contracts for government would be a welcome step forward for Scotland.