You can lead a horse to water but Covid19 data must be scraped

Just over a month ago I wrote about the lack of Covid19 Open Data for Scotland.

I showed how the Italian health authorities were doing just what was needed in the most difficult of circumstances. I explained how, in the absence of an official publication of open data (in an openly licensed, neutral, machine-readable format), I’d taken it on myself on 15th March 2020 to gather and publish the data. My hope was that I’d have to do it for a week or two, then the Scottish Government would take over.

And here we are, five weeks and one day later, and I am still having to do it. Meantime, a growing list of websites and applications has developed to use the data, which is great but adds to the pressure.

So what’s happened in the intervening time and, more importantly, what hasn’t?

From manual to coded scraping

Originally I was gathering the data manually: going to the Scottish Government web page and retyping the data into CSVs. This is a terrible practice – open to errors and demanding double and triple checking before pushing to GitHub. While it looked like my publication was going to be short term that hardly mattered.

But as the weeks dragged by I had to concede that there was no rescue coming from Scottish Government any time soon. Initially it appeared that the daily data was going to be published openly via their statistics platform. That eventually morphed into an additional but different set of data (from National Records of Scotland, not HPS).

So, I resolved to build a scraper – a piece of code that will read the HTML of a webpage and extract the data from that. Sounds easy – but in practice it can be far from it. And when all is said and done it is the most brittle of solutions, as any small change can break the code.

Nested Span Tags on the SG web page

Given how poorly the page was structured (endless nested blank span tags being just one crime against HTML) I didn’t have a great deal of confidence that it could be kept working.

I built it and tested it daily but it wasn’t until 14th April that I was confident enough that it would work daily. Even then it wouldn’t take much to derail it. At that point it was 360 lines of code just to get a few dozen numeric values from a single page.

There is probably some law named after someone wiser than me that says that once you launch a piece of software it will be broken the very next day, and so it was the very next morning. The scraper relies on knowing the structure of a page – finding bulleted lists and tables, iterating through those structures looking for patterns to match, and grabbing the numbers.
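To give a flavour of that approach, here is a minimal sketch using only the Python standard library. It is not the actual 360-line scraper: the sample markup, the wording of the list items and the names `ListItemCollector`, `extract_counts`, `SAMPLE` and `PATTERNS` are all made up for illustration.

```python
import re
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the visible text of each <li>, ignoring nested tags such as <span>."""

    def __init__(self):
        super().__init__()
        self.items = []       # text of each completed list item
        self._in_item = False
        self._buffer = []     # text fragments of the current list item

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self._in_item = False
            # Join the fragments and normalise runs of whitespace.
            self.items.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)


def extract_counts(html, patterns):
    """Return {name: int} for every named pattern that matches a list item."""
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    results = {}
    for name, pattern in patterns.items():
        for item in collector.items:
            match = re.search(pattern, item)
            if match:
                results[name] = int(match.group(1))
                break
    return results


# Made-up markup imitating nested, empty span tags (not the real page).
SAMPLE = """
<ul>
  <li><span><span></span>A total of <span>1234</span> people have tested positive</span></li>
  <li><span>56 people are in intensive care</span></li>
</ul>
"""

# Hypothetical wording; the patterns must track whatever the page really says.
PATTERNS = {
    "positive": r"total of (\d+) people have tested positive",
    "intensive_care": r"(\d+) people are in intensive care",
}

print(extract_counts(SAMPLE, PATTERNS))
```

Every pattern is tied to the exact wording and structure of the page, which is why a change as small as moving one sentence out of the list makes a value silently disappear from the result.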

Since then the Scottish Government have changed the structure of the page as many as six times, including

  • making the final item in a bulleted list into a new paragraph in its own right,
  • removing a table completely,
  • and today changing the format of numbers in a table to include commas where none were used before.
Text all in bullets
Final list item now a para
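The comma change is a good example of how little it takes. A scraper that calls `int()` directly on a table cell fails on "1,234". A small, defensive helper along these lines would cope with both formats and still complain loudly about anything unexpected (the name `parse_count` is mine, purely for illustration):

```python
import re


def parse_count(text):
    """Convert a cell such as '1234' or '1,234' to an int.

    Raises ValueError for anything else, so that a new and unexpected
    format is noticed rather than silently misread.
    """
    cleaned = text.strip()
    # Accept plain digits, or digits grouped in threes with commas.
    if not re.fullmatch(r"\d+|\d{1,3}(,\d{3})+", cleaned):
        raise ValueError(f"unexpected number format: {text!r}")
    return int(cleaned.replace(",", ""))


print(parse_count("1234"))   # 1234
print(parse_count("1,234"))  # 1234
```

Failing loudly is a deliberate choice here: for data like this, a wrong number published quietly is worse than a scraper that stops.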

If you are interested I have archived the page contents for each day (minus the styling).

A breakthrough?

Last week it looked like we might have an easier solution on our hands: not only did they change the URL of the page with the data, they then without fanfare added a new XLSX spreadsheet with the daily data in it, updated each day. While not a CSV file, it appeared that it would be very useful.

So yesterday I started to code up a routine to

  • grab the XLSX file,
  • download that,
  • save it as a reference copy, then
  • figure out the worksheet names which have data, not charts,
  • go to those worksheets,
  • find the ranges with the data (ignoring comments in rows above the data, to the right of the data, below the data – see image below),
  • extract that data and write it back to plain CSV files as I was doing with my original scraper.
A screenshot of one worksheet showing one area of data (green) and three of non-data (red)
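The fiddly part of that list is finding the data range among the surrounding notes. The sketch below shows one way it could be done, under some loud assumptions: the worksheet has already been read into a list of rows (lists of cell values, with `None` for empty cells) by whatever spreadsheet library is used, the data table starts at a header row whose first cell is a known label (here a hypothetical "Date"), and the table ends at the first row with an empty first cell. The sheet contents are invented.

```python
import csv
import io


def extract_data_block(rows, header_label="Date"):
    """Return the header and data rows of the table whose header starts with header_label."""
    for start, row in enumerate(rows):
        if row and row[0] == header_label:
            break
    else:
        raise ValueError(f"no header row starting with {header_label!r}")

    # The table is as wide as the run of non-empty header cells,
    # which drops any comments sitting to the right of the data.
    header = rows[start]
    width = 0
    while width < len(header) and header[width] not in (None, ""):
        width += 1

    block = [list(header[:width])]
    for row in rows[start + 1:]:
        # The first row with an empty first cell ends the table,
        # so notes below the data are ignored.
        if not row or row[0] in (None, ""):
            break
        block.append(list(row[:width]))
    return block


def to_csv(block):
    """Serialise the rows to plain CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(block)
    return out.getvalue()


# Invented worksheet: notes above, to the right of, and below the data.
SHEET = [
    ["Table 1: made-up example", None, None, None, None],
    [None, None, None, None, None],
    ["Date", "Positive", "Deaths", None, "Note: provisional"],
    ["2020-04-19", 100, 5, None, None],
    ["2020-04-20", 110, 6, None, "revised"],
    [None, None, None, None, None],
    ["Source: hypothetical", None, None, None, None],
]

print(to_csv(extract_data_block(SHEET)))
```

Like the HTML scraper, this only works for as long as the layout of the worksheet stays put.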

Having tested the first part of it yesterday I re-ran it today and it broke. It turns out the URL at which the spreadsheet is published changes from day to day. I suspect that this is as a result of some sort of Content Management System.

All of which means I have to now do another scraper to identify each day’s URL before I can do any of the above.
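That extra step might look something like the sketch below: parse the page, collect the links, and pick the first one whose path ends in `.xlsx`. The page URL, the markup and the file path are placeholders I have invented, and it assumes the page links to exactly one spreadsheet. Fetching the page itself (for example with `urllib.request`) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_spreadsheet_url(page_html, page_url):
    """Return the absolute URL of the first link whose path ends in .xlsx."""
    collector = LinkCollector()
    collector.feed(page_html)
    collector.close()
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlparse(absolute).path.lower().endswith(".xlsx"):
            return absolute
    raise ValueError("no .xlsx link found on the page")


# Placeholder page and link, not the real Scottish Government URLs.
PAGE_URL = "https://example.org/covid/daily-data/"
PAGE_HTML = (
    '<p><a href="/about">About</a> '
    '<a href="/binaries/2020/04/abc123/trends.xlsx?download=true">Daily data</a></p>'
)

print(find_spreadsheet_url(PAGE_HTML, PAGE_URL))
```

Checking the path rather than the whole URL means a query string after the file name does not hide the link.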

Why are we in this position?

The current position defies logic. There are so many factors that should have meant that Scottish Government would have this sorted out by now.

  1. I’d identified the need for plain CSV publishing previously, and very publicly, giving good examples.
  2. I’d had an email from a contact in SG mentioning two of my CSV files (in the context of the forthcoming NRS data publication).
  3. I work in the Civic Society group of the Open Government Partnership, and I am the lead for open data.
  4. So people know where to find me and interact with me.
  5. I have blogged extensively about what we need – and have emailed contacts at SG.
  6. As part of the planning for Scottish Open Data Unconference I set up a Slack group to which SG contacts were invited – and I believe some signed up.
  7. So there are forums through which, if there was any uncertainty, anyone at SG could ask “what does the data community want?” or “we’re thinking about doing ‘x’, would that work for you?” But there has been no such approach.

Meantime, I’ve spent many tens of hours for no financial reward doing what Peter Drucker called ‘doing the wrong thing righter’, i.e. allowing the SG to continue to publish data wrongly on the basis that the effort is transferred onto the community to create work-arounds.

After five weeks of doing this daily, I’m absolutely fed up of it. I’m going to formally raise it through the Open Government Partnership with a view to getting the right data, in the right format, published daily.

 

Header picture cropped from a photo by Nathan Dumlao on Unsplash