Ten Years After

“Hear me calling, hear me calling loud, 
If you don’t come soon, I’ll be wearing a shroud.” – Ten Years After (1969)

Introduction

Today marks the tenth anniversary of my involvement with Open Data in Scotland. As I wrote here, back in 2009-2010 I’d been following the work that Chris Taggart and others were doing with open data, and was inspired by them to  create what I now believe to have been the first open data published in the public sector in Scotland.

This piece is a reflection of my own views. These views may be the same as those held by colleagues at Code The City or indeed on the civic side of the Open Government Partnership. I’ve not specifically asked other individuals in either group.

While my involvement in, and championing of, open data in Scotland is now a decade long, my enthusiasm for the subject, and for the social and economic benefits it can deliver, is undiminished by my leaving the public sector in 2017 after thirty-four years. In fact the opposite is true: the more I am involved in the OD movement, and study what is being achieved beyond Scotland’s narrow borders, the more I am convinced that we are a country intent on squandering a rich opportunity, regardless of our politicians’ public pronouncements.

But the journey has not been easy, primarily due to a lack of direction from Scottish Government and little commitment, resource or engagement at all levels of public service. A friend who reviewed this blog post suggested that I should replace the picture of a birthday cake (above) with one of a naked human back bearing bleeding scars from our battles. He’s right – it is STILL a battle ten years on.

It is not as if the position in Scotland is getting better. We are moving at a glacial pace. The gap between Scotland and other countries in this regard is widening. I gave a talk earlier this year in which I showed assessments of Scotland, Romania and Kenya’s performance in Open Government (source: https://www.opengovpartnership.org/campaigns/global-report/ Vol 2) and asked the audience to identify which was Scotland.

Extracts from Vol 2 of the Open Government Partnership global report (https://www.opengovpartnership.org/campaigns/global-report/)

Question: Which is Scotland? (Answer)

Economic opportunities

In February 2020 the European Data Portal published a report – The Economic Impact of Open Data – which sets out a clear economic case for open data. That paper looks at 15 previous studies between 1999 and 2020 which have examined the market size of open data at national and international levels, measured in terms of the GDP of each study’s geographical area.

Taking the average and median values from those reports (1.33% and 1.19% respectively) and an estimated GDP for Scotland (2018) of £170.4bn we can see that the missed opportunity for Scotland is of the order of £2.027bn to £2.266bn per annum. What is the actual value of the local market created by Scottish-created open data? If pushed for a figure I would estimate that it is currently worth a few hundred thousand pounds per annum, and no more. Quite a gap!
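The arithmetic behind that range can be checked in a few lines. This is only a sketch: the percentages and the GDP figure are the ones quoted above, while the function and variable names are made up for illustration.

```python
# Rough check of the estimate above. All inputs are the figures quoted in
# the text; the names used here are illustrative only.
SCOTLAND_GDP_2018_BN = 170.4   # estimated GDP, in £bn
MEAN_SHARE = 0.0133            # average market size across the studies (1.33%)
MEDIAN_SHARE = 0.0119          # median market size across the studies (1.19%)


def potential_market_bn(gdp_bn: float, share: float) -> float:
    """Return the implied open data market size in £bn."""
    return gdp_bn * share


low = potential_market_bn(SCOTLAND_GDP_2018_BN, MEDIAN_SHARE)
high = potential_market_bn(SCOTLAND_GDP_2018_BN, MEAN_SHARE)
print(f"Roughly £{low:.2f}bn to £{high:.2f}bn per annum")
```

Set against an actual local market of a few hundred thousand pounds, either end of that range leaves a gap of several orders of magnitude.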

Meantime we have the usual suspects among the consultants whispering sweetly in the ears of ministers, senior civil servants and council bosses that we should be monetising data, creating markets, selling it. There will be no mention, I suspect, of the heavily-subsidised, private sector led, yet failed Copenhagen Data Exchange. (Maybe they can make a few bob back selling the domain name!)

You can buy the failed CityDataExchange.com for just $5195

While this commercial approach to data may plug small gaps in annual funding for Scotland, and line the pockets of some big companies in the process, it won’t deliver the financial benefits at a national level of anything like the figures suggested by that EU Data Portal report but it will, in the process, actively hamper innovation and inhibit societal benefits.

I hear lots of institutions saying “we need to sell data” or “we need to sell access rights to these photos” or similar. Yet, in so many cases, the cost of operating the mechanisms of control (the staffing, administration, payment processing etc.) far outstrips any generated income. When I challenged ex-colleagues in local government about this behaviour their response was “but our managers want to see an income line”, to which we could add “no matter how much it is costing us.” And this tweet from The Ferret on Tuesday of this week is another excellent example of this!

I have also heard lots of political proclamations of “open and transparent” government in Scotland since 2014. Yet most of the evidence points in exactly the opposite direction. Don’t forget, when Covid 19 struck, Scotland’s government was reportedly the only political administration apart from Bolsonaro’s far right one in Brazil to use the opportunity to limit Freedom of Information.

Openness, really?

It is clear that there is little or no commitment to open data in any meaningful way at a Scottish Government level, in local authorities, or among national agencies. This is not to say that there aren’t civil servants who are doing their best, often fighting against the actions of politicians or senior administrators. Public declarations are rarely matched by delivery of anything of substance, and conversations with people in those agencies (of which I have had many) paint a grim picture of political masters saying one thing and doing another, of senior management not backing up public statements of intent with the necessary resource commitment and, on more than one occasion, suggestions of bad actors actually going against what is official policy.

I mention below that I joined the Open Government Partnership late in 2019. Initially I was enthusiastic about what we might achieve. While there are civil servants working dedicatedly on open government who want to make it work, I am unconvinced about political commitment to it. We really need to get some positive and practical demonstration that Scottish Government are behind us – otherwise I and the other civil society representatives are just assisting in an open-washing exercise.

In my view (and that of others) the press in Scotland does not provide adequate scrutiny and challenge of government. We have a remarkably ineffective political opposition. We also have a network of agencies and quangos which are reliant on the Scottish Government for funding who are unwilling to push back. All of this gives the political side a free pass to spout encouraging words of “open and transparent” yet do the minimum at all times.

We may have an existing Open Data Strategy for Scotland (2015) stating that Scotland’s data is “open by default”, yet my 2019 calculation was that over 95% of the data that could and should be open was still locked up. And there is little movement on fixing that.

We have many examples of agencies doing one thing and saying another, such as  Scottish Enterprise extolling the virtues of  Open Data yet producing none. Its one API has been broken for many months, I am told.

My good friends at The Data Lab do amazing work on funding MSc and PhD places, and providing funding for industrial research in the application of data science. Their mission is “to help Scotland maximise value from data …” yet they currently offer no guidance on open data, no targeted programme of support, no championing of open data at all, despite the widely-accepted economic advantages which it can deliver. There is the potential for The Data Lab to lead on how Scotland makes the most of open data and to guide government thinking on this!

All of this is not to pick on specific organisations, or hard working and dedicated employees within them. But it does highlight systemic failures in Scotland from the top of government downwards.

Fixing this is an enormous task: one which can only be done by the development of a fresh strategy for open data in Scotland, which is mandated for all public sector bodies, is funded as an investment (recognising the economic potential), and which is rigorously monitored and enforced.

I could go on…. but let’s look at this year’s survey.

(skip to summary)

Another year with what to show for it?

In February 2019 I conducted a survey of the state of open data in Scotland. It didn’t paint an encouraging picture. The data behind that survey has been preserved here. A year on, I started thinking about repeating the review.

In the intervening year I’d been involved in quite a bit of activity around open data. I had

  • joined the civic side of the Open Government group for Scotland and was asked to lead for the next iteration of the plan on Commitment Three (sharing information and data),
  • joined the steering group of Stirling University’s research project, Data Commons Scotland,
  • trained as a trainer for Wikimedia UK, delivering training in Wikidata, Wikipedia and Wiki Commons, and running multiple sessions for Code The City with a focus on Wikidata,
  • created an open Slack Group  for the open data community in Scotland to engage with one another,
  • created an Open Data Scotland twitter account which has gained almost 500 followers, and
  • initiated the first Scottish Open Data Unconference (SODU) 2020 which had been scheduled to take place as a physical event in March this year. That has now been reconfigured as an online unconference which will happen on 5th and 6th September 2020.

In restarting this year’s review of open data publishing in Scotland my aims were to see what had changed in the intervening 12 months and to increase the coverage of the survey: going broader and deeper and developing an even more accurate picture. That work spilled into March at which point Covid-19 struck. During lockdown I was distracted by various pieces of work. It wasn’t until August, and with a growing sense of the imminence of this 10-year anniversary, that I was galvanised to finish that review.

I am conscious that the methodology employed here is not the cleverest – one person counting only the numbers of datasets produced.  This is something I return to later.

The picture in 2020

I broke the review down into sectoral groupings to make it more manageable to conduct. By sticking to that I hope to make this overview more readable. The updated GitHub repo in which I noted my findings is available publicly, and I encourage anyone who spots errors or omissions to make a pull request to correct them. Each heading below has a link to the GitHub page for the research.

Overall there is little significant positive change. This is one factor which gives rise to concerns about government’s commitment to openness generally and open data specifically; and to a growing cynicism in the civic community about where we go from here.

Local Government

(Source data here)

I reviewed this area in February 2020 and rechecked it in August.  Sadly there has been no significant change in the publication of open data by local government in the eighteen months since I last reviewed this. More than a third of councils (13 out of a total of 32) still make no open data provision.

While the big gain is that Renfrewshire Council have launched a new data portal with over fifty datasets, most councils have shown little or no change.

Sadly the Highland Council portal, procured as part of the Scottish Cities Alliance’s Data Cluster programme at a cost of tens of thousands of pounds, has vanished. I don’t think it ever saw a dataset being added to it. Searching Highland Council’s website for open data finds nothing.

While big numbers of data sets don’t mean much by themselves, the City of Edinburgh Council has a mighty 236 datasets. Brilliant! BUT … none of them are remotely current. The last update to any of them was September 2019. Over 90% of them haven’t been updated since 2016 or earlier.

Similarly Glasgow, which has 95 datasets listed, has a portal which is repeatedly offline for days at a time. A portal which won’t load is useless.

Dundee, Perth and Stirling continue to do well. Their offerings are growing and they demonstrate commitment to the long-haul.

Aberdeen launched a portal, more than three years in the planning, populated it with 16 datasets and immediately let their open data officer leave at the end of a short-term contract. Some of their datasets are interesting and useful – but there was no consultation with the local data community about what they would find useful, or deliver benefits locally; all despite multiple invitations from me to interact with that community at the local data meet-ups which I was running in the city.

It was hoped that the programme under the Scottish Cities Alliance would yield uniform datasets, prioritised across all seven Scottish cities, but sadly there is no sign of that happening. So what you find on all portals or platforms is pretty much a pot-luck draw.

Where common standards exist – such as the 360 Giving standard for the publication of support for charities – organisations should be universally adopting these. Yet this is only used by two of 32 authorities, all of whom have grant-making services. Surely, during a pandemic especially,  it would be advantageous to funders and recipients to know who is funding which body to deliver what project?

Councils – Open Government Licence and RPSI

This is a slight aside from the publication of open data, but an important one. If the Scottish Authorities were to adopt an OGL approach to the publication of data and information on their website (as both the Scottish Government’s core site and the Information Commissioner for Scotland do) then we would be able to at least reuse data obtained from those sites. This is not a replacement for publishing proper open data but it would be a tiny step forward.

The table below (source and review data here)  shows the current permissions to reuse the content of Scottish Local Authorities’ websites. Many are lacking in clarity, have messy wording, are vague or misunderstand terminologies. They also, in the main, ignore legislation on fair re-use.

Table of local authority adoption of OGL and RPSI

Open Government Licence

The Scottish Government’s own site is excellent and clear: permitting all content except logos to be reused under the Open Government Licence. This is not true for local authorities. At present only Falkirk and Orkney Councils – two of the smaller ones – allow, and promote, OGL re-use of content. There is no good reason why all of the public sector, including local government, should not be compelled to adopt the terms of OGL.

Re-use of Public Sector Information (RPSI) Regulations

Since 2015 the public sector has been obliged by the RPSI Regulations to permit reasonable reuse of information held by local authorities. So, even if Scottish LAs have not yet adopted OGL for all website content, they should have been making it clear for the last five years how a citizen can re-use their data and information from their website.

In my latest trawl through the T&Cs and Copyright Statements of 32 Scottish Local Authorities, I found only 7 referencing RPSI rights there, with 25 not doing so (see the full table above). I am fairly sure that these authorities are breaking the legal obligation on public bodies to provide that information.

Finally, given the presence of COSLA on the Open Government Scotland steering group, the situation with no open data; poor, missing or outdated data; and OGL and RPSI issues needs to be raised there and some reassurance sought that they will work with their member organisations to fix these issues.

Health

(Source data here)

The NHS Scotland Open Data platform continues to be developed as a very useful resource. The number of datasets  there has more than doubled since last year (from 26 to 73).

None of the fourteen Health Boards publish their own open data beyond what is on the NHS Scotland portal.

Only one of the thirty Health and Social Care Partnerships (HSCPs) publishes anything resembling open data: Angus HSCP.

COVID-19 and open data

While we are on health, I wrote (here and here) early in the pandemic about the need for open data to help the public better understand the situation, and to stimulate innovative responses to the crisis. The statistics team at Scottish Government responded well to this and we’ve started to develop a good relationship. I’ve not followed that up with a retrospective about what did happen. Perhaps I will in time.

It was clear that the need for open data in the Covid-19 situation caught government and the health sector napping. The response was slower than it should have been and patchy, and there are still gaps. People find it difficult to locate data when it is on multiple platforms, spread across Scottish Government, Health and NRS. That is, in a microcosm, one of the real challenges of OD in Scotland.

With an open Slack group for Open Data Scotland there is a direct channel that data providers could use to engage the open data community on their plans and proposals. They could also sound out what data analysts and dataviz specialists would find useful. That opportunity was not taken during the Covid crisis, and while I was OK in the short term with being used as a human conduit to that group, it was neither efficient nor sustainable. My hope is that post SODU 2020, and as the next iteration of the Open Gov Scotland plan comes together, we will see better, more frequent, direct engagement with the data community on the outside of Government, and a more porous border altogether.

Further and Higher Education

(Source data here)

There is no significant change across the sector in the past 18 months. The vast majority of institutions make no provision of open data. Some have vague plans, many of them historic – going back four years or more – and not acted on.

Lumping Universities and Colleges together, one might expect at a minimum properly structured and licensed open data from every institution on:

  • courses
  • modules
  • events
  • performance (perhaps some of this is on HESA and SFC sites?)
  • physical assets
  • environmental performance
  • KPI targets and achievements etc.

Of course, there is none of that.

Universities and colleges

I reviewed open data provision of Universities and Colleges around 17 February 2020. I revisited this on 11 August 2020, making minor changes to the numbers of data sets found.

While five of fifteen universities are publishing increasing amounts of data in relation to research projects, most of which are on a CC-0 or other open basis, there continues to be a very limited amount of real operational open data across the sector with loads of promises and statements of intent, some going back several years.

The Higher Education Statistics Agency publishes a range of potentially useful-looking Open Data under a CC-BY-4.0 licence. This is data about institutions, courses, students etc – and not data published by the institutions themselves. But I could identify none of that. Overall, this was very disappointing.

Further, there are 20 FE colleges, and none produces anything that might be classed as open data. Few have anything beyond a vague statement of intent. Perhaps City of Glasgow College comes closest, in that it does at least link to some sources of info and data.

The Crichton Observatory

While doing all of this, I was reminded of the Crichton Institute’s Regional Observatory which was announced to loud fanfares in 2013 and appears to have quietly been shut down in 2017. Two of the team involved say in their LinkedIn profiles that they left at the end of the project. Even the domain name to which articles point is now up for grabs (Feb 2020).

It now appears (Aug 2020) that the total initial budget for the project was >£1.1m. Given that the purpose of the observatory was to amass a great deal of open data, I have also attempted to find out where the data that it collected now is, and where the knowledge and learning arising from the project has been published for posterity. I can’t locate either. This FOI request may help. The big question: what benefits did the £1.1m+ deliver?

Scottish Parliament

(Source data here).

In February 2019 I found that The Scottish Parliament had released 121 data sets. These cover motions, petitions, Bills and other procedural data, and are very interesting. This year we find that they still have 121 data sets, so there are no new data sources.

In fact that number is misleading. In February 2020 I discovered that while 75 of these have been updated with new data, the remaining 46 (marked BETA) no longer work. As of August 2020 this is still the case. Why not fix them, or at worst clear them out to simplify the findability of working data?

Some of these BETA datasets should contain potentially more interesting / useful data e.g. Register of Members Interests but just don’t work. Returning: [“{message: ‘Data is presently unavailable’}”]

I didn’t note the availability of APIs last year, but there are 186 API calls available. Many of these are year-specific. I tested half a dozen and about a third of those returned error messages. I suspect some of these align with the non-functioning historic BETAs.
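A quick way to triage calls like these is to check each response body for that error message. The sketch below is an illustration only: the endpoint names and sample payloads are invented, the helper names are hypothetical, and it does not contact the Parliament’s API at all.

```python
import json

UNAVAILABLE_MARKER = "Data is presently unavailable"  # the message quoted above


def looks_unavailable(body: str) -> bool:
    """Return True if a response body looks like a non-working endpoint.

    The marker is searched for as plain text first, because the quoted error
    is a JSON array wrapping a string that is not itself valid JSON.
    """
    if UNAVAILABLE_MARKER in body:
        return True
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return True  # not JSON at all, so count it as broken
    # An empty list or object carries no data either.
    return parsed in ([], {})


def failure_rate(bodies: dict) -> float:
    """Share of response bodies that look broken (bodies maps name -> text)."""
    failed = [name for name, body in bodies.items() if looks_unavailable(body)]
    return len(failed) / len(bodies)


# Hypothetical sample: names and payloads are made up for illustration.
sample = {
    "motions": '[{"ID": 1, "Title": "Example motion"}]',
    "register_of_interests": '["{message: \'Data is presently unavailable\'}"]',
    "petitions": '[{"ID": 7}]',
}
print(f"{failure_rate(sample):.0%} of sampled calls failed")
```

Run against saved responses from all 186 calls, a check of this kind would give a firmer figure than testing half a dozen by hand.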

Sadly the issues raised a year ago about the lack of clarity of the licensing of the data is unchanged. To find the licence, you have to go to Notes > Policy on Use of SPCB Copyright Material. Following the first link there (to a PDF) you see that you have to add “Contains information licenced under the Scottish Parliament Copyright Licence.” to anything you make with it, which is OK. But if you go to the second link “Scottish Parliament Copyright Licence” (another PDF) the wording (slightly) contradicts that obligation. It then has a chunk about OGL but says, “This Scottish Parliament Licence is aligned with OGLv3.0” whatever that means. Why not just license all of the data under OGL? I can’t see what they are trying to do.

Scottish Government

(Source data here)

Trying to work out the business units within the structure of Scottish Government is a significant challenge in itself. Attempting to then establish which have published open data, and what those data sets are, and how they are licensed, is an almost impossible task. If my checking and arithmetic are right, then of 147 discrete business units, only 27 have published any open data and 120 have published none.

So we can say with some confidence  that the issue with findability of data raised in Feb 2019 is unchanged, there being no central portal for open data in the Scottish public sector or even for Scottish Government. Searching the main Scottish Government website for open data yields 633 results, none of which are links to data on the first four screenfuls. I didn’t go deeper than that.

The Scottish Government’s Statistics Team have a very good portal with 295 Data Sets from multiple organisational-providers. This is up by 46 datasets on last year and includes two new organisations: The Care Inspectorate and Registers of Scotland. The latter, so far (Aug 2020), has no datasets on the portal.

There are some interesting new entrants into the list of  those parts of Scottish Government publishing data such as David MacBrayne Limited which is, I believe, wholly owned by SG and is the parent, or operator of Calmac Ferries Limited.  On 1st March 2020 they released a new data platform to get data about their 29 ferry routes. This is very welcome. After choosing the dates, routes and traffic types you can download a CSV of results. While their intent appears to be to make it Open Data, the website is copyright and there is no specific licensing of the data. This is easily fixable.

It is also interesting to contrast Transport Scotland with work going on in England. Transport Scotland’s publication scheme says of open data “Open data made available by the authority as described by the Scottish Government’s Open Data Strategy and Resource Pack, available under an open licence. We comply with the guidance above when publishing data and other information to our website. Details of publications and statistics can be found in the body of this document or on the Publications section of our website.” I searched both without success for any OD. Why not say “we don’t publish any Open Data”? Compare this complete absence of open data with even the single project Open Bus Data for England. Read the story here. Scotland is yet again so far behind!

Summary

In the review of data I’ve shown that little has changed in 18 months. Very few branches of government are publishing open data at all. The landscape is littered with outdated and forgotten statements of good intent which are not acted on; broken links; portals that vanish or don’t work; out of date data; yawning gaps in publication and so on.

The claim of “Open By Default” in the current (2015) Open Data Strategy is misleading and mostly ignored without consequence. The First Minister may frequently repeat the mantra of “Open and Transparent” when speaking or questioned by journalists, but it is easily demonstrable that the administration frequently acts in directly the opposite way.

The recent situations with Covid-19 and the SQA exams results show Scotland would have found itself in a much better place this year with a mature and well-developed approach to open data: an approach one might have reasonably expected after five full years of “open by default”.

The social and economic arguments for open data are indisputable. These have been accepted by most other governments of the developed world. Importantly, they have also been taken up and acted on by developing nations who have in many cases overtaken Scotland in their delivery of their Open Government plans.

The work I have done in 2019 and in this review is not sustainable: a single volunteer cannot go on monitoring the activity of every branch and level of government in Scotland. And the methodology is limited to what is achievable by an individual.

A country which was serious about Open Data would have targets and measures, monitoring and open reporting of progress.

  • It wouldn’t just count datasets published. It would be looking at engagement, the usefulness of data and its integration into education.
  • It would fund innovation: specifically in the use of open data; in the creation of tools; in developing services to both support government in creating data pipelines, and in helping citizens in data use.
  • It would co-develop and mandate the use of data standards across the public sector.
  • It would develop and share canonical lists of ‘things’ with unique identifiers allowing data sets to be integrated.
  • It would adopt the concept of data as infrastructure on which new products, services, apps, and insights could be built.

I really want Scotland to make the most of the opportunities afforded by Open Data. I wouldn’t have spent ten years at this if I didn’t believe in the potential this offers; nor if I didn’t have the evidence to show that this can be done. I wouldn’t be giving up my time year-on-year researching this, giving talks, organising groups and creating opportunities for engagement.

What is fundamentally lacking here is some honesty from Scottish Government ministers instead of their pretence of support for open data.

 

Ian Watt
20 August 2020

Link to an index of pieces I have written on Open Data:
http://watty62.co.uk/2019/02/open-data-index-of-pieces-that-i-have-written/

Answer to quiz

Scotland is B, in the centre. Kenya is A, and Romania C.
I could have chosen Mexico, Honduras, Paraguay, Uruguay – or others. All are doing better than Scotland.

Header Image by David Ballew on Unsplash.

Aberdeen Built Ships

This project was one of several initiated at the fully-online Code the City 19 History and Data event.

Its purpose is to gather data on Aberdeen-built ships, with the permission of the site’s owners, and to push that refined bulk data, with added structure, onto Wikidata as open data, with links back to the Aberdeen Ships site through a new identifier.

By adding the data for the Aberdeen Built Ships to Wikidata we will be able to do several things including

  • Create a timeline of ship building
  • Create maps, charts and graphs of the data (e.g. showing the change in sizes and types of ships over time)
  • Show the relative activity of the many shipbuilders and how that changed
  • Link ship data to external data sources
  • Improve the data quality
  • Increase engagement with the ships database.

The description below is largely borrowed from the ReadMe file of the project’s GitHub repo.

Progress to date

So far the following has been accomplished, mainly during the course of the weekend.

Next Steps?

To complete the project the following needs to be done

  • Ensure that the request for an identifier for ABS is created for use by us in adding ships to Wikidata. A request to create an identifier for Aberdeen Ships is currently pending.
  • Create Wikidata entities for all shipbuilders and note the QID for each. We’ve already loaded nine of these into Wikidata.
  • Decide on how to deal with the list of ships that MAY be already in Wikidata. This may have to be a manual process. Think about how we reconcile this – name / year / tonnage may all be useful.
  • Decide on best route to bulk upload – eg Quickstatements. This may be useful: Wikidata Import Guide
  • Agree a core set of data for each ship that will be parsed from ships.json to be added to Wikidata – e.g. name, year, builder, tonnage, length etc
  • Create a script to output text that can be dropped into a CSV or other file to be used by QuickStatements (assuming that to be the right tool) for bulk input ensuring links for shipbuilder IDs and ABS identifiers are used.
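The last of those steps might look something like the sketch below. It is a sketch under several assumptions, not the project’s actual script: the structure of ships.json (a list of records with id, name, year and builder keys) is guessed, the choice of Wikidata properties should be checked before any upload, the builder QID is a placeholder, and P_ABS_ID stands in for the Aberdeen Ships identifier, which is still pending. The output uses the tab-separated QuickStatements V1 command style.

```python
import json

# Assumed Wikidata properties and items; verify each before uploading.
P_INSTANCE_OF = "P31"
Q_SHIP = "Q11446"
P_MANUFACTURER = "P176"   # assumption: used here for the shipbuilder
P_INCEPTION = "P571"      # assumption: used here for the year built
P_ABS_ID = "P_ABS_ID"     # placeholder: the ABS identifier is still pending


def ship_to_quickstatements(ship: dict, builder_qids: dict) -> list:
    """Turn one ship record into QuickStatements V1 (tab-separated) lines."""
    name = ship["name"]
    lines = [
        "CREATE",
        f'LAST\tLen\t"{name}"',
        f"LAST\t{P_INSTANCE_OF}\t{Q_SHIP}",
    ]
    if ship.get("year"):
        year = int(ship["year"])
        # The /9 suffix marks the date as precise to the year only.
        lines.append(f"LAST\t{P_INCEPTION}\t+{year:04d}-00-00T00:00:00Z/9")
    builder_qid = builder_qids.get(ship.get("builder"))
    if builder_qid:
        lines.append(f"LAST\t{P_MANUFACTURER}\t{builder_qid}")
    if ship.get("id"):
        abs_id = ship["id"]
        lines.append(f'LAST\t{P_ABS_ID}\t"{abs_id}"')
    return lines


def build_batch(ships: list, builder_qids: dict) -> str:
    """Join the commands for every ship into one block of text."""
    return "\n".join(
        line for ship in ships for line in ship_to_quickstatements(ship, builder_qids)
    )


# Illustrative record only; not taken from the real ships.json.
sample_json = (
    '[{"id": "100001", "name": "Example Ship", '
    '"year": 1868, "builder": "Example Builder & Co"}]'
)
builders = {"Example Builder & Co": "Q_PLACEHOLDER"}
print(build_batch(json.loads(sample_json), builders))
```

Ships that may already be in Wikidata would need to be filtered out before this step, since every record passed in produces a CREATE command.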

We will also be looking to get pictures of the ships published onto Wiki Commons with permissive licences, link these to the Wikidata items, and increase and improve the number of Wikipedia articles on Aberdeen Ships in the longer term.

Header Image of a Scale Model of Thermopylae at Aberdeen Maritime Museum By Stephencdickson – Own work, CC BY-SA 4.0

You can lead a horse to water but Covid19 data must be scraped

Just over a month ago I wrote about the lack of Covid19 Open Data for Scotland.

I showed how the Italian health authorities were doing just what was needed in the most difficult of circumstances. I explained how, in the absence of an official publication of open data (in an openly-licensed, neutral, machine-readable format), I’d taken it on myself on 15th March 2020 to gather and publish the data. My hope was that I’d have to do it for a week or two, then the Scottish Government would take over.

And here we are five weeks and one day later and I am still having to do it. Meantime a growing list of websites and applications has developed to use the data which is great but adds to the pressure.

So what’s happened in the intervening time and, more importantly, what hasn’t?

From manual to coded scraping

Originally I was gathering the data manually: going to the Scottish Government web page and retyping the data into CSVs. This is a terrible practice – open to errors and demanding double and triple checking before pushing to GitHub. While it looked like my publication was going to be short term that hardly mattered.

But as the weeks dragged by I had to concede that there was no rescue coming from Scottish Government any time soon. Initially it appeared that the daily data was going to be published openly via their statistics platform. That eventually morphed into an additional but different set of data (from National Records of Scotland, not HPS).

So, I resolved to build a scraper – a piece of code that will read the HTML of a webpage and extract the data from that. Sounds easy – but in practice it can be far from it. And when all is said and done it is the most brittle of solutions, as any small change can break the code.

Nested Span Tags on the SG web page

Given how poorly the page was structured (endless nested blank span tags being just one crime against HTML) I didn’t have a great deal of confidence that it could be kept working.

I built it and tested it daily but it wasn’t until 14th April that I was confident enough that it would work daily. Even then it wouldn’t take much to derail it. At that point it was 360 lines of code just to get a few dozen numeric values from a single page.

There is probably some law named after someone wiser than me that says that once you launch a piece of software it will be broken the very next day, and so it was the very next morning. The scraper relies on knowing the structure of a page – finding bulleted lists and tables, iterating through those structures looking for patterns to match, and grabbing the numbers.
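To illustrate the approach (this is not the 360-line scraper itself), here is a minimal Python sketch using only the standard library. The sample HTML, the `Label: number` pattern and the names `ListItemCollector` and `extract_counts` are all invented for the example; the real page was messier.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment: two list items wrapped in nested spans,
# plus a figure that lives in a paragraph instead of the list.
SAMPLE = """
<ul>
  <li><span><span>Tests concluded: </span><span>120</span></span></li>
  <li><span>Positive cases: <span></span>15</span></li>
</ul>
<p>Deaths: 2</p>
"""


class ListItemCollector(HTMLParser):
    """Collect the visible text of every <li> element on a page."""

    def __init__(self):
        super().__init__()
        self.items = []    # finished list-item texts
        self._depth = 0    # greater than 0 while inside an <li>
        self._buffer = []  # text fragments of the current item

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._buffer).split())
                self.items.append(text)
                self._buffer = []

    def handle_data(self, data):
        # Nested <span> tags are ignored; only their text is kept.
        if self._depth:
            self._buffer.append(data)


def extract_counts(html):
    """Return {label: number} for list items shaped like 'Label: 123'."""
    parser = ListItemCollector()
    parser.feed(html)
    counts = {}
    for item in parser.items:
        match = re.fullmatch(r"(.+?):\s*(\d+)", item)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts
```

Note that the `<p>` at the end of the sample is silently missed. That is the brittleness in miniature: the code only finds what sits where it expects it to.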

Since then the Scottish Government have changed the structure of the page as many as six times, including

  • making the final item in a bulleted list into a new paragraph in its own right,
  • removing a table completely,
  • and today changing the format of numbers in a table to include commas where none were used before.
Text all in bullets
Final list item now a para
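Of the three changes, the comma one is the easiest to guard against. A small sketch, assuming the figures arrive as strings such as `1,234`; `parse_count` is a hypothetical helper, not part of the original scraper:

```python
import re


def parse_count(text):
    """Convert '1,234' or ' 1234 ' to an int, rejecting anything else."""
    cleaned = text.strip().replace(",", "")
    if not re.fullmatch(r"\d+", cleaned):
        # Fail loudly so a changed page format is noticed, not mis-recorded.
        raise ValueError(f"not a whole number: {text!r}")
    return int(cleaned)
```

Raising an error rather than guessing means a format change stops the day's publication instead of quietly putting a wrong figure into the CSVs.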

If you are interested I have archived the page contents for each day (minus the styling).

A breakthrough?

Last week it looked like we might have an easier solution on our hands: not only did they change the URL of the page with the data, they then without fanfare added a new XLSX spreadsheet with the daily data in it, updated each day. While not a CSV file, it appeared that it would be very useful.

So yesterday I started to code up a routine to

  • grab the XLSX file,
  • download that,
  • save it as a reference copy, then
  • figure out the worksheet names which have data, not charts,
  • go to those worksheets,
  • find the ranges with the data (ignoring comments in rows above the data, to the right of the data, below the data – see image below),
  • extract that data and write it back to plain CSV files as I was doing with my original scraper.
A screenshot of one worksheet showing one area of data (green) and three of non-data (red)
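The range-finding step in that routine can be sketched roughly as follows. This assumes the worksheet has already been read into a list of rows (for instance with a spreadsheet library, not shown), that the header row starts with a known marker such as `Date`, and that the first blank row ends the data. The layout, the marker and every value in `SAMPLE_ROWS` are invented for illustration.

```python
import csv
import io

# Hypothetical worksheet: notes above, to the right of, and below the data.
SAMPLE_ROWS = [
    ["Notes: figures as at 2pm", None, None, None, None],
    [None, None, None, None, None],
    ["Date", "Tests", "Positive", None, "a comment to the right"],
    ["2020-04-19", 100, 10, None, None],
    ["2020-04-20", 120, 12, None, None],
    [None, None, None, None, None],
    ["Source: a footnote below the data", None, None, None, None],
]


def extract_data_block(rows, header_marker="Date"):
    """Return the header row plus data rows, dropping surrounding notes."""
    # The header is the first row whose first cell equals the marker.
    start = next(
        i for i, row in enumerate(rows) if row and row[0] == header_marker
    )
    header = rows[start]
    # The data width is the run of non-empty cells at the start of the header.
    width = 0
    for cell in header:
        if cell in (None, ""):
            break
        width += 1
    block = [list(header[:width])]
    for row in rows[start + 1:]:
        if not row or row[0] in (None, ""):
            break  # the first blank row ends the data
        block.append(list(row[:width]))
    return block


def to_csv(block):
    """Serialise the block as plain CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(block)
    return out.getvalue()
```

Like the HTML scraper, this depends entirely on the layout staying put: move the header or insert a blank row and it will need revisiting.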

Having tested the first part of it yesterday I re-ran it today and it broke. It turns out the URL at which the spreadsheet is published changes from day to day. I suspect that this is as a result of some sort of Content Management System.

All of which means I now have to write another scraper to identify each day’s URL before I can do any of the above.
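A minimal sketch of that extra step, assuming the day’s spreadsheet is linked from the page with an `href` ending in `.xlsx`. The `example.org` addresses and the names `LinkCollector` and `find_spreadsheet_url` are placeholders, not the real page or code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def find_spreadsheet_url(html, page_url):
    """Return the absolute URL of the first .xlsx link, or None."""
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.hrefs:
        # Ignore any query string when checking the file extension.
        if href.split("?")[0].lower().endswith(".xlsx"):
            return urljoin(page_url, href)
    return None
```

For example, `find_spreadsheet_url(page_html, "https://example.org/covid/")` would resolve a relative link such as `/files/2020-04-20/data.xlsx` against the page address, so the download step no longer cares that the path changes daily.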

Why are we in this position?

The current position defies logic. There are so many factors that should have meant that Scottish Government would have this sorted out by now.

  1. I’d identified the need for plain CSV publishing previously, and very publicly, giving good examples.
  2. I’d had an email from a contact in SG mentioning two of my CSV files (in the context of the forthcoming NRS data publication).
  3. I work as part of the Civic Society group within the Open Government Partnership, and I am the lead for open data.
  4. So people know where to find me and interact with me.
  5. I have blogged extensively about what we need – and have emailed contacts at SG.
  6. As part of the planning for Scottish Open Data Unconference I set up a Slack group to which SG contacts were invited – and I believe some signed up.
  7. So there are forums through which, if there was any uncertainty, anyone at SG could ask “what does the data community want?” or “we’re thinking about doing ‘x’, would that work for you?” But there has been no such approach.

Meantime, I’ve spent many tens of hours, for no financial reward, doing what Peter Drucker called “doing the wrong thing righter”, i.e. allowing the SG to continue to publish data wrongly on the basis that the effort is transferred onto the community to create work-arounds.

After five weeks of doing this daily, I’m absolutely fed up of it. I’m going to formally raise it through the Open Government Partnership with a view to getting the right data, in the right format, published daily.

 

Header picture cropped from a photo by Nathan Dumlao on Unsplash

Scotland’s Covid-19 Open Data

We are in unprecedented times. People are trying to make sense of what is going on around them, and the demand for up-to-date, even up-to-the-minute, information is greater than ever. Journalists, data scientists, immunologists, epidemiologists and others are looking for data to use to develop that information for the broader public, as well as to feed into predictive modelling. That means that governments and Health Services at all levels (UK and Scotland) need to be publishing that data quickly, consistently, and in a way that makes it easy for the data users to consume it. They need to look at best practice and quickly adopt those standards and approaches.

Let’s start with what this post is not. It is not a criticism of some very hard pressed people in NHS Scotland and Scottish Government who are trying very hard to do the right thing.

So, what is it? It is an honest suggestion of how the Scottish Government must adapt in how it publishes data on the most pressing issue of modern times.

The last five days

Last Sunday, 15th March, as the number of people in Scotland with Covid-19 started to climb (even if numbers were still low in comparison to other EU countries), I went looking for open data on which I could start to plan some analysis and visualisation. And I found none.

What I did find was a static HTML webpage. This had the figures for that day:  the  total number of tests conducted, the total number of negative results, and the number of positive cases for each Health Board. This page is then overwritten at 2pm the next day. This is an awful practice, also used by Scotrail to hide its performance month on month.

I was able, using the Internet Archive’s Wayback Machine, to fill in some gaps back to 5th March but that was far from complete. I published what I could on GitHub and mentioned that on Twitter and in a couple of Slack groups. Thankfully a friend, Lesley, was ahead of me in terms of data collection for her work as a data journalist, and was able to furnish testing data back to the start on 24 January 2020. Since then I’ve updated the GitHub repo daily – usually when the data is published at 2pm.

Almost as soon as I began, a couple of people started to build visualisations based on what I had put in GitHub, including this one. Some said that they were waiting for the numbers, particularly deaths, to climb to more significant levels before they would start to use the data.

Two or three times the data has been published then corrected with some test results for Shetland / Grampian being reassigned between the two. This is understandable given the current circumstances.

SG webpage with table of Covid19 daily cases

On 19 March 2020, the 2pm publication was delayed, with the number of fatalities and positive results being published after 3.30pm and the total number of tests being published after 7pm. Again – this is understandable. The present circumstances are unprecedented, and processes are still being developed. Up to now much of Scotland’s open data publication has been done, if at all, at a more leisurely and considered pace. It does make one wonder how the processes will cope as the numbers rise exponentially, as they surely will.

Why is this important?

At this time the public are trying to make sense of a very difficult situation. Journalists, scientists and others are trying to assist in that by interpreting what data there is for them, including building visualisations of that. People are also seeking reassurances – that the UK and Scottish Government are on top of the situation. Transparency around government activity such as testing, and the spread of the virus, would build trust. Indeed there is real concern that Scotland, and the UK as a whole, is not meeting WHO guidance on testing and tracing cases.

But with a static web page, with limited range of data that is erased daily, this is not possible. Even setting up a scraper to grab the essential content from that page is not feasible if the data is only partially published for long periods.

We have some useful data visualisations such as this set by Lesley herself. But what can be done is limited. Deaths per health board are collected, we’ve been told, but they are not published – only a Scotland-level total.

I’ve had it confirmed by someone I know in the Scottish Government that they are looking at creating and posting Linked Open Data, which I suspect will be on their platform. That platform is a great resource, but it is seen by many as a barrier to actually getting data quickly and simply.

Italian government GitHub repo

Compare this with the Italian Government who have won plaudits from the data science, journalism and developer communities for making their data available quickly and simply using GitHub  as the platform. This is one that is familiar to the end-users. They also have a great range of background information (look at it in Chrome which will translate it). On that platform they publish daily national and regional statistics for

  • date
  • state
  • hospitalised with symptoms
  • intensive care
  • total hospitalised
  • home isolation
  • total currently positive
  • new currently positive
  • discharged healed
  • deceased
  • total cases
  • swabs tests.

Not only is the data feeding the larger, world-wide analysis such as that by Johns Hopkins University, but people at a national level are using that data to create some compelling, interactive visualisations such as this one. As each country starts to recover and infections and deaths start to slow, visualising that recovery will depend on having the data to drive those views.

[edited] Wouldn’t a dashboard such as this one for Singapore, built by volunteers, be a good thing for Scotland? We could do it with the right data supplied.

Singapore dashboard

[/edited]

So, this is a suggestion, or rather a request, to NHS Scotland and the Scottish Government to put in place a better set of published data, which is made available in as simple and as timely a fashion as can be accomplished under the present circumstances. Give us the data and we’ll crowd-source some useful tools built on it.

How to do that?

The Scottish Government should look to fork one of the current repositories and use that as a starting point. In an ideal world that would be the Italian one – but even starting with my simple one (if the former is too much) would be a step forward.

Also, I would encourage the government to get involved in the conversations that are already happening – here for example in the Scottish Open Data group.

There is a large and growing community there, composed of open data practitioners, enthusiasts and consumers, across many disciplines, who can help and are willing to support the government’s work in this area.

Aberdeen Plaques – Part Two

In part one I described what we did at CTC18 to capture data and images of Commemorative Plaques in Aberdeen, and what I then did in the following three weeks.

A few people asked me why we would bother to put plaques into Wikidata and WikiCommons in this way. Why not have a council website – or why not use Open Plaques?

In this second instalment I am going to demonstrate how we can use the data which we have created to make some interesting visualisations and even do some calculations and analysis.

It can also power other new apps and services – allowing developers to create tailored routes around the city, on themes such as the arts or medicine – which is beyond the scope of this post.

Getting Started

At the time of writing we now have 132 Aberdeen Commemorative Plaques recorded  in Wiki Data.

I can check that with this simple query on the Wiki Data Query Service:

Plaques – Query One

All that this does is ask for every instance (P31) of a commemorative plaque (Q721747) which is located in (P131) the Aberdeen City (Q62274582) area.
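Since the query itself only survives as an image, here is a reconstruction from the description above, wrapped in a little Python so that it can also be sent to the service from code. The property and item IDs come from the text; the variable name `?plaque` (taken from the later snippets), the helper `query_url` and the exact layout are assumptions.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint behind the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY_ONE = """
SELECT ?plaque WHERE {
  ?plaque wdt:P31 wd:Q721747 .     # instance of: commemorative plaque
  ?plaque wdt:P131 wd:Q62274582 .  # located in: Aberdeen City
}
"""


def query_url(sparql):
    """Build a GET URL that asks the endpoint for JSON results."""
    params = {"query": sparql.strip(), "format": "json"}
    return ENDPOINT + "?" + urlencode(params)
```

The text between `SELECT` and the final `}` can be pasted straight into the query editor. To fetch results from a script, request `query_url(QUERY_ONE)` with `urllib.request` and a descriptive `User-Agent` header.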

Try It for yourself.

Click on the white-on-blue arrow at the left. See what it produces. Note the bottom half of the screen turns into a table of results, and on the centre bar there is a message ‘xxx results in xxxx milliseconds’.

How many pictures of plaques?

I can retrieve the photograph for each plaque using the following query.

Plaques – Query Two

Here I am saying give us plaques which have image (P18). In effect this is saying ONLY those that have an image. If not all entries have an image, yet, then we will get a smaller number.

Try it.

As I run it I get 126 – which is six fewer than the number of plaques.

Get all plaques with images or not

Let’s modify the query to this.

Plaques – Query Three

Here I am using the OPTIONAL keyword, which has the effect of saying IF there is an image give me it, but don’t restrict the results to only those with images. When we run that we can spot the missing ones by scrolling down through the list. I get six plaques with no images. This is a useful technique to spot missing things when totals (in this case plaques and images) don’t tally.
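The difference between the two image queries comes down to one wrapper. A sketch, with the queries reconstructed from the descriptions (the originals are screenshots) and a hypothetical helper `plaque_image_query` to switch between them:

```python
def plaque_image_query(require_image):
    """Return the plaque query with the image either required or optional."""
    image_pattern = "?plaque wdt:P18 ?image ."  # P18 = image
    if not require_image:
        # OPTIONAL keeps plaques in the results even when they lack an image.
        image_pattern = "OPTIONAL { " + image_pattern + " }"
    return f"""SELECT ?plaque ?image WHERE {{
  ?plaque wdt:P31 wd:Q721747 ;
          wdt:P131 wd:Q62274582 .
  {image_pattern}
}}"""
```

`plaque_image_query(True)` corresponds to the stricter query, returning only plaques with photographs; `plaque_image_query(False)` returns every plaque, with an empty image cell where none exists.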

Try it.

Commemorating who or what?

As it stands the query is still not very user-friendly as all we have for the plaques is their Plaque ID. Of course we can click on those, but it would be more helpful to have the names of their subjects.

We’ll do that in two steps.

Firstly, let’s work out what the subjects are.

We can add the following line to the query and remember to add ?subject to the SELECT on the first line.

 ?plaque wdt:P547 ?subject

Note P547 is the statement “commemorates“.

Try it

If we run that we get a new column called subject and it is filled with links to subject IDs, which are the Wikidata entries for either people or things that the plaques commemorate. I note that when I run it my list has grown from 132 to 134.

Any guesses why that should be?

Some of the plaques commemorate more than one person.

Let’s make it a bit more friendly.

Add the following line just before the end of your query

 SERVICE wikibase:label {bd:serviceParam wikibase:language "en". }

And change ?subject to ?subjectLabel in the first line.

This instructs the WikiData Query service to use another service to retrieve labels from the items.

Plaques – Query Four
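Putting those pieces together, the query might look like the string below. It is a reconstruction from the steps described, not a copy of the screenshot. The small helper `subject_labels` assumes the standard SPARQL JSON results layout (`results` then `bindings`), and the sample response in the usage note is invented.

```python
QUERY_FOUR = """SELECT ?plaque ?subjectLabel ?image WHERE {
  ?plaque wdt:P31 wd:Q721747 ;
          wdt:P131 wd:Q62274582 ;
          wdt:P547 ?subject .            # commemorates
  OPTIONAL { ?plaque wdt:P18 ?image . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}"""


def subject_labels(response):
    """Pull the subjectLabel values out of a SPARQL JSON response."""
    return [
        row["subjectLabel"]["value"]
        for row in response["results"]["bindings"]
        if "subjectLabel" in row
    ]
```

For instance, a made-up response of `{"results": {"bindings": [{"subjectLabel": {"type": "literal", "value": "Example Person"}}]}}` yields `["Example Person"]`.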

The label is in effect the title of the Wikidata item. Look at this one: https://www.wikidata.org/wiki/Q80818579. Immediately below the title, and to the left, there is an edit link. Click that. See how the ‘label’ and the ‘description’ immediately below it become editable. Cancel that for now.

Try running that query to get subject names (labels) back

Now we have a name (in a subjectLabel column) for who or what is being commemorated.

Which provosts have plaques?

We can ask which of our plaques commemorates a previous Lord Provost of Aberdeen.

We use the P547 (commemorates) statement to get our subject, then use the following

?subject wdt:P39 wd:Q57906938 .

where P39 is Position Held, and Q57906938 is the identifier for Lord Provost of Aberdeen.
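In full, the query could read something like this. It is a sketch assembled from the pieces above, held in a Python string named `PROVOST_QUERY` (a name chosen here) so that it can be pasted into the query editor or sent from a script.

```python
PROVOST_QUERY = """SELECT ?plaque ?subjectLabel WHERE {
  ?plaque wdt:P31 wd:Q721747 ;
          wdt:P131 wd:Q62274582 ;
          wdt:P547 ?subject .      # commemorates
  ?subject wdt:P39 wd:Q57906938 .  # position held: Lord Provost of Aberdeen
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}"""
```

Because the P39 pattern is not wrapped in OPTIONAL, plaques whose subjects never held the post drop out of the results.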

Plaques – provosts?

Currently we appear to have four plaques to former Lord Provosts.

Note: the “Try it” link below has been updated to take  account of subsequent work done to separate Provosts and Lord Provosts into separate categories.

Try it

A different view

At this point you might want to change the view for your query just to have a look at the images we have.

Above the table of results, on the extreme left there is an eye symbol and a drop down. Choose “Image Grid” to see the images only.

Plaques – change view

You might also have noticed that there are other options, several of which are greyed out as we don’t yet have that data in our query. These views include ‘Map’ and ‘Timeline’. We’ll come back to those.

Our Image Grid looks something like this:

Plaques – Image Grid

Remember to swap back to ‘Table’ view once you’ve finished.

Adding more data fields

We can now add more data fields to our query.

Firstly, let’s add the geographic coordinates of the plaques’ locations.

Add the following line to your code:

 OPTIONAL {?plaque wdt:P625 ?coordinates .}

and, again add the new value, ?coordinates to the first line of the query too.

You will now have an extra field in the returned data table.

Try it 

Mapping results

Now change the view from Table to Map. The Wikidata query service automatically uses the coordinates to plot the results on a map which is scaled to show the results. You may need to scroll down to see all of the map. Click on one of the plotted points. You should get a pop up with the name of the person or building commemorated, plus a photo of the plaque itself, as shown below.

Plaques – map view

Note – if you add the following as the first line of your query, it will default to a map view rather than table when first run.

#defaultView:Map
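If you pull the results into your own code rather than using the built-in map, bear in mind that the query service returns each P625 value as a WKT literal of the form `Point(longitude latitude)`, with longitude first. A small sketch; `parse_point` is a hypothetical helper and the sample coordinates in the usage note are only roughly where Aberdeen is.

```python
import re

_POINT = re.compile(r"Point\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)")


def parse_point(wkt):
    """Convert 'Point(lon lat)' into a (latitude, longitude) tuple."""
    match = _POINT.fullmatch(wkt.strip())
    if match is None:
        raise ValueError(f"unexpected coordinate format: {wkt!r}")
    longitude, latitude = (float(part) for part in match.groups())
    # Most mapping libraries expect latitude first, so swap the order.
    return latitude, longitude
```

So `parse_point("Point(-2.1 57.15)")` gives `(57.15, -2.1)`.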

Now let’s see if we can get more data for the people for whom there are plaques.

Dates of birth and death

We can change our query to find out if there are dates of birth and death for our human subjects  (rather than buildings).

We can use P569 (date of birth) and P570 (date of death) and assign those to ?DOB and ?DOD respectively – again, adding those fields to our SELECT statement on line one. Your query should look like this:

Plaques – Query Five

Try it

Looking at our table of results we can see that we have a mix of types of results – people, bridges, buildings etc. but only the people have dates.

Table showing dates of birth

Interestingly the one subject with the DOB and DOD in the screenshot above is Elizabeth Crombie Duthie who gifted Duthie Park to the city of Aberdeen.

Remember, if you change the DOB and DOD from being OPTIONAL to just being regular requests, you can filter records to show ONLY those with dates associated with them, which will screen out not only non-human subjects but also any people with incomplete or missing dates.
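A reconstruction of the dates query, with both dates optional as described. The helper `year_of` is an addition for anyone processing the results in code: it assumes dates come back as timestamps such as `1316-01-01T00:00:00Z`, and the month and day in the examples are placeholders.

```python
import re

QUERY_FIVE = """SELECT ?plaque ?subjectLabel ?DOB ?DOD WHERE {
  ?plaque wdt:P31 wd:Q721747 ;
          wdt:P131 wd:Q62274582 ;
          wdt:P547 ?subject .
  OPTIONAL { ?subject wdt:P569 ?DOB . }  # date of birth
  OPTIONAL { ?subject wdt:P570 ?DOD . }  # date of death
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}"""


def year_of(timestamp):
    """Return the year from a timestamp such as '1316-01-01T00:00:00Z'."""
    match = re.match(r"(-?\d+)-\d{2}-\d{2}", timestamp)
    if match is None:
        raise ValueError(f"unexpected date format: {timestamp!r}")
    return int(match.group(1))
```

Removing the two `OPTIONAL { }` wrappers gives the stricter version that lists only subjects with both dates recorded.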

Notable people

It could be argued that the fact there is a plaque to a person would indicate that they are notable, but not every person or object for which there is a plaque has a Wikipedia article. Let’s add some code to see which of our plaques has an associated article.

Plaques – Query Six

Try It

Changing the above so that we remove the OPTIONAL {} around the section beginning ?article  we get ONLY those with Wikipedia articles which is, as I run it, 79 plaque subjects.
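The article lookup is only visible in the screenshot, so the following is an assumption about how it is done: the usual query-service pattern of matching a sitelink with `schema:about` and `schema:isPartOf`. Restricting it to English Wikipedia is also an assumption, as is the helper name `article_query`.

```python
ARTICLE_PATTERN = (
    "?article schema:about ?subject ;\n"
    "             schema:isPartOf <https://en.wikipedia.org/> ."
)


def article_query(only_with_article=False):
    """Return the plaque query, optionally limited to subjects with articles."""
    if only_with_article:
        clause = ARTICLE_PATTERN
    else:
        clause = "OPTIONAL { " + ARTICLE_PATTERN + " }"
    return (
        "SELECT ?plaque ?subjectLabel ?article WHERE {\n"
        "  ?plaque wdt:P31 wd:Q721747 ;\n"
        "          wdt:P131 wd:Q62274582 ;\n"
        "          wdt:P547 ?subject .\n"
        "  " + clause + "\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )
```

`article_query()` lists every plaque subject with its article where one exists; `article_query(only_with_article=True)` is the stricter form that produces the smaller count.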

If we want, we can add the following

 ?subject wdt:P31 wd:Q5 .

which requires that the subject’s P31 (instance of) is Q5 (human), and so screens out all of the non-people plaques.

Try it

At this point, try flipping the view to Timeline – you may have to scroll down quite a way to see all of the plaques. Many of them are concentrated at the right, spanning much of the 20th century. You should see John Barbour (1316-1395) at the extreme left.

Plaques – timeline

Finally, before we start doing some statistical analysis let’s try something more sophisticated.

Can we create a map showing only female subjects whose work was in the medical sciences?

To do that we need to select only subjects who have a P21 (sex or gender) of Q6581072 (female). Then we need to select an occupation (P106) which is an instance (P31) or subclass (P279) of Q66811410 (the medical profession). This requires a structure that we haven’t seen before:

?occupation wdt:P31/wdt:P279* wd:Q66811410

While we are at it, let’s get an image of the subject if there is one, and find out if there is a Wikipedia article about the subject. And, since we want a map, we add that as our default view at the top.

Plaques – map of female medics
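As a sketch of how these parts could fit together (reconstructed from the description, so details will differ from the screenshot), the function below builds the query for a given gender. The name `medics_map_query`, the use of DISTINCT and the English Wikipedia sitelink pattern are assumptions.

```python
FEMALE = "Q6581072"
MALE = "Q6581097"


def medics_map_query(gender=FEMALE):
    """Return a map-view query for plaque subjects of one gender in medicine."""
    return f"""#defaultView:Map
SELECT DISTINCT ?plaque ?subjectLabel ?coordinates ?image ?article WHERE {{
  ?plaque wdt:P31 wd:Q721747 ;
          wdt:P131 wd:Q62274582 ;
          wdt:P547 ?subject ;
          wdt:P625 ?coordinates .
  ?subject wdt:P21 wd:{gender} ;
           wdt:P106 ?occupation .
  # The occupation must be an instance of something in the medical profession
  # class tree: one P31 step followed by any number of P279 steps.
  ?occupation wdt:P31/wdt:P279* wd:Q66811410 .
  OPTIONAL {{ ?subject wdt:P18 ?image . }}
  OPTIONAL {{
    ?article schema:about ?subject ;
             schema:isPartOf <https://en.wikipedia.org/> .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""
```

DISTINCT is there because a subject with two matching occupations would otherwise appear twice. Calling `medics_map_query(MALE)` swaps the gender without touching anything else.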

This gives us the following output:

Map view of female medics

Try it

Changing this query to male (Q6581097) or choosing different types of professions is straightforward.

Statistical analysis

The Wikidata Query Service allows us to move beyond visualising the data in different ways. Let’s have a look at a couple of examples.

Analysing who or what is commemorated

The following query finds out what the subject of the plaque is an instance of (P31) – line 6:

Plaque – query seven

but instead of creating a list, it uses the COUNT() function to count how many subjects there are for each value of Instance Of (P31).

Try it

We can see that we have 105 humans, 5 lanes etc. Note that some double counting occurs. Some structures, for example, are instances of two things.

We can also analyse the gender of the human subjects just by changing P31 in the above to P21 (Sex or Gender).
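Because the two analyses differ only in the property, they can be sketched as one parameterised query. This is a reconstruction rather than the query in the screenshot; `count_by_property` and the variable names `?value` and `?count` are choices made here.

```python
import re


def count_by_property(prop):
    """Return a query counting plaque subjects per value of one property."""
    if not re.fullmatch(r"P\d+", prop):
        raise ValueError(f"not a Wikidata property ID: {prop!r}")
    return f"""SELECT ?valueLabel (COUNT(?subject) AS ?count) WHERE {{
  ?plaque wdt:P31 wd:Q721747 ;
          wdt:P131 wd:Q62274582 ;
          wdt:P547 ?subject .
  ?subject wdt:{prop} ?value .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
GROUP BY ?valueLabel"""
```

`count_by_property("P31")` gives the instance-of breakdown and `count_by_property("P21")` the gender one. As written it counts rows, so a subject with two values for the property, or with two plaques, is counted more than once.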

At present I get

Plaques by gender

That’s far from gender equality, isn’t it!

What’s in a name?

Ascertaining the most common first names on plaques is also straightforward.

We use P735 (given name) statement, get the labels, count and group by those.

Try it.

We get the following results

Plaques – given names chart

With 81% of plaques to people being for males it is hardly surprising that our league table of names begins with James, William, George, John, Alexander ….

We can do more sophisticated analysis too.

Analysing Occupations

We can add the following line to our query to get back the occupation of the subject of the plaque:

 ?subject wdt:P106 ?occupation

Bear in mind that many of our plaque subjects are true polymaths. Have a look at Robert Brown. He has 10 listed occupations!

So what are the most common occupations of those people for whom there are plaques? Any guesses?

Let’s use the following query:

Plaques – Using Count()

This uses the COUNT () function as well as a GROUP BY clause. The query looks at all of the different occupation labels, counts how many of each there are.

Try it

This returns, by default, a table of values. We can flip to a Bar Chart to make better sense of the data:

Plaques – Bar Chart of occupations

So, we can see that for those commemorated by a plaque the most common occupations are Physician, Painter, University Lecturer, Writer and so on.

We can add a couple of refinements if we wish. If we want our query to default to a BarChart when we run it we can add the following line at the start of the query:

#defaultView:BarChart

and if we want the table to be sorted by value we can add a line such as

ORDER BY DESC (?count)
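With both refinements applied, the occupations query might look like the string below. Again this is a reconstruction, stored in a Python constant named `OCCUPATION_QUERY` (a name chosen here); the variable names are assumptions.

```python
OCCUPATION_QUERY = """#defaultView:BarChart
SELECT ?occupationLabel (COUNT(?subject) AS ?count) WHERE {
  ?plaque wdt:P31 wd:Q721747 ;
          wdt:P131 wd:Q62274582 ;
          wdt:P547 ?subject .
  ?subject wdt:P106 ?occupation .  # occupation
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
GROUP BY ?occupationLabel
ORDER BY DESC(?count)"""
```

The `#defaultView` comment has to be the first line, and `ORDER BY` comes after `GROUP BY`, so the largest groups sit at the top of the table and the left of the chart.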

Try it

What next?

Over the last month I’ve been busy gathering data, taking photographs and publishing all of those on WikiData and Wiki Commons. That phase is not quite complete, if it ever could be considered complete. You can monitor live progress here.

There are a couple of photographs which I can’t easily take which I know Aberdeen City Council’s Museum and Galleries team have. It would be great to see those made available by them on Wiki Commons, as I have shared the 148 plaque photos I have taken.

I know of at least 24 more plaques which I have photographed which are not listed yet in Wikidata.

When I published part one of this series I got some great feedback on Twitter. One suggestion is that we add structured data to the Wiki Commons pages for each photograph. Another was to add further data to the record for each plaque using statement P276 (location) where the plaque is on a known listed building. So far I have done that for 5 plaques – check it for yourself. There are loads more to do.

Many of the people records that I have created in Wikidata are skeletal. They need more detail, photographs, biographical links etc. Similarly, given that people or places are noteworthy enough to merit a plaque, they should pass the notability test for Wikipedia, yet at least 68 plaque subjects have no Wikipedia entry.

And plaques are just a start – an easy introduction to what is possible given, in this case, about 100 hours of work. While that was almost all done by one person, if we ran a Code The City weekend on a similar theme and similar sized challenge, six people could achieve the same over a weekend with a little coordination.

At Code The City, we’re about to start discussions with the local cultural institutions about setting up a more formal alliance for the city (shire?) to help shape how they use digital and data more effectively and grow volunteers with skills and tools to make that happen, which is an exciting note on which to finish this post! Watch this space, as they say.

Ian