Scraping Goes Off The Rails

This post was originally published on 10ml.com by Ian Watt

The art of scraping websites is one beset by difficulties, as I was reminded this week when re-testing a scraper that I built recently.

Schienenbruch

 

Railway performance

As part of my participation in 100 Days of Code I’ve been working on a few projects.

The first one that I tackled was a scraper to gather data from the PDF performance reports which are published on a four-weekly cycle on Scotrail’s website. On the face of it this is a straightforward thing to do.

  1. Find the link to the latest PDF on the performance page using the label “Download Monthly Performance Results”.
  2. Grab that PDF to archive it. (Scotrail don’t do that – they remove each one and replace it with a new one every four weeks, so there is no archive).
  3. Use a service such as PDFTables which has an API, uploading the PDF and getting a CSV file in return (XLSX and XML versions are also available but less useful in this project).
  4. Parse the CSV file and extract a number of values, including headline figures, and four monthly measures for each of the 73 stations in Scotland.
  5. Store those values somewhere. I decided on clean monthly CSV output files as a failsafe, and a relational SQLite database as an additional, better solution.
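Step 1 above could be sketched roughly as follows, using only Python's standard library. This is an illustration rather than the code of the actual scraper: the class and function names are hypothetical, and it assumes the label appears as the visible text of an ordinary link on the performance page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LabelledLinkFinder(HTMLParser):
    """Collect the href of every <a> whose visible text contains a label."""

    def __init__(self, label):
        super().__init__()
        self.label = label.lower()
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so line breaks in the markup do not matter.
            text = " ".join("".join(self._text).split())
            if self.label in text.lower():
                self.matches.append(self._href)
            self._href = None


def find_pdf_link(html, base_url, label="Download Monthly Performance Results"):
    """Return the absolute URL of the first link matching the label, or None."""
    finder = LabelledLinkFinder(label)
    finder.feed(html)
    return urljoin(base_url, finder.matches[0]) if finder.matches else None
```

Matching on the label rather than on the filename means the link can still be found if the name of the PDF changes, although it will fail if the wording of the label itself is altered.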

Creating the scraper

So, I built the bones of the scraper in a few hours over the first couple of days of the year. I tested it on the then-current PDF, which was for period nine of 2016-17. That worked: first I created the clean CSV, and later I added the DB-write routines.
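The two outputs described here, a clean CSV and a SQLite database, might look something like the sketch below. The table name, column names and the (station, measure, value) row shape are assumptions made for illustration; they are not taken from the real scraper.

```python
import csv
import sqlite3


def save_period(rows, year, period, csv_path, db_path):
    """Write one period's rows to a clean CSV file and to a SQLite table.

    rows is an iterable of (station, measure, value) tuples (assumed shape).
    """
    rows = list(rows)

    # Failsafe output: one flat CSV per period.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["year", "period", "station", "measure", "value"])
        for station, measure, value in rows:
            writer.writerow([year, period, station, measure, value])

    # Relational output: the primary key means re-running a period
    # replaces its rows instead of duplicating them.
    conn = sqlite3.connect(db_path)
    try:
        with conn:
            conn.execute(
                """CREATE TABLE IF NOT EXISTS station_measure (
                       year TEXT NOT NULL,
                       period INTEGER NOT NULL,
                       station TEXT NOT NULL,
                       measure TEXT NOT NULL,
                       value REAL,
                       PRIMARY KEY (year, period, station, measure)
                   )"""
            )
            conn.executemany(
                "INSERT OR REPLACE INTO station_measure VALUES (?, ?, ?, ?, ?)",
                [(year, period, s, m, v) for s, m, v in rows],
            )
    finally:
        conn.close()
```

Keying the records on year and period is what makes it safe to run the scraper more than once against the same report.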

Boom – number 1

I then remembered that I had downloaded the previous period’s PDF. So I modified the code (to omit the downloading routine) and ran it to test the scraping routine on it – and it blew up my code. The format of the table structure in the PDF had changed with an extra blank column to the right of the first list of station names.
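One way to make the parsing less sensitive to this kind of layout change is to discard empty cells before reading values by position. The sketch below is a hypothetical helper, not the fix that was actually applied, and it assumes that a blank cell never stands for a genuinely missing value.

```python
import csv
import io


def clean_rows(csv_text):
    """Yield each CSV row with blank cells removed and whitespace trimmed.

    Rows that contain nothing but blank cells are skipped entirely.
    """
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:
            yield cells
```

With this in place a row such as `Aberdeen,,90.1` and a row such as `Aberdeen,90.1` both come out as `['Aberdeen', '90.1']`. The trade-off is that a station with a real gap in its figures would have its remaining values shifted left, so the number of cells per row is still worth checking.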

After creating a new version and publishing that, I sat back and waited for the publication of period 10 data. That was published in the middle of this week.

Boom – number 2

I re-ran the scraper to add that new PDF to my database – and guess what? It blew up the scraper again. What had happened? Scotrail had changed the structure of the filename of the PDF – from using dashes (as in ‘performance-display-p1617-09.pdf’) to underscores (‘performance_display_p1617_10.pdf’).

That change meant that my routine for picking out the year and period, which is used to identify database records, broke. So I had to rewrite it. Not a major hassle – but it means that each new publication has necessitated a tweaking of the code. Hopefully in time the code will be flexible enough to accommodate minor deviations from what is expected without manual changes. We’ll see.
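A tolerant version of that routine could use a regular expression that accepts either separator. This is a minimal sketch based only on the two filenames quoted above; the function name is hypothetical, and the pattern assumes the filename always ends in a four-digit year code, a separator and a period number.

```python
import re

# "p", a four-digit year code, a dash or underscore, then the period number.
FILENAME_RE = re.compile(r"p(\d{4})[-_](\d{1,2})\.pdf$", re.IGNORECASE)


def parse_year_period(filename):
    """Return (year_code, period), or raise ValueError if the name is unrecognised."""
    match = FILENAME_RE.search(filename)
    if match is None:
        raise ValueError(f"Unrecognised report filename: {filename!r}")
    return match.group(1), int(match.group(2))


print(parse_year_period("performance-display-p1617-09.pdf"))  # ('1617', 9)
print(parse_year_period("performance_display_p1617_10.pdf"))  # ('1617', 10)
```

Raising an error on an unrecognised name, rather than guessing, means the next change of format shows up immediately instead of writing mislabelled records to the database.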

We’re ‘doing the wrong thing righter’ – Drucker

Of course, none of this should be necessary.

In a perfect world Scotrail would publish well-structured, machine-readable open data for performance. I did email them on 26th November 2016, long before I started the scraper, both asking for past periods’ data and asking if they wanted assistance in creating Open Data. I got a customer service reply on 7th December saying that a manager would be in touch. To date (15 Jan 2017) I’ve had no further response.

The right thing

Abellio operates the Scotrail franchise under contract to the Scottish Government.

Should the terms of such contracts not oblige the companies not only to put the monthly data into the public domain, but also to make it available as good open data, following the Scottish Government’s own strategy for Open Data? Extending the government’s open data obligation to those performing contracts for governments would be a welcome step forward for Scotland.

Chatbots and AI – #CTC8

Code the City #8, which will take place on Sat 25th to Sunday 26th February 2017, will be an exploration of the world of chatbots and AI (or Artificial Intelligence), identifying problems to tackle and quickly prototyping solutions.

>>> Book a ticket on our  Eventbrite page 

What are chat bots?

A chatbot is a piece of software that interacts with a customer or user to directly answer their questions. It uses existing data or information coupled with artificial intelligence to respond in a human-like way, guiding the user to a solution.

There are many examples of live chatbots in this exciting, emerging field. A chatbot could give you travel directions, tell you when it’s next going to rain in your area, or help you contest parking tickets. It could book you a flight and hotel, or act as a free lawyer to help the homeless get housing. The HBO series Westworld has even launched a bot to help you interact with the (fictional) holiday park!

If you are new to this field and want to get started we suggest you read the Complete Beginners Guide to Chatbots (and some of the links at the end of this article).

Example Travel Bot
Example Waste Bot

How will the weekend run?

We’ll apply our usual  Code The City methodology:

  • Bring together a diverse range of people from various backgrounds, to form teams.
  • Identify problems that we’d like to apply chatbots to solve.
  • Identify approaches, information and data, to guide how we develop the bots and train them.
  • Mix academic thinking, and user need, with open source technology and open data to develop new services
  • Iterate quickly through approaches, testing ideas, failing quickly and refining our approaches.
  • Prototype and demonstrate solutions to an interested audience

Who should attend?

  • Service owners – and service providers
  • Academics and students in the field of chatbots and artificial intelligence
  • Coders
  • Data specialists
  • Front-end and UX designers
  • Bloggers and social media practitioners
  • Anyone with an interest in getting involved in creating bots even for fun!

What will you do?

You will create mixed teams to workshop chatbot solutions to real world issues. Maybe these will build on the outputs of previous work we’ve done at CodeTheCity. Through rapid prototyping you will create new applications and have some fun in the process.

We’ll show you new techniques for service design, idea generation, prototyping, and rapid iterative application development – and you will show other participants some tricks and approaches, too. We’ll share knowledge and learning.

You might even get a T-shirt, and we can guarantee the best catering of any weekend workshop in the city!

To book a free ticket visit our Eventbrite page. But be quick, as tickets will go swiftly!

All attendees will get a year’s free membership of the Open Data Institute.

You can find out more about the previous events on tumblr, on the eventifier, and on flickr.

If you have any questions please get in touch.

How can I support this event?

If you are interested in sponsoring this event, or providing other support such as access to online tools or services, please get in touch.

Useful Articles and Resources


Journeygrid

“We should build one for here!”

So starts another Codethecity conversation on discovering a neat data-driven tool. This time it’s the excellent New York subway toy created by Jason Wright.

Brand_New_Subway

The tool allows you to redesign transit provision in the city by building new subway routes. By adding new stations. By removing or moving existing lines.

It’s addictive and fascinating.

As is so often the case, we then start riffing on what it could also do. It could time travel using that tram data we have from the early 1900s. It could give alternate route options if we hook up to that academic project we spoke with earlier in the year. It could carbon count. It could give safety information for cyclists. We could data collect with a new app to feed it improved validation data…

Before we have the cake we’re discussing how pretty the icing will look.

In reality what we should be looking at is the bottom layer. The underpinnings.  The data.

Where do people live? Where do they work? Where do they do the school run? Where is the football stadium and where do the fans live? Where are the shops and where is the money?

We’re going to start with the commute. Where do people start, spend, and end their day? How do they move around? And when? No agenda. No grand insights planned. Just a good solid data gathering and modelling project.

We’re calling it journeygrid.

journeygrid open data transportation project

If you have any data, or methodologies for gathering and storing such data we’d love to speak to you.

You can find out more about the New York Subway project here, and you can play with it here.

Tourism Hack – Perth – TBC

PLEASE NOTE – Due to low take-up this event has been postponed. We are sorry for any inconvenience this will cause. 

Perth wants to boost its tourism offer and wants some help! They want to see whether some well-developed apps could help the city and its wider area bring attractions, trails, events, culture, accommodation, eateries and activities to life.

They are also interested in bringing the quirky and interesting aspects of the city together, using great images and interesting user generated content through social media.

==================================================
=
= Update
=
= Data sources added on GitHub
=
==================================================

They have developed the website http://www.perthcity.co.uk/ and there is an app (http://www.mi-perthshire.co.uk/) but want some creative minds to take a fresh look at the city and surrounding area, and generate new ideas that they could then develop into some new apps, open data or other projects.

As always we’re looking for coders, designers, data wranglers, service users and providers, bloggers – in fact anyone with an interest – to join us for a weekend of ideation, creation, open data and rapid prototyping.

We’ll feed you, keep you stimulated, and provide good wifi. You will leave with a sense of accomplishment, new skills and potentially new friends.

Accommodation.

We’ve uploaded a list of hotels in this Perth City Accommodation List.

In addition there is a cluster of B&Bs on Dunkeld Road.

Also, just outside the city itself, The Lodge at the Perth Racecourse are offering a flat rate of £90 per night in a Double or Twin bedded room (£45 per person), which also includes a full breakfast. See  http://perthlodge.co.uk/dining