2. Problem - more local scrutiny data than we can keep up with
1 - solo reporters/journalists/bloggers
2 - papers with (sadly) diminishing editorial staff
3 - civil society
4 - councils themselves
3. The trends aren’t going to reverse
Massive increase in local accountability (devolution)
Increases in local data
People won’t scrutinise this stuff themselves
Armchair auditors didn’t work
Resources in media declining
Bad for communities, democracy
How can we do more scrutiny, with fewer resources
6. Local News Engine
Prototype funded by Google Digital News Initiative – €50,000
AT LAST I CAN BUY IN
PROPER CODERS
Open Data Services Co-operative – world class
7. Pile up the newsworthy scrutiny data
My patch covers parts of two central London boroughs - Camden (very large) and Islington (very small)
Data about building or altering houses, opening or changing pubs, bars and clubs, sex shops, gambling
establishments, people due to be in court.
Camden planning applications - data store download
Camden commercial licensing - scraped
Islington planning applications - scraped
Islington commercial licensing - scraped
Magistrates Court list (upcoming cases) - parsed from pdf to data
This is novel (we think)
8. Datastore v scrapers – no contest
on the time that the scrapers take to run, and the range of data that’s included in
them. In order to speed up the scrapers and to ensure that the data was
comparable, we spun up some VMs on Google Compute Engine to run the
scrapers.
Camden Licensing: 38.4h runtime, data back to 2005
Camden Planning: 2 min runtime, data back to 2010
Islington Licensing: 39.5h runtime, data back to 2006
Islington Planning: 16.2h runtime, data back to 2006
9. Sort out the newsworthy people
By names - a newsworthy person appearing in a newsworthy data set could be
newsworthy. (Very) literally, everyone who has been in the newspaper is
newsworthy.
Performed entity extraction on Camden New Journal and Islington Tribune,
producing all the names of people and companies who had appeared in them.
Run geospatial search for all data with addresses in target area
Run list of 1,000-odd names from entity extraction as a search
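The name search can be sketched roughly as below. This is a minimal illustration, not the project's code: the record layout (a "name" field), the title list and the normalisation rules are all assumptions, and it takes the extracted names as given rather than performing the entity extraction itself.

```python
import re

# Hypothetical list of titles to ignore when comparing names.
TITLES = {"mr", "mrs", "ms", "miss", "dr"}

def normalise(name):
    """Lower-case a name, drop titles and punctuation, collapse whitespace."""
    words = re.findall(r"[a-z'-]+", name.lower())
    return " ".join(w for w in words if w not in TITLES)

def find_newsworthy(records, newsworthy_names):
    """Return the records whose normalised name is in the newsworthy list."""
    wanted = {normalise(n) for n in newsworthy_names}
    return [r for r in records if normalise(r["name"]) in wanted]

# Illustrative data only; the "name" and "dataset" fields are assumed.
records = [
    {"name": "Mrs Cherry Smith", "dataset": "planning"},
    {"name": "A. N. Other", "dataset": "licensing"},
]
hits = find_newsworthy(records, ["Cherry Smith"])
```

Exact matching after normalisation keeps the sketch simple; it will miss spelling variants and will treat two different people with the same name as one.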
11. Sort newsworthy places
By place - simple things happening in some places are news in their own right - eg
a planning application or someone in court.
Users have a wide definition of what is an interesting place - for some the whole
borough, for others a particular ward/street
All the data has reasonable address information
Define area of interest by wards (for now; this can be made more precise, down to SOAs)
Perform geospatial search
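One way to picture the geospatial search is a point-in-polygon test of each record's coordinates against the ward boundaries. The sketch below uses the standard ray-casting test; the record fields ("x", "y"), the ward name and the square boundary are invented for illustration, and real ward boundaries would come from published boundary data.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def records_in_wards(records, wards):
    """Return records whose coordinates fall inside any ward of interest."""
    return [
        r for r in records
        if any(point_in_polygon((r["x"], r["y"]), poly) for poly in wards.values())
    ]

# Illustrative only: a made-up square "ward" and two made-up records.
wards = {"Example ward": [(0, 0), (4, 0), (4, 4), (0, 4)]}
records = [{"id": 1, "x": 2, "y": 2}, {"id": 2, "x": 5, "y": 2}]
matches = records_in_wards(records, wards)
```

Swapping the ward polygons for smaller areas (such as SOAs) would narrow the search without changing the code.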
12. Data Issues - DPA exemptions
Data is published by arms of government for public scrutiny. Special purposes exemption in DPA covers processing:
‘This exemption protects freedom of expression in journalism, art and literature (which are known as the ‘special purposes’).
The scope of the exemption is very broad and it can exempt from most provisions of the DPA, including subject access –
but never principle 7 or the section 55 offence (unlawful obtaining etc of personal data).
However it does not give an automatic blanket exemption. In order for the exemption to apply:
the data must be processed only for journalism, art or literature,
it must be being processed with a view to publication,
you must have a reasonable belief that the publication is in the public interest, and
you must have a reasonable belief that compliance with the DPA is incompatible with journalism, art or literature.’
13. Data issues - access and licensing
Council data mainly had to be scraped - only one dataset in a modern data store.
Data therefore not licensed properly; asked council, they relaxed
Court PDFs can be accessed by a journalist with good reason, but each
court varies. Court information is very sensitive - contains juveniles, cases with reporting
restrictions etc. Must be handled with great care - contempt and no-fault
defamation.
British principle of open justice behind access, but poorly implemented.
14. Issues and questions
Ethics (for citizens) – extension of journalism ethics as scrutiny becomes
Despite open data, accessibility of local data is rubbish
Still requires good coding skills – ODS world class – code on Talk About Local
GitHub
Court lists – Japanese puffer fish of data
15. Sorting Criteria (emerging)
Names
Broadly based on proper noun (‘named entity’) extraction from CNJ and Islington Tribune and people who crop up more
than once.
‘A name will appear in the search results if ANY of its related entries match the search criteria. So, in the case of the SMITH
record, Mrs Cherry Smith had a planning application in Caledonian ward in 2006 to cut down a tree, hence the match.
We’ve got a couple of ideas for solutions. One is to show on the result when the date of the most recent match is, another is
to expose date UI. They are of differing complexities, though.
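The "ANY related entry" rule, and the first proposed fix (showing the date of the most recent match), can be sketched as follows. The entry fields ("dataset", "ward", "year") and the function name are assumptions made for illustration; only the 2006 Caledonian ward planning entry comes from the example above.

```python
def name_matches(entries, predicate):
    """A name matches if ANY of its entries satisfies the search predicate.

    Returns (matched, most_recent_year), where most_recent_year is the latest
    year among the matching entries, or None if nothing matched.
    """
    matching = [e for e in entries if predicate(e)]
    if not matching:
        return False, None
    return True, max(e["year"] for e in matching)

# The SMITH record from the example: one planning entry from 2006.
smith_entries = [{"dataset": "planning", "ward": "Caledonian", "year": 2006}]

# A search restricted only by ward still surfaces the old entry...
matched, latest = name_matches(smith_entries, lambda e: e["ward"] == "Caledonian")

# ...whereas a search that also filters on date (the second idea) would not.
recent, _ = name_matches(
    smith_entries, lambda e: e["ward"] == "Caledonian" and e["year"] >= 2015
)
```

Showing `latest` next to the result lets a reader see at a glance that the match is an old one, without needing a date filter in the interface.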
Areas
‘We try to get one or more locations associated with a result (eg address of defendant, location of crime, location of
planning application) and if one or more of those locations is either in the postcode prefix list "N1", "N7", "WC1", "NW1" or
the words “islington” or “camden” appear in a field that we think might contain a description of a location, it matches.’
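The area rule described above can be sketched as below. This is an illustration under stated assumptions, not the project's implementation: it treats each location as a free-text string, and it reads "postcode prefix" as the postcode district (so "N1 0PQ" matches "N1" but "N10 1AA" does not), which the original description does not spell out.

```python
import re

# Prefixes and area words taken from the criteria above.
POSTCODE_DISTRICTS = {"N1", "N7", "WC1", "NW1"}
AREA_WORDS = ("islington", "camden")

# Rough UK postcode pattern: district (e.g. "WC1"), optional letter, inward code.
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d{1,2})[A-Z]?\s*\d[A-Z]{2}\b")

def location_matches(text):
    """True if the text contains a postcode in a target district or an area word."""
    for m in POSTCODE_RE.finditer(text.upper()):
        if m.group(1) in POSTCODE_DISTRICTS:
            return True
    lower = text.lower()
    return any(word in lower for word in AREA_WORDS)

def result_matches(locations):
    """A result matches if one or more of its locations matches."""
    return any(location_matches(loc) for loc in locations)

# Illustrative addresses only.
in_area = result_matches(["1 Example Street, London N1 0PQ"])
by_word = result_matches(["Town Hall, Camden"])
outside = result_matches(["1 Example Road, London N10 1AA"])
```

Matching on the words alone is deliberately loose: a street named "Camden Road" in another borough would also match, so results still need a human eye.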