Presentation given by Chris Taggart, CEO and Co-Founder of OpenCorporates at Open Knowledge Festival, Geneva, September 2013
Discussing benefits and quality of open corporate hierarchy (network) data
7. Even though open data is better (than closed/proprietary)
• Better for innovation
• Better for competition
• Better for efficiency
• Better for sharing (esp. cross-organisation or cross-border)
10. But open has a secret weapon
http://www.flickr.com/photos/x-ray_delta_one/8493335701/sizes/l/in/photostream/
11. It’s better quality too
http://www.flickr.com/photos/infusionsoft/4484373179/sizes/l/in/photostream/
12. Common proprietary data quality issues
Problem | Cause
Data accuracy | Data is re-keyed. Few eyeballs. Often little downside to lying
Gaps in data | High (& often duplicated) cost of data entry. Limited to payers
Lack of granularity | Legacy systems/data models hard to reengineer in closed world
Errors go uncorrected | Few feedback mechanisms
Black box/No provenance | Can’t reveal (sometimes dubious) sources. Limits usefulness/trust
Isolated | Proprietary IDs are internal identifiers & are barriers to sharing & improved data quality
20. Hugely important (and valuable)
• The dataset we need to understand the corporate world
• Who we (or the government) are really doing business with
• Political influence/donations/lobbying
• Tax/resource extraction
• Corporate governance
• Credit risk
21. But proprietary datasets on this are problematic
• Expensive, so relatively few users
• Huge gaps in data
• Use proprietary IDs (so not clear what they refer to)
• Restrictive licences
• Opaque – no info re calculations, provenance or confidence
Result: low-quality data
38. The company that wants to know your network... every friend... every interaction
http://www.flickr.com/photos/jeffmcneill/5260815552/sizes/l/
why bother?
41. This is what we got from their SEC filings as text (and turned into data)
Facebook, Inc
• Pinnacle Sweden AB
• Vitesse LLC
• Facebook Operations LLC
• Facebook Ireland Limited
• Edge Network Services Limited
• Andale Acquisition Corp
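The step from filing text to data on this slide can be sketched roughly as follows. This is a minimal illustration, not the method actually used by OpenCorporates: it assumes the filing yields one subsidiary name per line, and the names `ControlRelationship`, `subsidiaries_to_relationships`, the `source` string and the `confidence` label are all hypothetical. Each record carries its provenance, which is the property the proprietary datasets above are said to lack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlRelationship:
    """One parent -> child link in a corporate hierarchy (hypothetical model)."""
    parent: str
    child: str
    source: str       # provenance: where the statement came from
    confidence: str   # hypothetical label describing how the link is known


def subsidiaries_to_relationships(parent, filing_text, source):
    """Turn a plain-text list of subsidiaries (one name per line) into records."""
    relationships = []
    for line in filing_text.splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines in the extracted text
        relationships.append(
            ControlRelationship(
                parent=parent,
                child=name,
                source=source,
                confidence="stated in filing",
            )
        )
    return relationships


# Subsidiary names as listed on the slide.
filing_text = """
Pinnacle Sweden AB
Vitesse LLC
Facebook Operations LLC
Facebook Ireland Limited
Edge Network Services Limited
Andale Acquisition Corp
"""

records = subsidiaries_to_relationships(
    "Facebook, Inc", filing_text, source="SEC filing (subsidiary list)"
)
for record in records:
    print(f"{record.parent} -> {record.child} [{record.source}]")
```

A flat subsidiary list only says that the filer names these companies; it does not say which subsidiary sits under which, so every record here points straight at the filer.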
42. Then we started investigating
Facebook Ireland Limited
Edge Network Services Limited
Pinnacle Sweden AB
Vitesse LLC
Facebook Operations LLC
Andale Acquisition Corp
Facebook, Inc