This document summarizes the privacy implications of publicly available NYC taxi trip data. It describes how one researcher was able to de-anonymize the taxi driver and vehicle data, which included sensitive information like the exact pickup and dropoff locations and times of passenger trips. This could reveal passengers' private information like home addresses or locations frequented. The document discusses the debate around privacy rights of taxi drivers and passengers in relation to this type of public data. It notes efforts by the TLC to address these issues in future open data releases.
10. “…you will need to send or bring an
external hard drive with a minimum
capacity of 200 GB to the TLC
offices. The address is listed below.
The hard drive must be brand new,
still in the box and unopened.”
13. Fare Data – 12 CSVs (~2GB
each)
• Payment Type
• Fare
• Tax
• Tip (Credit Card Only)
• Tolls
Trip Data – 12 CSVs (~2GB each)
• Medallion
• Driver
• Pickup Time and Location
• Dropoff Time and Location
173 Million Trips in 2013!
20. “…one specific driver seemed to be
doing an incredible amount of
business…”
CFCD208495D565EF66E7DFF9F98764DA
“After a little bit of poking around, I realised that that code is
actually the MD5 hash of the character ‘0’. This proved my
suspicion that this was actually a data collection error, but
also made me immediately realise that the entire
anonymization process was flawed and could easily be
reversed.”
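The observation in the quote can be checked directly. This is a minimal sketch using Python's standard hashlib module; the variable names are illustrative and do not come from the original analysis.

```python
import hashlib

# The suspicious identifier quoted on the slide.
suspicious_id = "CFCD208495D565EF66E7DFF9F98764DA"

# MD5 of the single character '0', upper-cased to match the dataset's formatting.
digest_of_zero = hashlib.md5(b"0").hexdigest().upper()

# Compare the two values.
print(digest_of_zero == suspicious_id)
```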
21. Driver and Vehicle De-Anonymization
hash = md5(medallionNumber)
Medallion: 6B111958A39B24140C973B262EA9FEA5
Hack License: D3B035A03C8A34DA17488129DA581EE7
There are only 19 million possible values!
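Because the input space is small, every candidate can be hashed in advance and the hashes looked up in reverse. The sketch below shows the idea on a deliberately tiny space. The identifier pattern (digit, letter, digit, digit) and the helper names md5_upper and build_lookup are assumptions made for illustration; the slides do not state the real identifier formats.

```python
import hashlib
import string
from itertools import product


def md5_upper(text):
    """Return the upper-case hex MD5 digest of an ASCII string."""
    return hashlib.md5(text.encode("ascii")).hexdigest().upper()


def build_lookup():
    """Map hash -> plaintext for every candidate in the assumed pattern."""
    # ASSUMPTION: a hypothetical digit-letter-digit-digit pattern such as "5X55".
    lookup = {}
    for d1, letter, d2, d3 in product(
        string.digits, string.ascii_uppercase, string.digits, string.digits
    ):
        candidate = f"{d1}{letter}{d2}{d3}"
        lookup[md5_upper(candidate)] = candidate
    return lookup


lookup = build_lookup()        # 10 * 26 * 10 * 10 = 26,000 entries
made_up = md5_upper("5X55")    # hash of a made-up identifier, not real data
print(lookup[made_up])         # the reverse lookup recovers the plaintext
```

Scaling the candidate set up to the 19 million values mentioned on the slide changes only the enumeration step. An unsalted MD5 over so few inputs can be reversed by exhaustive search.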
27. Issue 1 – Privacy of Taxi
Drivers/Taxi Companies
• Do cab drivers have a reasonable expectation of privacy
when operating their cabs?
• If data is required to be submitted to the TLC, does it
belong to the people as well?
• What are the real harms to drivers from this de-anon?
• What are the benefits of driver/vehicle data to the public?
28. HASSAN v YASKEY – U.S. District Court – Southern District of New York
January 29, 2014
31. A consequence of the De-anon
• As of July 2014, the TLC no longer includes the medallion
and hack license columns in trip data releases.
32. Issue 2 – Passenger Privacy
• Dataset as Lookup Table: If you know when and where a
trip started, you can find out when and where it ended.
• Patterns in the data around time and location could
represent a single person or class of people.
* Neither of these requires vehicle or driver information.
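The "lookup table" point can be sketched as a simple filter: an observer who knows roughly when and where a ride began selects the matching rows and reads off the destination. The rows below are synthetic, and the field names, tolerances, and the find_trips helper are hypothetical, not the TLC's actual schema.

```python
from datetime import datetime

# Synthetic rows for illustration only; field names are hypothetical.
trips = [
    {"pickup_time": datetime(2013, 7, 8, 23, 5), "pickup": (40.7411, -73.9897),
     "dropoff_time": datetime(2013, 7, 8, 23, 21), "dropoff": (40.7794, -73.9632)},
    {"pickup_time": datetime(2013, 7, 8, 23, 6), "pickup": (40.7580, -73.9855),
     "dropoff_time": datetime(2013, 7, 8, 23, 30), "dropoff": (40.6782, -73.9442)},
    {"pickup_time": datetime(2013, 7, 9, 8, 15), "pickup": (40.7411, -73.9897),
     "dropoff_time": datetime(2013, 7, 9, 8, 40), "dropoff": (40.7061, -74.0087)},
]


def find_trips(trips, when, where, max_minutes=5, max_degrees=0.001):
    """Return trips whose pickup is close to a known time and place."""
    matches = []
    for trip in trips:
        minutes_apart = abs((trip["pickup_time"] - when).total_seconds()) / 60
        lat_apart = abs(trip["pickup"][0] - where[0])
        lon_apart = abs(trip["pickup"][1] - where[1])
        if (minutes_apart <= max_minutes
                and lat_apart <= max_degrees
                and lon_apart <= max_degrees):
            matches.append(trip)
    return matches


# Someone seen entering a cab at about 23:04 near (40.7412, -73.9896).
seen = find_trips(trips, datetime(2013, 7, 8, 23, 4), (40.7412, -73.9896))
for trip in seen:
    print(trip["dropoff_time"], trip["dropoff"])
```

No medallion or hack license field appears anywhere in the filter, which is the point of the footnote above.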
37. “Examining one of the clusters in the map above
revealed that only one of the 5 likely drop-off
addresses was inhabited; a search for that address
revealed its resident’s name.
In addition, by examining other drop-offs at this
address, I found that this gentleman also frequented
such establishments as “Rick’s Cabaret” and
“Flashdancers”.
Using websites like Spokeo and Facebook, I was
also able to find out his property value, ethnicity,
relationship status, court records and even a profile
picture!”
39. More Policy Implications
• Taxi data is collected using T-PEP systems and submitted by the system
vendors
• Limousine companies must keep trip records and have them available
for inspection
• TLC recently proposed a rule change that would require limo companies
(including Uber) to submit electronic tripsheet data.
Uber NYC Testimony to the TLC – Oct 16, 2014
“Finally, the TLC's proposal would require base owners to transmit all
records to -- all trip records to the TLC but does not protect passenger
privacy and may place it at risk. Such sensitive trip data could be
disclosed either purposely through a third-party request or inadvertently
to a wider audience, potentially undermining the privacy of drivers,
passengers and bases. The rules don't provide for anonymization of the
data or explain how trip records will be kept confidential.”
40. More Policy Implications
Uber NYC Testimony to the TLC – Oct 16, 2014
“…Right. But it's not just the area. It's the lat-long so it's the exact point,
which if cross-referenced with say a paparazzi photo, as happened
recently with open taxi data could indicate where somebody lives. And we
just think that to achieve the goal of the TLC, there are other types of
data.”
45. • Shared via Torrent and Direct Download
• Urban Data Nerds Rejoice!
• Lots of Interesting Projects and Analysis
50. Discussion
• Do drivers have a right to privacy when operating their cabs? Have they
consented to this when agreeing to the TLC’s rules?
• Should the public be made aware that the details of taxi trips become public
record?
• Is the tripsheet data just too detailed? Would “fuzzing” it sacrifice the
analytical benefit?
• Is the TLC right to eliminate vehicle and driver columns from responses to future FOIL
requests?
• Should tripsheet data be Open Data?
• Should all vehicles regulated by the TLC be required to submit the same
data?
• Who should be responsible for anonymization of public data in the future?
What controls can we put on the process?
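As a starting point for the "fuzzing" question, here is one possible reading of the term: coarsen each coordinate and timestamp before release. The function names, the two-decimal grid, and the hourly bucket are illustrative assumptions, not something the TLC has proposed, and the sample point is made up.

```python
from datetime import datetime


def fuzz_point(lat, lon, decimals=2):
    """Round a coordinate pair to a coarser grid."""
    # 0.01 degrees of latitude is roughly 1.1 km, so two decimals
    # places a point in a neighborhood-sized cell rather than at a doorstep.
    return (round(lat, decimals), round(lon, decimals))


def fuzz_time(timestamp):
    """Truncate a timestamp to the start of its hour."""
    return timestamp.replace(minute=0, second=0, microsecond=0)


# A made-up pickup record, not taken from the dataset.
exact_point = (40.758896, -73.985130)
exact_time = datetime(2013, 7, 8, 23, 5, 42)

print(fuzz_point(*exact_point))
print(fuzz_time(exact_time))
```

Coarser cells make it harder to pin a trip to one address but also blur block-level analysis, which is the trade-off the question raises.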