Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Privacy by design

1.039 Aufrufe

Veröffentlicht am

Privacy and personal integrity has become a focus topic, due to the upcoming GDPR deadline in May 2018. GDPR puts limits on data storage, retention, and access, and also give users rights to have their data deleted and get information about the data stored. This constraints technical solutions, and makes it challenging to build systems that efficiently make use of sensitive data. This talk provides an engineering perspective on privacy. We highlight pitfalls and topics that require early attention. We describe technical patterns for complying with the "right to be forgotten" without sacrificing the ability to use data for product features. The content of the talk is based on real world experience from handling privacy protection in large scale data processing environments.

Veröffentlicht in: Daten & Analysen
  • Als Erste(r) kommentieren

Privacy by design

  1. 1. Privacy by design Jfokus, 2018-02-07 Lars Albertsson www.mapflat.com 1
  2. 2. Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat (independent data engineering consultant) ○ ~15 clients: Spotify, 3 banks, 3 conglomerates, 4 startups, 5 *tech, misc 2
  3. 3. Privacy protection resources 3 All of this might go wrong. Large fine. Pour your data into our product. 404
  4. 4. Privacy by design ● Required by GDPR ● Technical scope ○ Engineering toolbox ○ Puzzle pieces - not complete solutions ● Assuming that you solve: ○ Legal requirements ○ Security primitives ○ ... ● Disclaimers: ○ This is not a description of company X ○ This is not legal / compliance advice 4 Archi- tecture Statis- tics Legal Organi- sation Security Process Privacy UX Culture
  5. 5. Requirements, engineer’s perspective ● Right to be forgotten ● Limited collection ● Limited retention ● Limited access ○ From employees ○ In case of security breach ● Consent for processing ● Right for explanations ● Right to correct data ● User data enumeration ● User data export 5
  6. 6. Ancient data-centric systems ● The monolith ● All data in one place ● Analytics + online serving from single database ● Current state, mutable - Please delete me? - What data have you got on me? - Please correct this data - Sure, no problem! 6 DB Presentation Logic Storage
  7. 7. Event-oriented / big data systems 7 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value
  8. 8. Event-oriented / big data systems 8 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value ● Motivated by ○ New types of data-driven (AI) features ○ Quicker product iterations ■ Data-driven product feedback (A/B tests) ■ Democratised data - fewer teams involved in changes ○ Robustness - scales to more complex business logic Enable disruption
  9. 9. Data lake Data processing at scale 9 Cluster storage Batch processing AI feature DatasetJob Pipeline Data-driven product development Analytics Cold store Ingress Egress
  10. 10. www.mapflat.com Workflow orchestrator ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ○ resources are available ● Backfill for previous failures ○ Robust system from fragile components ● DSL describes DAG ○ Includes ingress & egress The most important big data component - it keeps you sane Recommended: Luigi / Airflow 10 DB Orchestrator
  11. 11. Factors of success 11 Functional architecture: ● Event-oriented - append only ● Immutability ● At-least-once semantics ● Reproducibility ○ Through 1000s of copies ● Redundancy - Please delete me? - What data have you got on me? - Please correct this data - Hold on a second...
  12. 12. Solution space 12 Technical feasibility Easy to do the right thing Awareness culture
  13. 13. Personal information (PII) classification You need to establish a field/dataset classification. Example: Is application content sensitive? Depends. ● Music, video playlists - perhaps not ● Running tracks, taxi rides - apparently ● In-application messages - probably 13 ● Red - sensitive data ○ Messages ○ GPS location ○ Views, preferences ● Yellow - personal data ○ IDs (user, device) ○ Name, email, address ○ IP address ● Green - insensitive data ○ Not related to persons ○ Aggregated numbers ● Grey zone ○ Birth date, zip code ○ Recommendation / ads models?
  14. 14. PII arithmetics ● Most sensitive data wins ○ red + green = red ○ red + yellow = red ○ yellow + green = yellow ● Aggregation decreases sensitivity ○ sum(red/yellow) = green ? ● Combinations may increase sensitivity ○ green + green + green = yellow ? ○ yellow + yellow + yellow = red ? ● Machine learning models store hidden information ○ model(yellow) = yellow or green ? ○ Overfitting => persons could be identified 14
  15. 15. Make privacy visible at ground level Suggestions: ● In dataset names ○ hdfs://red/crm/received_messages/year=2017/month=6/day=13 ○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13 ● In field names ○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …” ● In credential / service / table / ... names ● In metadata ● Spreads awareness ● Catch mistakes in code review ● Enables custom tooling for violation warnings - Difficult to change privacy level 15
  16. 16. Eye of the needle tool ● Provide data access through gateway tool ○ Thin wrapper around Spark/Hadoop/S3/... ○ Hard-wired configuration ● Governance ○ Access audit, verification ○ Policing/retention for experiment data 16
  17. 17. Eye of the needle tool ● Easy to do the right thing ○ Right resource choice, e.g. “allocate temporary cluster/storage” ○ Enforce practices, e.g. run jobs from central repository code ○ No command for data download ● Enabling for data scientists ○ Empowered without operations ○ Directory of resources 17
  18. 18. Possible strategy: Privacy protection at ingress Scramble on arrival + Simple to implement - Limits value extraction - Deanonymisation possible IMHO not a feasible strategy 18
  19. 19. Privacy protection at egress Processing in opaque box + Enabling + Simpler to reason about - Strict operations required - Exploratory analytics need explicit egress / classification 19 Machines are allowed to see intermediate data Humans & services interact with exported data
  20. 20. Permission to process ● Processing personal data requires a sanction ○ Business motive is not sufficient ● Explicit sanction ○ Consent from user ○ Necessary to perform core service ● Implicit sanction ○ Required by regulations ■ Detect money laundry, fraud, abuse ■ Bookkeeping ● Not exempt user ○ Not underage ○ Not politically exposed person ○ No hidden identity 20
  21. 21. ● Consent applies at processing date, not collection date class BiPageView(Task): date = DateParameter() def requires(self): return [PageView(self.date), User(self.date), BIConsent.latest()] Consent workflow 21 User BIConsent BIOkUser PageView User BIPage View Normal decoration join - same date User BIOkUser User BIConsent BIPage View Consent join - always latest User PageView
  22. 22. Towards oblivion ● Left to its own devices, personal (PII) data spreads like weed ● PII data needs to be governed, collared, or discarded ○ Discard what you can 22
  23. 23. ● Discard all PII ○ User id in example ● No link between records or datasets ● Replace with non-PII ○ E.g. age, gender, country ● Still no link ○ Beware: rare combination => not anonymised Drop user id Discard: Anonymisation 23 Replace user id with demographics Useful for business insights Useful for metrics
  24. 24. Partial discard: Pseudonymisation ● Hash PII ● Records are linked ○ Across datasets ○ Still PII, GDPR applies ○ Persons can be identified (with additional data) ○ Hash recoverable from PII ● Hash PII + salt ○ Hash not recoverable ● Records are still linked ○ Across datasets if salt is constant 24 Hash user id Useful for recommendations Hash user id + salt Useful for product insights
  25. 25. ● Push reruns with workflow orchestrator - No versioning support in tools - Computationally expensive - Easy to miss datasets - PII in cleartext everywhere + No data model changes required + Usually necessary for egress storage Governance: Recomputation 25
  26. 26. ● Fields reference PII table ● Clear single record => oblivion - PII table injection needed - Key with UUID or hash - Extra join - Multiple or wide PII tables + PII table can be well protected Ejected record pattern 26
  27. 27. ● Datasets are immutable - must not remove records ● Version n+1 of raw dataset lacks record ● Short retention of old versions ● Always depend on latest version ○ What about changing PII, e.g. address? Need versioning in data model? Record removal in pipelines 27 User PII 2017-06-12 User PII 2017-06-13 class Purchases(Task): date = DateParameter() def requires(self): return [Users(self.date), Orders(self.date), UserPII.latest()]
  28. 28. ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Requires user-defined function in SQL? - Decryption (user) id needed + Multi-field oblivion + Single dataset leak → no PII leak + Handles changing PII fields Lost key pattern 28
  29. 29. ● Different fields encrypted with different keys ● Partial user oblivion ○ E.g. forget my GPS coordinates Lost key partial oblivion 29
  30. 30. ● Encrypt key fields that link datasets ● Ability to join is lost ● No data loss ○ Salt => anonymous data ○ No salt => pseudonymous data Lost link key 30
  31. 31. Reversible oblivion ● Lost key pattern ● Give ejected record key to third party ○ User ○ Trusted organisation ● Destroy company copies 31
  32. 32. ● Input: ○ Page view events ○ User account creations ○ User deletion requests ● Business job outputs: ○ Web daily active user count, per country ○ Duplicate display name detection → email Example: Lost key pattern 32 Raw NewUser Raw Forgotten User WebDau UserName Duplicate User service DB Raw PageView Web app
  33. 33. ● Split RawNewUser ○ Encryption key ○ Non-PII + encrypted PII Example: Lost key pattern 33 NewUser NewUser Key Raw NewUser Raw Forgotten User WebDau UserName Duplicate User service DB Raw PageView Web app Joinable
  34. 34. ● UserKey = latest user encryption keys ○ Recursive - depends on yesterday ○ Yesterday's + new - forgotten ● User = all users ever seen ○ Recursive ○ Yesterday's + new ■ grows forever ○ Encrypted PII Example: Lost key pattern 34 NewUser NewUser Key Raw NewUser User Raw Forgotten User WebDau UserName Duplicate UserKey User service DB Raw PageView Web app Joinable
  35. 35. ● Encrypt page view PII ○ Pseudonymised ● WebDau aggregation requires no PII ● UserNameDuplicate requires email for push ○ Depend on UserKey.latest for decrypting email in User ○ Egress DB should have limited retention Example: Lost key pattern 35 NewUser NewUser Key Raw NewUser User Raw Forgotten User WebDau UserName Duplicate UserKey User service Latest day dependency DB Raw PageView Web app PageView Joinable Must have retention
  36. 36. Tombstone line ● Produce dataset/stream of forgotten users ● Egress components, e.g. online service databases, may need push for removal. ○ Higher PII leak risk 36 DB Service
  37. 37. The art of deletion ● Example: Cassandra ● Deletions == tombstones ● Data remains ○ Until compaction ○ In disconnected nodes ○ ... Component-specific expertise necessary 37
  38. 38. Deletion layers ● Every component adds deletion burden ○ Minimise number of components ○ Ephemeral >> dedicated. Recycle machines. ● Every storage layer adds deletion burden ○ Minimise number of storage layers ○ Cloud storage requires documented erasure semantics + agreements. ● Invent simple strategies ○ Example: Cycle Cassandra machines regularly, erase block devices. Increasing cost of heterogeneity & on premise storage. 38
  39. 39. Data model deadly sins ● Using PII data as key ○ Username, email ● Publishing entity ids containing PII data ○ E.g. user shared resources (favourites, compilations) including username ● Publishing pseudonymised datasets ○ They can be de-pseudonymised with external data ○ E.g. AOL, Netflix, ... 39
  40. 40. Retention limitation ● Best solved in workflow orchestration ○ Creation and destruction live together ● Short default retention ○ Whitelist exceptions with long retention ● In conflict with technical ideal of immutable raw data 40
  41. 41. Cluster storage Cluster storage Lake freeze ● Remove expire raw dataset, freeze derived datasets ● Workflow DAG still works 41 Cold store Derived Cold store Derived
  42. 42. www.mapflat.com What about streaming? 42 Job Ads Search Feed App App App StreamStream Stream Stream Stream Stream Job Job Stream Stream Stream Job Job Data lake Business intelligence Job ● Unified log - bus of all business events ○ Streams = infinite datasets ● Pipelines with stream processing jobs ○ Governance & reprocessing difficult ● Ejected record & lost key patterns work ○ PII or encryption key in database table ● Retention is naturally limited
  43. 43. www.mapflat.com Correcting invalid data = human in the loop ● Humans are lousy data processors ○ Expensive to execute ○ Not completely deterministic ○ Not ready to kick off at 2 am ○ Don't read Avro very well ○ Not compatible with CI/CD 43 ● Add human curation to cold store ○ Pipeline job merges human curation input ○ Overrides data from other sources Curation service Human overrides
  44. 44. Resources Credits● https://www.slideshare.net/lallea/protectin g-privacy-in-practice ● http://www.slideshare.net/lallea/data- pipelines-from-zero-to-solid ● http://www.mapflat.com/lands/resources/re ading-list ● https://ico.org.uk/ ● EU Article 29 Working Party ● ENISA: "Privacy by design in big data" ● GDPR-podden 45 ● Alexander Kjeldaas, independent ● Lena Sundin, independent ● Oscar Söderlund, Spotify ● Oskar Löthberg, Spotify ● Sofia Edvardsen, Sharp Cookie Advisors ● Øyvind Løkling, Schibsted Media Group ● Enno Runne, Baymarkets

×