Taming the Compliance Beast:
Lessons learnt at LinkedIn
Sept 28, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn

Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strata NYC 2017]

70,203 views

Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists, but at the same time, you need to preserve the privacy of your members, who have entrusted you with their data.

Shirshanka Das and Tushar Shanbhag outline the path LinkedIn has taken to protect member privacy in its scalable distributed data ecosystem built around Kafka and Hadoop.
They also discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a centralized metadata system, a standardized data lifecycle management platform, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future—specifically, to the General Data Protection Regulation, which comes into effect in 2018—and outline LinkedIn’s plans for addressing those requirements.

But technology is just part of the solution. Shirshanka and Tushar also share the culture and process change they’ve seen happen at the company and the lessons they’ve learned about sustainable process and governance.

Published in: Technology

  1. Taming the ever-evolving Compliance Beast: Lessons learnt at LinkedIn Sept 28, 2017 Shirshanka Das, Principal Staff Engineer, LinkedIn Tushar Shanbhag, Head of Data Products, LinkedIn @shirshanka, @tusharis
  2. Data Protection in a Digital World PLAYING CATCH-UP WITH INNOVATION GDPR
  3. OUR VISION Create economic opportunity for every member of the global workforce LinkedIn’s Vision 29K schools 10M companies 11B endorsements 500M Members 10M jobs
  4. The LinkedIn Privacy Paradox “On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!” Kalinda Raina, Head of Global Privacy, LinkedIn MEMBER PRIVACY <> MEMBER DISCOVERY
  5. Members First is a Core Value for LinkedIn MEMBER PRIVACY WHILE DELIVERING MEMBER VALUE Well-connected. Get relevance right. Few connections. Give them inventory. Example Member value is proportional to knowledge Member privacy is paramount for LinkedIn We strive to maintain this fine balance
  6. Data Is the Lifeblood of LinkedIn MEMBER EXPERIENCES + BUSINESS DECISIONS Member Data System of Intelligence Member Experiences Business Decisions
  7. We needed data democracy to deliver member value LinkedIn Data Science I want to analyze as much data as possible so my models are accurate Data Democracy ALL THE DATA, ALL THE TIME I want to discover data that’s needed for my analysis as fast as possible I want to access that data as quickly as possible for my analysis

  8. I want my personal data to be stored only where needed and not propagated unnecessarily Data Protection Need to Ensure Member Privacy LinkedIn Members STORE, PROCESS, DELETE,.. I want my personal data to be deleted when I close my account or request deletion I want my personal data to only be processed if essential and only if I consent
  9. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox
  10. LinkedIn’s Data Ecosystem
  11. LinkedIn’s Data Ecosystem
  12. LinkedIn’s Data Ecosystem
  13. LinkedIn’s Data Ecosystem
  14. LinkedIn’s Data Ecosystem
  15. LinkedIn’s Data Ecosystem
  16. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox
  17. Data Hubs at LinkedIn In Motion At Rest Scale O(10) clusters ~2.3 Trillion messages ~450 TB Scale O(10) clusters ~10K machines ~100 PB
  18. In Motion At Rest Data Integration SFTP JDBC REST Azure Blob, Data Lake Storage
  19. SFTP JDBC REST Apache Gobblin: Simplifying Data Integration @LinkedIn Hundreds of TB per day Thousands of datasets ~30 different source systems 80%+ of data ingest Open source @ https://gobblin.apache.org/ Stream + Batch Adopted by LinkedIn, Intel, PayPal, Apple, IBM, Swisscom, Prezi, AppLift, NerdWallet and many more… SFTP Azure Blob, Data Lake Storage
  20. REQUIREMENTS Less Data Legal: Right to Erasure or Right to be Forgotten “Delete all my personal data without undue delay when it is no longer necessary / when consent has been withdrawn” Engineering: Need the ability to delete some specific subset or all data associated with a specific LinkedIn member from all our data systems
  21. A lot of data, different formats Challenges Understand HDFS data: organization, formats, … Cycle asynchronously, within an SLA, deleting records, without affecting running jobs Quarantine exceptional records for manual triage Can scale to processing hundreds of PB of data Data Deletion IMPLICATIONS FOR HADOOP
  22. Gobblin: The Logical Pipeline Source -> Work Units -> Tasks (each task: Extract, Convert, Quality, Write) -> Data Publish
  23. Gobblin: Extending for Purge HDFS Work Unit Data Publish Extract Convert Quality Write Task Task HDFS If needs purge then drop else continue Member’s Delete Requests
  24. STATUS AND CHALLENGES Gobblin: Data Lifecycle Management at Scale Status Number of datasets: many thousands Amount of data scanned for purge: XXX TB/day Challenges Immutable Storage Formats + Right to Erasure = Unhappy Disks “Widespread implementation will surely lead to innovation in these formats!”
  25. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox DATA LIFECYCLE MANAGEMENT
  26. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox DATA LIFECYCLE MANAGEMENT
  27. LinkedIn’s Data Ecosystem
  28. Metadata based Search Experience for Data Scientists Data Discovery Where is dataset X? How did it get created? Usage : In production since 2014 Users : Data Scientists, Product Engineers Use Cases: Discovery, Impact Analysis WhereHows FIND DATA, NAVIGATE RELATIONSHIPS Open source @ github.com/linkedin/wherehows
  29. SEARCH SCREENSHOTS WhereHows
  30. LINEAGE SCREENSHOTS WhereHows
  31. More than just Discovery Use Cases Which datasets at LinkedIn contain PII or highly confidential data? How many contain member-member messages? How many of them are accessible by team X? Have all datasets been purged within SLA? Discovering Violations ANSWERING HARDER QUESTIONS
  32. Wide + Deep Metadata Comprehensive coverage of data systems at LinkedIn We have > 20 systems! SQL, NoSQL, Indexes, Blob Stores, … Deeper understanding of each dataset Schema is not enough Need to understand semantics Discovering Violations REQUIREMENTS
  33. A METADATA REFINERY APPROACH WhereHows Architecture @ 10,000 ft ML driven refinements
  34. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox DATA LIFECYCLE MANAGEMENT METADATA
  35. METADATA DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox DATA LIFECYCLE MANAGEMENT
  36. FREEDOM OF EXPRESSION Many Transformation Engines @ LinkedIn In Motion At Rest
  37. HARD TO CHANGE ANYTHING UNDERNEATH! Challenge for Infrastructure Providers (Pig scripts) My Raw Data Native readers, dependencies on path, format hard-coded Hard to move to better formats without breaking everyone or copying data twice My Raw Data
  38. HARD TO CHANGE ANYTHING UPSTREAM! Semantic Challenges Data is unclean (bad data on certain dates) Data models are in constant flux (split event into multiple) Have to change data processing logic everywhere! My Raw Data
  39. AN API TO MANAGE EVOLUTION We need “microservices” for Data My Data API My Raw Data
  40. A DATA ACCESS LAYER FOR LINKEDIN We built Dali to solve this Logical Tables + Views Logical FileSystem Abstract away underlying physical details to allow users to focus solely on the logical concerns
  41. Dali: Implementation Details in Context Dali FileSystem Processing Engine (MR, Spark) Dali Datasets (Tables+Views) Dataflow APIs (MR, Spark, Scalding) Query Layers (Pig, Hive, Spark) Dali CLI Data Catalog Git + Artifactory View Def + UDFs Dataset Owner Data Source Data Sink
  42. Simple to Complex Different Types Basic Restrictions Access to dataset based on business need Privacy by Default Analysts shouldn’t get access to raw PII by default Consent-based Access Access to certain data elements only available if member has consented for that particular use-case Access Restrictions REQUIREMENTS
  43. STEP 1: DATA + METADATA Solving for Compliant Access Schema = { int memberId String firstName String lastName Position[] positions educationHistory[] educationHistory … } MemberProfile MEMBER_ID NAME PROFILE DATA NAME : is_pii MEMBER_ID : is_pii Raw Dataset Meta Data
  44. STEP 2: A MEMBER’S PREFERENCES Privacy Preferences
  45. A BITMAP DATASET: ONE PER MEMBER Privacy Preferences Member Privacy Preferences
  46. Solving for Compliant Access With Dali Raw Dataset Meta Data Member Privacy Preferences Dali Reader responsibility: Given: (Dataset, Metadata, UseCase) Generate: Dataset and Column-level transformations (obfuscate, null, …) Auto-join with Member Privacy Preferences (filter out data elements that are not consented to) Processing Logic Dali Reader Library Use Case = X
  47. Solving for Compliant Purging With Dali + Gobblin Raw Dataset Meta Data Member Privacy Preferences Gobblin Purger Dali Reader Library Use Case = Purge Member’s Delete Requests Purged Dataset
  48. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox DATA LIFECYCLE MANAGEMENT METADATA DATA ACCESS LAYER
  49. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox : Solved ! METADATA DATA ACCESS LAYER DATA LIFECYCLE MANAGEMENT
  50. DATA DEMOCRACY + DATA PROTECTION The Technology Blueprint WhereHows* Dali Apache Gobblin* * Open Source : We can collaborate on these together! DATA LIFECYCLE MANAGEMENT DATA ACCESS LAYER METADATA
  51. Core company value, implemented by Technology & Process Privacy By Design Privacy : Technology + Process SUSTAINABILITY IS CRITICAL Product : Security & Privacy Review Data : Data Model Review Legal : Regulation change -> Tech requirements Company-wide : “Horizontal” Initiatives
  52. Getting Stricter and more complex Data Protection Key Takeaways THE BEAST IS REAL Stricter regulations in a digital world Increasingly more complex to implement This is an accelerating global trend
  53. We’ve established a blueprint to sustainably address privacy Learnings at LinkedIn Key Takeaways THE BEAST CAN BE TAMED ! Privacy By Design : baked into technology stack & product development process Standardization : To solve at scale, certain parts need to be centralized and standardized Company-wide : Needs co-ordinated effort across various functions
  54. DATA DEMOCRACY <> DATA PROTECTION More Data Discover Data Easy Access Less Data Discover Violations Restricted Access The Data Paradox : Solved ! METADATA DATA ACCESS LAYER DATA LIFECYCLE MANAGEMENT
  55. Thank You!
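
Slides 43 to 46 describe compliant access as a reader that takes (Dataset, Metadata, UseCase), applies column-level transformations, and joins against a per-member bitmap of privacy preferences. The following is a minimal Python sketch of that idea, not Dali's actual API. Every name in it (`compliant_read`, `METADATA`, `USE_CASE_BITS`, the use-case labels, the bit layout) is hypothetical, and it simplifies consent filtering by dropping the whole record rather than individual data elements.

```python
import hashlib

# Hypothetical column metadata in the spirit of slide 43 (NAME: is_pii, MEMBER_ID: is_pii).
METADATA = {
    "memberId": {"is_pii": True},
    "firstName": {"is_pii": True},
    "lastName": {"is_pii": True},
    "positions": {"is_pii": False},
}

# Hypothetical mapping from use case to a bit position in a member's preference bitmap.
USE_CASE_BITS = {"ads_targeting": 0, "research": 1}


def obfuscate(value):
    """Replace a value with a short one-way hash (one possible column transformation)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def consented(bitmap, use_case):
    """Return True if the bit assigned to this use case is set in the bitmap."""
    return bool((bitmap >> USE_CASE_BITS[use_case]) & 1)


def compliant_read(records, metadata, preferences, use_case, id_column="memberId"):
    """Yield only consented records, with PII columns obfuscated.

    A member with no preference entry is treated as not consenting, and a
    column with no metadata entry is treated as PII (privacy by default).
    """
    for record in records:
        bitmap = preferences.get(record[id_column], 0)
        if not consented(bitmap, use_case):
            continue  # member has not consented to this use case
        yield {
            column: obfuscate(value)
            if metadata.get(column, {"is_pii": True})["is_pii"]
            else value
            for column, value in record.items()
        }
```

Keeping the consent check and the column transformations inside one reader function mirrors the point of slide 46: processing logic asks for data under a use case and never handles the raw records directly.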

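
The purge step on slide 23 ("if needs purge then drop else continue"), together with the quarantine requirement on slide 21, can be sketched as a single pass over records. This is an illustrative Python sketch under stated assumptions, not Gobblin's converter interface: the function name `purge`, the `memberId` key, and the choice to quarantine records that lack a member id are all hypothetical.

```python
def purge(records, delete_requests, id_column="memberId"):
    """Split records into kept and quarantined lists and count dropped ones.

    delete_requests is a set of member ids whose data must be removed.
    Records that cannot be attributed to a member are quarantined for
    manual triage instead of being silently kept or dropped.
    """
    kept, quarantined = [], []
    dropped = 0
    for record in records:
        member_id = record.get(id_column) if isinstance(record, dict) else None
        if member_id is None:
            quarantined.append(record)  # exceptional record: needs manual triage
        elif member_id in delete_requests:
            dropped += 1  # needs purge: drop
        else:
            kept.append(record)  # otherwise continue down the pipeline
    return kept, quarantined, dropped
```

Because the function returns a new list instead of editing in place, the purged output can be written and published as a replacement dataset. That fits the immutable storage formats mentioned on slide 24, at the cost of rewriting the data.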