GDPR is the current focus of all organizations working with data in Europe. Svenska Spel are using Atlas and Ranger as an essential part to make our
data warehouse to conform to GDPR.
This presentation will show how Atlas and Ranger, together with Hive are used in practice to control access and masking of data in our data warehouse. We will demonstrate how we with a model driven development process, using data vault modeling, have extended our model with metadata. From the metadata we generate our access control configuration. The configuration is automatically synchronized to Atlas and Ranger in our deployment process.
Speaker
Magnus Runesson, Data Engineer, Svenska Spel
4. 42018-04-25
• This talk does not cover all we have done around GDPR
• This is NOT a way to say if you do this you are GDPR compliant.
• Some details are left out or simplified
Disclaimer
5. 52018-04-25
• Why?
• Svenska Spel’s data warehouse
• Atlas & Ranger
• How did we implement it?
• Experiences and conclusions
Agenda
6. 62018-04-25
GDPR requires
• clear purpose for PII data
• privacy by design
• clear consent or legal ground
• not to use/store PII if not needed
• people own their own data.
• penalty if not followed
Why?
7. 72018-04-25
• Our customers and partners integrity is protected
• Users have only access to data aimed for current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Goals
9. 92018-04-25
• Moved from classic Cognos + Oracle
• HDP 2.6 using Hive
• Includes Personal Identifiable Information (PII)
• 300+ event streams in
• 150 published tables and views
Svenska Spel’s data warehouse
10. 102018-04-25
• Used data are
• Understood
• Documented
• Modelled
• Modelled with Data Vault
• Oracle SQL Developer Data Modeler
• SQL code generated from model
Model based development
11. 112018-04-25
• History tracking
• Uniquely linked
• Pattern based
• Easy to generate code
Data Vault
Link
Hub
Hub
Satellite
Satellite Satellite
14. 142018-04-25
• Metadata about resources
• Resource is
• Table
• Column
• Schema
• File on HDFS
• …
• Lineage
Apache Atlas
15. 152018-04-25
• Tags have no meaning themselves
• Your business vocabulary define the meaning
• Example of tags:
• Business entity owning the data
• Indication of sensitive data
• The rules in Ranger enforces the policy
• Separate metadata from policy implementation
Atlas tags
PII
16. 162018-04-25
• Is user U allowed to do operation O on resource R?
• Access
• Row based filtering
• Masking
• Audit logging
• Resources referred with tags
Apache Ranger
18. 182018-04-25
customer
Customer_id Name Postal_code Has_phone Marketing
1 Steve 12345 False False
2 Bill 54321 True False
3 Paul 54672 False True
PII_table
PII
Add PII tags on table and columns in Atlas.
No behaviour change.
PII
19. 192018-04-25
customer
Customer_id Name Postal_code Has_phone Marketing
17 ABC 12345 False False
42 DEF 54321 True False
13 BDE 54672 False True
PII
We set a rule in Ranger to mask PII columns
Analyst view
PII_table
PII
25. 252018-04-25
• Hand coded of rules per tag
• Policy tool applies rule on all tables with the tag
• Can be different rules for different users
• Filter gets appended to where condition by Ranger
• Used for
• Row based filtering (access)
• Masking (anonymization)
• Catch all rule to deny access to tables not in our model
Ranger rules
28. 282018-04-25
• Makes it easy to manage
• Atlas tags
• Ranger policy rules
• Command line tool
• Consumes tags and policy CSV files
• Calls Atlas and Ranger API
• Less than 1000 rows of Python
Policytool
Policytool
CSV
33. 332018-04-25
• Simple and easy model
• Limited performance penalty
• Tag on table with masking rule => all columns masked
• Hard to understand API doc
• Restriction on Ranger row based filtering (not on tags)
• Row based filtering and masking not on direct file access
Experiences of Atlas and Ranger
34. 342018-04-25
• Our customers and partners integrity is protected
• Users have only access to data aimed for current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Reached Goals
35. 352018-04-25
• Goals reached
• No SQL changes
• Scale when new datasets added
• Our data model is guaranteed in sync
• Better comments in Hive
• Minimal impact on ETL developers workflow
Conclusions
36. 362018-04-25
• Make it as simple as possible
• Automate
• Know your tool
• Be clear on your authorization model
• Know your data
Takeaways