OSDC 2019 | Automating Security in Your Data Pipeline by Troy Harvey

Carta helps companies manage and secure their cap tables and equity plans: highly sensitive data. In a post-GDPR world, data engineers play a critical role in protecting data and limiting access at each step in a data pipeline. In this session, Troy will walk through the steps that Carta’s data team has taken to secure the data pipeline using open source tools. You will leave with a checklist of things to consider when building a data lake or data warehouse, or when deploying a data orchestration system. Some of the technologies covered include Apache Airflow, dbt, Docker, S3, Redshift, and Looker. Become a better steward of your customers’ data.

OSDC 2019 | Automating Security in Your Data Pipeline by Troy Harvey

  1. Automating Data Pipeline Security
  2. 1 2 4 3 5 6
  3. Carta’s Data Team is Hiring 🎉
  4. Automating Data Pipeline Security
  5. Automating Data Pipeline Security Privacy
  6. 3 Big Ideas: 1. Privacy has a strange history. 2. Privacy-first systems are designed by people with a professional ethic. 3. Privacy can be automated away. Automating security in your data pipeline privacy
  7. 1. Strange History of Privacy
  8. “The actio iniuriarum was, in Roman law, a delict which served to protect the non-patrimonial aspects of a person's existence – who a person is rather than what a person has.”
  9. ©1979 "The Invention of the Right to Privacy" by Dorothy J. Glancy
  10. 2. Privacy-first Ethic
  11. Software is eating the world.
  12. “Audit defensibility is too low a bar when it comes to our customer’s privacy.”
  13. Privacy Regulation. GDPR (EU General Data Protection Regulation): right of access; pseudonymisation; right of erasure; records of processing activities; privacy by design. CCPA (California Consumer Privacy Act): know what personal information is being collected; right to erasure; know whether their personal information is being shared, and if so, with whom; opt-out of the sale of their personal information.
  14. 3. Automate Privacy
  15. “The security posture of your weakest vendor is the security posture of your entire organization.”
  16. Blank Slide
  17. Apache Airflow, the workflow manager from Airbnb. ● Airflow DAGs to move data into S3 and Redshift ● DAG: Directed Acyclic Graph ● Operator/Task: a node in the graph ● Airflow runs dbt (see the Airflow sketch after the slide list)
  18. Dockerized Airflow. ● Open-source boilerplate for running Apache Airflow in Docker ● Used at Carta
  19. Stale Blacklist: automating the blacklist updates. How do we keep up with the sensitive columns being added in source data? (see the column-audit sketch after the slide list)
  20. dbt test: automated data tests. ● dbt tests fail when the result set is not empty. ● The records returned by dbt test are the offending records. (see the data-test sketch after the slide list)
  21. dbt test: automated data tests. ● dbt tests fail when the result set is not empty. ● The records returned by dbt test are the offending records.
  22. Automating Access: tools for requesting and granting access. We have a custom access management system called Gatekeeper.
  23. Terraform Modules: automate data lake access. This example uses our custom IAM Service Account Terraform module to create a new Revenue Service account user with access to a single S3 data lake bucket. (see the service-account sketch after the slide list)
  24. Data Warehouse Migrations. ● sql-migrate: excellent CLI and migrations library written in Go. ● Extended to support Jinja templating. We can rebuild the warehouse from code. (see the migration-templating sketch after the slide list)
  25. Pseudonymity: disguised identity or “false name”. ©2019 Alex Ewerlöf, "GDPR pseudonymization techniques"
  26. Pseudonymity: Obfuscation (scrambling or mixing up data). 👍 Easy to do in any language. 👍 No impact on downstream systems. 👎 Can be unscrambled. (see the scrambling sketch after the slide list)
  27. Pseudonymity: Masking (obscure part of the data). 👍 Simple. 👍 Owner can verify the last 4 digits. 👎 Some pieces of the real data are stored. (see the masking sketch after the slide list)
  28. Pseudonymity: Tokenization (replace real data with fake data). 👍 Popular libraries like Faker. 👍 All original data is replaced. 👎 No way to recover the original data. (see the tokenization sketch after the slide list)
  29. Pseudonymity: Blurring (blur a subset of the data). 👍 95% of this image is left unblurred. 👎 Possible to reverse blurring. (see the blurring sketch after the slide list)
  30. Pseudonymity: Encryption (two-way transformation of the data). 👍 The original data can be recovered. 👍 Manage fewer permissions downstream. 👎 Asymmetric vs symmetric trade-offs. (see the encryption sketch after the slide list)
  31. AWS Key Management Service: automate key creation and rotation. ● Generate a new data key for encrypting and decrypting data protected by a master key. ● Or manually rotate the master key and re-encrypt the data. (see the KMS sketch after the slide list)
  32. Encrypted Columns: Postgres pgcrypto. ● pgcrypto allows us to encrypt sensitive columns before the data lands in our S3 data lake. ● This example encrypts the birth_date column in Postgres. (see the pgcrypto sketch after the slide list)
  33. “Last Mile” Decryption: decrypt sensitive data at query time. ● Access to encrypted columns is limited to analysts with the encryption key. ● This example decrypts the birth_date column in Redshift.
  34. Encrypted Column Problems. Some things to consider: 1. Symmetric or asymmetric encryption scheme? 2. Should we manually rotate our master key? 3. How many keys should we use, and how should they be organized? 4. Should our analysts and data scientists need to think about keys? 5. When and how do we re-encrypt data? When an employee with access to keys leaves the company?
  35. 3 Big Ideas: 1. Privacy has a strange history. 2. Privacy-first systems are designed by people with a professional ethic. 3. Privacy can be automated away. Automating security in your data pipeline privacy
  36. carta.com/jobs @troyharvey troy.harvey@carta.com
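
Slide 17 describes Airflow DAGs that land data in S3 and Redshift and then have Airflow run dbt. Below is a minimal sketch of that DAG shape, assuming Airflow 1.10-era imports; the DAG id, the task callables, and the dbt command are hypothetical placeholders, not Carta's actual pipeline.

```python
# Hypothetical DAG mirroring the slide's pipeline shape: extract -> S3 -> Redshift -> dbt.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "start_date": datetime(2019, 1, 1),
}

dag = DAG(
    "secure_pipeline",          # hypothetical DAG id
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
)

def extract_to_s3(**context):
    """Pull source data and write it to the S3 data lake (placeholder)."""
    pass

def load_to_redshift(**context):
    """COPY the staged S3 objects into Redshift (placeholder)."""
    pass

extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3,
                         provide_context=True, dag=dag)
load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift,
                      provide_context=True, dag=dag)

# "Airflow runs dbt": shell out to the dbt CLI after the load finishes.
run_dbt = BashOperator(task_id="run_dbt", bash_command="dbt run --profiles-dir /opt/dbt", dag=dag)

extract >> load >> run_dbt
```

The `>>` chaining gives the extract, load, transform ordering the slide implies; real tasks would pull credentials from Airflow connections rather than hard-coding them.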
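
Slide 19 asks how to keep the blacklist of sensitive columns from going stale as new columns appear in source data. One way to automate the check is to diff information_schema against the blacklist on a schedule; in the sketch below the connection string, blacklist entries, and the name-based heuristic are all assumptions for illustration.

```python
# Hypothetical audit: flag columns that look sensitive but are not yet blacklisted.
import psycopg2

# Columns already covered by the blacklist (in practice this would live in version control).
BLACKLIST = {("public", "users", "birth_date"), ("public", "users", "ssn")}

# Naive name-based heuristic for "probably sensitive"; tune for your own schemas.
SENSITIVE_HINTS = ("ssn", "birth", "email", "phone", "address", "tax")

conn = psycopg2.connect("dbname=warehouse host=example-redshift user=auditor")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT table_schema, table_name, column_name
        FROM information_schema.columns
        WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
        """
    )
    unreviewed = [
        (schema, table, column)
        for schema, table, column in cur.fetchall()
        if any(hint in column.lower() for hint in SENSITIVE_HINTS)
        and (schema, table, column) not in BLACKLIST
    ]

# Fail the pipeline (or page someone) when new sensitive-looking columns appear.
if unreviewed:
    raise RuntimeError(f"Unreviewed sensitive-looking columns: {unreviewed}")
```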
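
Slides 20 and 21 describe dbt's data-test behaviour: a test is a query for offending rows, and it fails when the result set is not empty. In dbt itself this would be a SQL file in the project's test paths; the Python sketch below only mirrors that fail-on-non-empty semantics against a hypothetical table and rule.

```python
# The same fail-on-non-empty-result idea that dbt test uses, sketched in plain Python.
import psycopg2

FAILING_ROWS_SQL = """
    SELECT id, email
    FROM analytics.users
    WHERE email IS NOT NULL   -- plain-text email should never reach this schema
"""

conn = psycopg2.connect("dbname=warehouse host=example-redshift user=ci")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute(FAILING_ROWS_SQL)
    offenders = cur.fetchall()

if offenders:
    # The returned records are exactly the offending records, as slides 20-21 note.
    raise AssertionError(f"{len(offenders)} rows violate the privacy test: {offenders[:5]}")
```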
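
Slide 23 refers to a Carta-internal Terraform module, which is not reproduced in the deck. As a rough equivalent, the boto3 sketch below creates a service-account IAM user whose inline policy is scoped to a single S3 data lake bucket; the user, bucket, and policy names are hypothetical.

```python
# Sketch of the slide's idea in boto3: a service account that can read exactly one bucket.
import json
import boto3

iam = boto3.client("iam")
bucket = "example-data-lake"

iam.create_user(UserName="revenue-service")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
        }
    ],
}

iam.put_user_policy(
    UserName="revenue-service",
    PolicyName="revenue-service-data-lake-read",
    PolicyDocument=json.dumps(policy),
)
```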
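
Slide 24 mentions extending sql-migrate with Jinja templating so the warehouse can be rebuilt from code. The extension itself is not shown in the talk; the sketch below only illustrates the rendering step with jinja2, producing a plain SQL migration file. The schema and group names are hypothetical, and the `-- +migrate` markers follow sql-migrate's up/down convention.

```python
# Render a Jinja-templated migration into plain SQL that a migration tool can apply.
from jinja2 import Template

MIGRATION_TEMPLATE = """
-- +migrate Up
CREATE SCHEMA IF NOT EXISTS {{ schema }};
GRANT USAGE ON SCHEMA {{ schema }} TO GROUP {{ readonly_group }};

-- +migrate Down
DROP SCHEMA {{ schema }} CASCADE;
"""

rendered = Template(MIGRATION_TEMPLATE).render(schema="analytics", readonly_group="analysts_ro")
with open("migrations/0002_analytics_schema.sql", "w") as f:
    f.write(rendered)
```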
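
Slide 26's obfuscation is scrambling or mixing up data. A tiny character-scrambling sketch; as the slide warns, this can be unscrambled, so it disguises rather than protects.

```python
# Shuffle the characters so the value is disguised but keeps its length and alphabet.
import random

def scramble(value: str, seed: int = 42) -> str:
    chars = list(value)
    random.Random(seed).shuffle(chars)   # fixed seed: the same input scrambles the same way
    return "".join(chars)

print(scramble("jane.doe@example.com"))
```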
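
Slide 27's masking keeps a small, verifiable piece of the value (the last 4 digits) and obscures the rest; note that some real data is still stored.

```python
# Mask everything except the trailing characters the owner can recognise.
def mask(value: str, keep: int = 4, mask_char: str = "*") -> str:
    return mask_char * max(len(value) - keep, 0) + value[-keep:]

print(mask("4111111111111111"))  # ************1111
```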
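
Slide 28 names Faker as a popular library for tokenization: replacing real values with realistic fake ones, with no way back to the originals. A sketch with a hypothetical record follows.

```python
# Replace real PII fields with generated fakes; the mapping back to the original is never kept.
from faker import Faker

fake = Faker()

record = {"name": "Jane Doe", "email": "jane.doe@example.com", "ssn": "123-45-6789"}
# name, email and ssn happen to match Faker provider methods, so we can look them up by field name.
tokenized = {field: getattr(fake, field)() for field in record}
print(tokenized)
```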
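
Slide 29 blurs only a subset of the data, using an image in which most of the picture stays readable. A Pillow sketch of blurring one region; the file name and region coordinates are hypothetical.

```python
# Blur only the region of the image that contains sensitive information.
from PIL import Image, ImageFilter

img = Image.open("cap_table_screenshot.png")      # hypothetical input
box = (120, 40, 360, 90)                          # hypothetical region containing PII
region = img.crop(box).filter(ImageFilter.GaussianBlur(radius=12))
img.paste(region, box)
img.save("cap_table_screenshot_blurred.png")
```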
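
Slide 30 treats encryption as a two-way transformation. Below is a symmetric sketch using the cryptography package's Fernet recipe; in practice the key would come from a key management service rather than being generated inline.

```python
# Symmetric, reversible pseudonymisation: ciphertext downstream, plaintext only with the key.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # sketch only; a real key would come from KMS or a secret store
fernet = Fernet(key)

token = fernet.encrypt(b"1985-03-14")            # e.g. a birth date
assert fernet.decrypt(token) == b"1985-03-14"    # the original data can be recovered
```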
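
Slide 31 covers AWS KMS. The boto3 sketch below asks KMS for a fresh data key protected by a master key, matching the slide's "data key ... protected by a master key"; the key alias is hypothetical.

```python
# Envelope encryption: KMS returns a plaintext data key (use in memory, then discard)
# and an encrypted copy of the same key (store next to the data, decrypt later via KMS).
import boto3

kms = boto3.client("kms")
resp = kms.generate_data_key(KeyId="alias/data-lake", KeySpec="AES_256")

plaintext_key = resp["Plaintext"]        # encrypt the data locally with this, keep it only in memory
encrypted_key = resp["CiphertextBlob"]   # persist alongside the data; recover with kms.decrypt()
```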
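
Slide 32 encrypts the birth_date column with pgcrypto before the data lands in the S3 data lake. The slide's actual SQL is not reproduced here; the sketch below runs a pgp_sym_encrypt query from Python against a hypothetical users table, with a placeholder key that would really come from a secret store.

```python
# Encrypt the sensitive column during extraction so only ciphertext reaches the data lake.
import psycopg2

conn = psycopg2.connect("dbname=app host=example-postgres user=etl")  # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto;")
    cur.execute(
        """
        SELECT id,
               pgp_sym_encrypt(birth_date::text, %(key)s) AS birth_date_enc
        FROM users
        """,
        {"key": "example-symmetric-key"},   # placeholder; load the real key from a secret store
    )
    rows = cur.fetchall()  # ciphertext rows, ready to be written to the S3 data lake
```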
