Building a PII scrubbing layer
Benefits of masking and tokenization of PII data
1. Ease of access for the data analysts.
2. Stakeholders from various systems feel safe in providing access to their data.
3. Ease of sharing the data.
Primary transformations in our scrubbing service
Building blocks for other transformations
Mask: B68 Shastri Nagar, Bhopal, 462003 → xxxxx (wipes out information completely)
Tokenise: Acc no: 00014590900 → ac8a58d5da5fd35a7f60cb906a724589e761f25e8fbea4346fd0123ea351ea51 (useful for preserving uniqueness)
Derive: Age: 26 → Age 20-30 (useful for selective preservation of information)
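A rough Python sketch of the three building blocks (function names and the bucket width are illustrative, not the service's actual API):

```python
import hashlib

def mask(value: str) -> str:
    """Mask: wipe out the information completely."""
    return "xxxxx"

def tokenise(value: str, pepper: str = "s3cret") -> str:
    """Tokenise: deterministic hash, preserves uniqueness without exposing the value."""
    return hashlib.sha256((pepper + value).encode()).hexdigest()

def derive(age: int, width: int = 10) -> str:
    """Derive: keep only a coarse bucket, e.g. 26 -> '20-30'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width}"

print(mask("B68 Shastri Nagar, Bhopal, 462003"))  # xxxxx
print(derive(26))                                  # 20-30
```

The same key always tokenises to the same value, while two different keys get different tokens — which is exactly the uniqueness-preserving property the slide describes.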
First glimpse:
- We needed cron to invoke this Python script.
- It picks up files from the disk, scrubs, tokenizes and writes to S3.
Initial set of ideas:
1. Python with a simple CSV reader.
2. One step ahead: maybe we can use pandas, which provides good support for various readers and a dataframe abstraction.
A simple scrubbing of CSVs, writing the output to S3
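A minimal sketch of that first cut, using only the stdlib csv module (the column names and masking rule are made up, and the S3 upload is left out):

```python
import csv

PII_COLUMNS = {"address"}  # hypothetical set of columns to mask

def scrub_file(in_path: str, out_path: str) -> None:
    """Read a CSV, mask the PII columns, write the scrubbed copy.

    The scrubbed file would then be uploaded to S3 by a separate step.
    """
    with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in PII_COLUMNS & set(row):
                row[col] = "xxxxx"  # mask transformation from the previous slide
            writer.writerow(row)
```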
Second glance:
Since we had chosen pandas, we were able to achieve this through its sqlalchemy interface.
We started to notice a few behaviours:
1. Too much RAM was being consumed.
2. Reads were slow.
Usual suspects:
1. Intermediate copies.
2. Loading more data than needed.
3. No lazy evaluation.
4. No parallelization.
Possible remedies:
1. Load less data, aka chunking.
2. Use efficient data types.
We need to connect to DBs.
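Both remedies can be sketched with pandas itself, assuming pandas is available (the data is made up; `read_csv` supports both `chunksize` and `dtype`):

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "acc_no,age,amount\n"
    "00014590900,26,100.5\n"
    "00014590901,34,20.0\n"
)

total = 0.0
# Remedy 1: chunking — iterate instead of loading the full file into memory.
# Remedy 2: efficient dtypes — keep acc_no as string (preserving leading
# zeros), downcast the numeric columns.
for chunk in pd.read_csv(
    csv_data,
    chunksize=1,
    dtype={"acc_no": "string", "age": "int16", "amount": "float32"},
):
    total += float(chunk["amount"].sum())

print(total)  # 120.5
```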
Final reckoning
1. We need to connect to a plethora of database systems.
2. We might end up ingesting various raw formats, like Excel files.
3. Orchestration would not be as simple as cron.
4. Backfilling of huge amounts of data.
5. We started to see pandas teeter under requirements where we needed to join multiple data sources and perform aggregates before pushing the scrubbed dataset to S3.
We realised pandas won’t scale. We had to migrate to Spark.
Towards the sprint end, some realisations
You do not need TBs of data to use Spark.
Using Spark does not necessarily mean you have to invest in and maintain a Spark cluster. If your scale is small, start with a smaller Spark setup. You can gradually step up as the scale of data increases.
You can bring a pet elephant, and feed it accordingly.
Requirement | Deployment mode
One job at a time, data size in a few GBs | Local mode (single node, local storage)
Multiple jobs with resource allocation in FIFO scheduling, data size in a few GBs | Standalone (single node, local storage)
Multiple jobs with resource allocation in FIFO scheduling, data size in a few 100 GBs | Standalone (multi-node, or containers on a big machine; local storage)
Multiple jobs with resource allocation in advanced scheduling, data size in a few 100 GBs | Cluster mode (with YARN, Mesos, etc. as schedulers; distributed storage)
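For reference, those deployment modes map onto standard spark-submit master URLs (host names and the job script are illustrative):

```shell
# Local mode: one job at a time, a few GBs — everything runs in one JVM.
spark-submit --master "local[*]" scrub_job.py

# Standalone: start a master and workers first, then point jobs at the
# master; jobs are scheduled FIFO by default.
spark-submit --master spark://spark-master:7077 scrub_job.py

# Cluster mode with YARN as the scheduler, for advanced scheduling needs.
spark-submit --master yarn --deploy-mode cluster scrub_job.py
```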
Ways to tokenize your data
Problem statement: In place of the real keys in the data we need to put tokens; for a given key the token should always be the same, and given a token we should be able to tell the real data.
With Spark in our arsenal, the next problem to solve.
Needs a mapping table (create/lookup): Hashing with salt, Using UUIDs
No need of a mapping table: Encryption, Hashing with pepper
1. Though using UUIDs for generating token are better as they are completely
random and has no relation with the original data, but they pose the problem of
fetchOrCreate between parallel spark jobs.
9
Lookup Table (Cached Table or Table)
token(12232)
token(12232) token(42)
token(100)
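A small sketch of the problem in pure Python, with a dict standing in for the lookup table:

```python
import uuid

# Two parallel jobs tokenising the same key with UUIDs: without a shared
# fetchOrCreate against a lookup table, they mint different tokens for one key.
job_a = {"12232": str(uuid.uuid4())}
job_b = {"12232": str(uuid.uuid4())}
assert job_a["12232"] != job_b["12232"]  # same key, divergent tokens

# With a shared table, every job must go through fetch-or-create:
table = {}

def fetch_or_create(key: str) -> str:
    # In a real system this read-check-write must be atomic across jobs —
    # exactly the coordination the slides want to avoid.
    if key not in table:
        table[key] = str(uuid.uuid4())
    return table[key]

assert fetch_or_create("12232") == fetch_or_create("12232")
```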
Making a choice
Client needs
A lookup table is a must, and it should be encrypted; this gives them more control over the tokenized information than the encryption method does.
Our needs
We wanted to avoid the fetchOrCreate problem of multiple Spark jobs trying to look up an already generated token value.
We wanted token creation to be such that two jobs arrive at the same token without any coordination.
Using hashing with pepper
Token creation: CUSTOMER_CODE 00014590900 is hashed with a pepper into CUSTOMER_CODE_TOKEN ac8a58d5da5fd35a7f60cb906a724589e761f25e8fbea4346fd0123ea351ea51, and the token is stored in the lookup table alongside the encrypted customer code (adfa58d5da5fd35a7f60cb90sdfsd761f25e8fbea4346fd0123ea351ea51==).
Token fetch: look up the token in the table, fetch the encrypted customer code, and decrypt it to recover 00014590900.
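A rough sketch of the two flows (hypothetical names throughout; HMAC-SHA256 stands in for the hashing-with-pepper step, and base64 is only a placeholder for real encryption of the stored value):

```python
import base64
import hashlib
import hmac

PEPPER = b"a-secret-pepper"  # hypothetical; kept out of the data store

def make_token(value: str) -> str:
    """Any job holding the pepper derives the same token — no coordination needed."""
    return hmac.new(PEPPER, value.encode(), hashlib.sha256).hexdigest()

lookup = {}  # stands in for the encrypted lookup table

def create(value: str) -> str:
    """Token creation: derive the token, store it with the 'encrypted' original."""
    token = make_token(value)
    lookup[token] = base64.b64encode(value.encode()).decode()  # use a real cipher in practice
    return token

def fetch(token: str) -> str:
    """Token fetch: look up, fetch the stored value, 'decrypt' it."""
    return base64.b64decode(lookup[token]).decode()  # use a real cipher in practice

t = create("00014590900")
assert fetch(t) == "00014590900"
assert make_token("00014590900") == t  # deterministic across jobs, no fetchOrCreate
```

Because the token is a pure function of the pepper and the value, two Spark jobs that both see 00014590900 emit the same token independently, which is the coordination-free property the previous slide asks for.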
The current state of our scrubbing service
1. Airflow 1.10.10 on Docker, with CeleryExecutor: 5 workers and 1 hourly worker.
2. 48 CPU cores and 128 GB RAM.
3. Spark and Airflow on a simple docker-compose.
4. 103 data pipelines pushing data to AWS from 28 data sources.
5. Oracle holds 9,43,68,195 (≈94.4 million) tokens.
6. Alerts using emails and a Prometheus/Grafana setup.
7. RBAC UI.
The PII gatekeeper to the cloud
Spark on Docker
1. Mount all the Spark-related scratch directories. Use spark.local.dir or docker diff to figure out what other data the container is generating.
2. Watch out for zombies: killing jobs or executor retries leaves zombie processes behind, since Docker has no init system. Add tini or use docker run --init.
A glimpse of a few challenges you might face
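The fixes mentioned above look roughly like this (image name and host paths are placeholders):

```shell
# Run the Spark container with an init process so zombies get reaped.
docker run --init my-spark-image spark-submit scrub_job.py

# Or bake tini into the image as PID 1. In the Dockerfile:
#   RUN apt-get update && apt-get install -y tini
#   ENTRYPOINT ["/usr/bin/tini", "--"]

# Mount the scratch directories so shuffle spills don't bloat the container:
docker run --init \
  -v /data/spark-tmp:/tmp/spark \
  -e SPARK_LOCAL_DIRS=/tmp/spark \
  my-spark-image spark-submit scrub_job.py
```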
And the stored PII data mapping
It’s something like this
Thank You

Speaker notes

1. In the initial phase of the project we found the data scientists struggling to get data access in the bank. Instead of database access, dumps are often provided, and any change in the dump requires another set of approvals.