Design of a lightweight set of data pipelines to scrub PII from data.
Scrubbing PII makes data far easier to share.
It also lets organisations confidently push data outside the organisation for large-scale analytics on the cloud.
2. Benefits of masking and tokenization of PII data
1. Ease of access for the data analysts.
2. Stakeholders of the various source systems feel safe in providing access to their data.
3. Ease of sharing the data.
3. Primary transformations in our scrubbing service
Building blocks for other transformations
Mask: "B68 Shastri Nagar, Bhopal, 462003" -> "xxxxx" (wipes out the information completely)
Tokenise: "Acc no: 00014590900" -> "ac8a58d5da5fd35a7f60cb906a724589e761f25e8fbea4346fd0123ea351ea51" (useful for preserving uniqueness)
Derive: "Age: 26" -> "Age 20-30" (useful for selective preservation of information)
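A minimal sketch of the three building blocks as plain Python functions. The function names, the SHA-256 choice for the tokeniser and the 10-year age bucket are illustrative assumptions, not the service's actual implementation.

import hashlib

def mask(value: str) -> str:
    # Mask: wipe out the information completely.
    return "xxxxx"

def tokenise(value: str) -> str:
    # Tokenise: the same input always yields the same token, so uniqueness (and joins) survive.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def derive(age: int) -> str:
    # Derive: keep only the useful part of the information, e.g. a 10-year age bucket.
    lower = (age // 10) * 10
    return f"Age {lower}-{lower + 10}"

print(mask("B68 Shastri Nagar, Bhopal, 462003"))   # xxxxx
print(tokenise("00014590900"))                      # 64-char hex token
print(derive(26))                                   # Age 20-30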
4. First glimpse:
- We needed cron to invoke this Python script.
- It would pick up files from the disk, scrub, tokenize and write them to S3.
Initial set of ideas:
1. Python with a simple CSV reader.
2. One step ahead: maybe use pandas, which provides good support for various readers and a dataframe abstraction.
A simple scrubbing of CSVs, writing the output to S3
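A rough sketch of that first cut, assuming pandas with s3fs installed. The file paths, bucket, column names and the transforms module are hypothetical placeholders; the mask/tokenise/derive functions are the ones sketched on the previous slide.

import pandas as pd
from transforms import mask, tokenise, derive  # hypothetical module holding the building blocks sketched earlier

def scrub_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # Apply the building blocks to the PII columns of this (illustrative) dataset.
    df["address"] = df["address"].map(mask)
    df["account_no"] = df["account_no"].astype(str).map(tokenise)
    df["age"] = df["age"].map(derive)
    return df

# Invoked by cron; picks a file up from disk, scrubs it and writes it to S3.
# pandas writes straight to s3:// paths when s3fs is installed.
df = pd.read_csv("/data/incoming/accounts.csv")
scrub_dataframe(df).to_csv("s3://my-scrubbed-bucket/accounts.csv", index=False)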
5. Second glance:
We need to connect to DBs. Since we had chosen pandas, the SQLAlchemy interface let us achieve this.
We started to notice a few behaviours:
1. Too much RAM was being consumed.
2. Reads were slow.
Usual suspects:
1. Intermediate copies.
2. Loading more data than needed.
3. No lazy evaluation.
4. No parallelization.
Possible remedies:
1. Load less data, aka chunking.
2. Use efficient data types.
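A sketch of both remedies over a SQLAlchemy connection. The DSN, table, columns, chunk size and dtype choices are illustrative, and the scrubbing step is the same mask/tokenise idea as before, inlined to keep the snippet self-contained.

import hashlib
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@db-host/accounts")  # illustrative DSN

# Remedy 1: chunking -- stream the table instead of loading it all at once.
chunks = pd.read_sql("SELECT account_no, address, age FROM accounts", engine, chunksize=50_000)

for i, chunk in enumerate(chunks):
    # Remedy 2: efficient data types -- downcast wide defaults before transforming.
    chunk["age"] = pd.to_numeric(chunk["age"], downcast="integer")
    # Scrub in place (mask the address, tokenise the account number).
    chunk["address"] = "xxxxx"
    chunk["account_no"] = chunk["account_no"].map(
        lambda v: hashlib.sha256(str(v).encode()).hexdigest())
    # One part file per chunk, since S3 objects cannot be appended to.
    chunk.to_csv(f"s3://my-scrubbed-bucket/accounts/part-{i:05d}.csv", index=False)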
6. Final reckoning
1. We need to connect to a plethora of database systems.
2. We might end up ingesting various raw formats like Excel files etc.
3. Orchestration would not be as simple as cron.
4. Backfilling of huge amounts of data.
5. We started to see pandas teeter under requirements where we needed to join multiple data sources and perform aggregates before pushing the scrubbed dataset to S3.
We realised pandas won't scale. We had to migrate to Spark.
Towards the sprint end, some realisations
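The kind of job that pushed us over, sketched in PySpark. The source paths, join key and aggregates are made up for illustration; writing to s3a:// assumes the Hadoop S3 connector is configured.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-aggregate-scrub").getOrCreate()

# Two of the sources that had to be combined (names and formats are illustrative).
accounts = spark.read.parquet("/data/staged/accounts/")
txns = spark.read.csv("/data/incoming/transactions/*.csv", header=True, inferSchema=True)

joined = txns.join(accounts, on="account_no", how="left")

aggregated = (joined
              .groupBy("branch", "account_no")
              .agg(F.sum("amount").alias("total_amount"),
                   F.count("*").alias("txn_count")))

# Scrub / tokenize the PII columns here (see the later slides), then push to S3.
aggregated.write.mode("overwrite").parquet("s3a://my-scrubbed-bucket/txn-aggregates/")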
7. You do not need TBs of data to use Spark.
Using Spark does not necessarily mean you have to invest in and maintain a Spark cluster. If your scale is small, start with a smaller Spark setup. You can gradually step up as the scale of data increases.
You can bring a pet elephant and feed it accordingly.
Requirement -> Deployment mode
- One job at a time, data size in a few GBs -> Local mode (single node, local storage)
- Multiple jobs with FIFO scheduling, data size in a few GBs -> Standalone (single node, local storage)
- Multiple jobs with FIFO scheduling, data size in a few 100 GBs -> Standalone (multi-node, or containers on a big machine, local storage)
- Multiple jobs with advanced scheduling, data size in a few 100 GBs -> Cluster mode (YARN, Mesos etc. as schedulers, distributed storage)
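The same job can move between the rows of this table mostly by changing the master URL. A sketch, with hostnames and memory sizes as placeholders:

from pyspark.sql import SparkSession

# Row 1: everything on one box, one job at a time, a few GBs.
spark = (SparkSession.builder
         .appName("scrubber")
         .master("local[*]")                      # local mode, single node
         .config("spark.driver.memory", "8g")     # illustrative sizing
         .getOrCreate())

# Rows 2-3: point the same job at a standalone master (one node first, more
# workers or containers later); jobs then queue up with FIFO scheduling:
#   .master("spark://spark-master:7077")
#
# Row 4: hand scheduling to YARN/Mesos and move to distributed storage, e.g.
#   spark-submit --master yarn --deploy-mode cluster scrub_job.py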
8. Ways to tokenize your data
Problem statement: in place of the real keys in the data we need to put tokens; for a given key the token should always be the same, and from a token we should be able to get back to the real data.
With Spark in our arsenal, the next problem to solve.
Needs a mapping table (create / lookup): Using UUIDs
No need of a mapping table: Encryption, Hashing with salt, Hashing with pepper
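A sketch of the two families in plain Python: a UUID token is random and must be stored in a mapping table, while an encrypted or hashed token can be recomputed anywhere. The function names and the pepper value are illustrative.

import hashlib
import hmac
import uuid

PEPPER = b"secret-kept-outside-the-dataset"   # illustrative pepper (a secret, global value)

# Needs a mapping table: the token is random, so every job has to fetchOrCreate it.
token_table = {}

def uuid_token(value):
    if value not in token_table:              # "create"
        token_table[value] = uuid.uuid4().hex
    return token_table[value]                 # "lookup"

# No mapping table needed for generation: the same value always yields the same token.
def peppered_token(value):
    return hmac.new(PEPPER, value.encode(), hashlib.sha256).hexdigest()

def salted_token(value, salt):
    # The salt must stay fixed per dataset for the "same key -> same token"
    # requirement to hold.
    return hashlib.sha256((salt + value).encode()).hexdigest()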
9. Using UUIDs
1. UUIDs are attractive for generating tokens because they are completely random and have no relation to the original data, but they pose a fetchOrCreate problem between parallel Spark jobs.
Diagram: several parallel jobs doing fetchOrCreate of token(12232), token(42) and token(100) against the same lookup table (a cached table or a table).
10. Making a choice
Client needs:
- A lookup table is a must and should be encrypted; this gives them more control over the tokenized information than the encryption method.
Our needs:
- We wanted to avoid the fetchOrCreate problem of multiple Spark jobs trying to look up an already generated token value.
- We wanted token creation such that two jobs arrive at the same token without any coordination.
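A sketch of how those two needs can be met together in PySpark: each job derives the token deterministically with a keyed hash, so two jobs agree without coordination, and the key-to-token pairs are also written to an encrypted lookup table for the reverse direction. The column names, the pepper config key and the mapping-table location are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tokenise").getOrCreate()
pepper = spark.conf.get("spark.scrubber.pepper")  # illustrative: secret injected at submit time

df = spark.read.parquet("/data/staged/accounts/")

# Deterministic token: two independent jobs compute the same token for the same
# account number, so there is no fetchOrCreate race on a shared table.
tokenised = df.withColumn(
    "account_token",
    F.sha2(F.concat(F.lit(pepper), F.col("account_no").cast("string")), 256),
)

# Keep the key -> token mapping for reverse lookups; it lives on encrypted storage,
# which is what the client asked for. Duplicates across runs are harmless because
# the token is deterministic.
(tokenised.select("account_no", "account_token")
          .dropDuplicates(["account_no"])
          .write.mode("append")
          .parquet("/secure/encrypted-volume/token_mapping/"))

tokenised.drop("account_no").write.mode("overwrite").parquet("s3a://my-scrubbed-bucket/accounts/")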
12. The current state of our scrubbing service
1. Airflow 1.10.10 on Docker, with CeleryExecutor, 5 workers and 1 hourly worker.
2. 48 CPU cores and 128 GB RAM.
3. Spark and Airflow on a simple docker-compose.
4. 103 data pipelines pushing data to AWS from 28 data sources.
5. Oracle holds 9,43,68,195 (about 94 million) tokens.
6. Alerting via email and a Prometheus + Grafana setup.
7. RBAC UI.
The PII gatekeeper to the cloud
13. Spark on Docker
1. Mount all the Spark-related scratch directories: check spark.local.dir, or use docker diff to figure out what other data the container is generating.
2. Watch out for zombies: killing jobs or executor retries leaves zombie processes behind, since Docker has no init system. Add tini or use docker run --init.
A glimpse of a few challenges you might face
14. And the stored PII data mapping
It’s something like this
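The slide showed the mapping as a figure; as a rough sketch (the shape is an assumption, reusing the example values from the earlier slides, not the real schema), the stored mapping pairs each original key with its token:

# Illustrative shape of the stored PII mapping, one row per original key.
token_mapping = {
    "00014590900": "ac8a58d5da5fd35a7f60cb906a724589e761f25e8fbea4346fd0123ea351ea51",
    # ... roughly 9.4 crore such rows live in the Oracle lookup table.
}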
In the initial phase of the project we found the Data Scientists struggling to get data access in the bank. Instead of database access, dumps are often provided, and any change in the dump requires another set of approvals.