10. (18)
Case Study – Filtering Sensitive Data
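[Figure: job flow for filtering sensitive data. Stages shown: Source, Extractor, WorkUnit, Converter and Quality Checker, Fork and Branching, Writer (two instances), DataPublisher. A decision node "Has Sensitive Data?" has a "no" branch and a "yes" branch; the figure also shows a Sensitive Data Filtering Converter.]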
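The branching decision in this case study can be sketched in plain Java. This is only one plausible reading of the diagram, assuming the "yes" branch passes through the filtering converter and the "no" branch goes to its writer unchanged. The class `SensitiveDataFilter`, its methods, and the map-based record type are hypothetical names chosen for illustration and are not Gobblin's actual Converter or Fork API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; these names are NOT Gobblin's real Converter/Fork classes.
class SensitiveDataFilter {
    private final Set<String> sensitiveFields;

    SensitiveDataFilter(Set<String> sensitiveFields) {
        this.sensitiveFields = sensitiveFields;
    }

    // The "Has Sensitive Data?" decision: true if any field name is marked sensitive.
    boolean hasSensitiveData(Map<String, String> record) {
        for (String field : record.keySet()) {
            if (sensitiveFields.contains(field)) {
                return true;
            }
        }
        return false;
    }

    // The filtering step: returns a copy of the record without sensitive fields.
    Map<String, String> filter(Map<String, String> record) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (!sensitiveFields.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    // Assumed routing: "yes" records are filtered, "no" records pass through as-is.
    Map<String, String> route(Map<String, String> record) {
        return hasSensitiveData(record) ? filter(record) : record;
    }

    public static void main(String[] args) {
        SensitiveDataFilter f = new SensitiveDataFilter(Collections.singleton("ssn"));
        Map<String, String> record = new LinkedHashMap<>();
        record.put("id", "42");
        record.put("ssn", "123-45-6789");
        System.out.println(f.route(record));
    }
}
```

Keeping the decision (`hasSensitiveData`) separate from the transformation (`filter`) mirrors the diagram's split between the decision node and the converter, so either piece could be swapped without touching the other.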
12. (18)
State and Metadata Management
State Store
- Stores runtime metadata, e.g., checkpoints (a.k.a. watermarks)
  - Carried over between job runs
- Default implementation: serializes job/task states into files, one per run.
- Allows other implementations that conform to the interface to be plugged in.
[Figure: a State Store shared across job run #1, job run #2, and job run #3. Example labels in the figure: "SEP 2" and "SEP 3".]
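A file-backed state store with a pluggable interface could look roughly like the sketch below. The interface `SimpleStateStore`, the class `FileStateStore`, the `run-N.properties` file naming, and the `watermark` key are all assumptions made for illustration; Gobblin's real state store interface and serialization format are different.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical interface; other implementations (e.g. a database) could plug in here.
interface SimpleStateStore {
    void save(int runId, Properties state);
    Properties load(int runId);
}

// Hypothetical default implementation: one properties file per job run.
class FileStateStore implements SimpleStateStore {
    private final Path dir;

    FileStateStore(Path dir) {
        this.dir = dir;
    }

    private Path fileFor(int runId) {
        return dir.resolve("run-" + runId + ".properties");
    }

    @Override
    public void save(int runId, Properties state) {
        try {
            Files.createDirectories(dir);
            try (OutputStream out = Files.newOutputStream(fileFor(runId))) {
                state.store(out, "state for run " + runId);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns empty state when the run has no file, e.g. before the first run.
    @Override
    public Properties load(int runId) {
        Properties state = new Properties();
        Path file = fileFor(runId);
        if (!Files.exists(file)) {
            return state;
        }
        try (InputStream in = Files.newInputStream(file)) {
            state.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return state;
    }

    public static void main(String[] args) throws IOException {
        SimpleStateStore store = new FileStateStore(Files.createTempDirectory("state-store"));
        Properties run1 = new Properties();
        run1.setProperty("watermark", "SEP 2"); // value borrowed from the figure's labels
        store.save(1, run1);
        // Run #2 would start from the watermark that run #1 left behind.
        System.out.println(store.load(1).getProperty("watermark"));
    }
}
```

Because callers depend only on the interface, swapping the file-based store for another backend would not require changes to the job code that reads and writes watermarks.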
14. (18)
Running Modes
Standalone
- Runs in a single JVM; tasks run in a thread pool.
Scale-out with MapReduce
- Each job run launches an MR job, using mappers as containers to run tasks.
Scale-out with General Distributed Resource Manager (YARN, *in progress)
- Supports long-running continuous ingestion, with better resource utilization and SLA guarantees.
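The standalone mode, where tasks run in a thread pool inside a single JVM, can be illustrated with the standard `java.util.concurrent` executor API. This is a minimal sketch under the assumption that each task is an independent `Callable`; the class `StandaloneRunner` and its `runAll` method are hypothetical and do not reflect how Gobblin's own task executor is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical runner; not Gobblin's actual task executor.
class StandaloneRunner {
    // Runs every task on a fixed-size pool and returns results in submission order.
    static <T> List<T> runAll(List<Callable<T>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until all tasks have completed.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> tasks = new ArrayList<>();
        for (int i = 1; i <= 3; i++) {
            final int id = i;
            tasks.add(() -> "task " + id + " ran on " + Thread.currentThread().getName());
        }
        for (String line : runAll(tasks, 2)) {
            System.out.println(line);
        }
    }
}
```

In the scale-out modes described above, the same unit of work would instead be handed to a mapper or a resource-manager container rather than a local thread.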
15. (18)
Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
– Internal sources → HDFS
• Kafka, MySQL, Dropbox, etc.
– External sources → HDFS
• Salesforce, Google Analytics, S3, etc.
– HDFS → HDFS
• Closed member data purging
– Egress from HDFS (future work)
• Data volume
– Over a dozen data sources,
– thousands of datasets,
– tens of TBs,
… daily.