Gobblin is a data integration framework that handles both batch and streaming data. It provides a logical pipeline specification that is independent of the underlying execution model, so the same pipeline can run in batch mode for cost-efficient processing or in streaming mode for low latency. This deck covers Gobblin's pipeline constructs, pipeline specification, deployment options, and roadmap, including more streaming runtimes and security improvements.
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017
1. The Data Driven Network
Kapil Surlaker, Director of Engineering
Bridging Batch and Streaming Data Integration with Gobblin
Shirshanka Das, Gobblin team
26th Apr, 2017, Big Data Meetup
github.com/linkedin/gobblin
@ApacheGobblin
gitter.im/gobblin
2. Data Integration: key requirements
Source, Sink Diversity
Batch + Streaming
Data Quality
So, we built Gobblin
3. Simplifying Data Integration @LinkedIn
(Diagram of sources and sinks: SFTP, JDBC, REST, Azure Storage, …)
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…
Apache incubation under way
4. Other Open Source Systems in this Space
Sqoop, Flume, Falcon, NiFi, Kafka Connect
Flink, Spark, Samza, Apex
Similar in pieces, dissimilar in aggregate
Most are tied to a specific execution model (batch / streaming)
Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.)
7. WorkUnit
A logical unit of work, typically bounded but not necessarily.
Kafka Topic: LoginEvent, Partition: 10, Offsets: 10-200
HDFS Folder: /data/Login, File: part-0.avro
Hive Dataset: Tracking.Login, date-partition=mm-dd-yy-hh
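To make this concrete, here is a minimal sketch of a WorkUnit as a property bag. It uses plain java.util.Properties rather than Gobblin's actual WorkUnit class, and the key names are invented for the example:

import java.util.Properties;

// Illustration only: a WorkUnit is logically a bag of properties describing
// one slice of the source (plain java.util.Properties here, not Gobblin's
// WorkUnit class; the key names are made up for this sketch).
public class WorkUnitSketch {
  public static void main(String[] args) {
    Properties workUnit = new Properties();
    workUnit.setProperty("kafka.topic", "LoginEvent");
    workUnit.setProperty("kafka.partition", "10");
    workUnit.setProperty("kafka.start.offset", "10");
    workUnit.setProperty("kafka.end.offset", "200"); // absent for an unbounded (streaming) unit
    System.out.println(workUnit);
  }
}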
9. Task: A unit of execution that operates on a WorkUnit
Extracts records from the source, writes them to the destination
Ends when the WorkUnit is exhausted of records
(assigned to a Thread in a ThreadPool, a Mapper in Map-Reduce, etc.)
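A conceptual sketch of the loop a Task runs over its WorkUnit, with stand-in interfaces rather than Gobblin's exact signatures (the real runtime adds forking, quality checking, watermarking, and error handling):

// Conceptual only: the extract-convert-write loop at the heart of a Task.
// These interfaces are stand-ins, not Gobblin's exact signatures.
interface RecordSource<D> { D readRecord(); }                    // stand-in for Extractor
interface RecordMapper<D> { Iterable<D> convert(D record); }     // stand-in for the Converter chain
interface RecordSink<D> { void write(D record); void commit(); } // stand-in for Writer

class TaskSketch<D> {
  void run(RecordSource<D> extractor, RecordMapper<D> converters, RecordSink<D> writer) {
    D record;
    while ((record = extractor.readRecord()) != null) { // null: WorkUnit exhausted
      for (D out : converters.convert(record)) {        // 1:N conversion
        writer.write(out);
      }
    }
    writer.commit(); // batch commits at the end; streaming checkpoints periodically
  }
}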
10. Extractor: A provider of records given a WorkUnit
Connects to Data Source
Deserializer of records
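A minimal custom extractor, sketched against the Extractor interface in gobblin-api (signatures recalled from the 0.x line, so treat them as approximate). A real extractor would read from the source described by its WorkUnit:

import java.io.IOException;
import java.util.Iterator;

import gobblin.source.extractor.Extractor;

// Illustrative only: serves records from an in-memory iterator. A real
// extractor would hold a connection to the source named by its WorkUnit
// (Kafka consumer, JDBC cursor, HDFS file reader, ...).
public class InMemoryExtractor implements Extractor<String, String> {
  private final Iterator<String> records;
  private long count = 0;

  public InMemoryExtractor(Iterator<String> records) {
    this.records = records;
  }

  @Override
  public String getSchema() {
    return "string"; // the schema type is source-specific
  }

  @Override
  public String readRecord(String reuse) {
    if (!this.records.hasNext()) {
      return null; // null signals the WorkUnit is exhausted
    }
    this.count++;
    return this.records.next();
  }

  @Override
  public long getExpectedRecordCount() {
    return this.count;
  }

  @Override
  public long getHighWatermark() {
    return this.count; // position to checkpoint for the next run
  }

  @Override
  public void close() throws IOException {
    // release the source connection here
  }
}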
11. Converter: A 1:N mapper of input records to output records
Multiple converters can be chained
(e.g. Avro <-> JSON, schema projection, encryption)
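A sketch of a 1:N converter in the spirit of the SamplingConverter used in the Kafka-to-Kafka spec later in this deck; this is illustrative, not that class's actual source, and the Converter method shapes are approximate:

import java.util.Collections;
import java.util.Random;

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;

// Illustrative sampling converter: returning an empty Iterable drops a
// record, returning several fans one record out into many, hence "1:N".
public class MySamplingConverter extends Converter<String, String, byte[], byte[]> {
  private final Random random = new Random();
  private double ratio = 1.0;

  @Override
  public MySamplingConverter init(WorkUnitState workUnit) {
    this.ratio = workUnit.getPropAsDouble("converter.sample.ratio", 1.0);
    return this;
  }

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit) {
    return inputSchema; // schema passes through unchanged
  }

  @Override
  public Iterable<byte[]> convertRecord(String schema, byte[] record, WorkUnitState workUnit) {
    return this.random.nextDouble() < this.ratio
        ? Collections.singleton(record)      // keep the record
        : Collections.<byte[]>emptyList();   // drop it
  }
}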
12. Quality Checker: Checks whether the quality of the output is satisfactory
Row-level (e.g. time value check)
Task-level (e.g. audit check, schema compatibility)
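As an illustration of a row-level time-value check (plain Java, not Gobblin's actual policy API):

import java.util.concurrent.TimeUnit;

// Illustration only, not Gobblin's policy API: a row-level check that
// rejects records whose event timestamp is in the future or more than
// seven days older than ingestion time.
public class TimeValueCheck {
  private static final long MAX_AGE_MS = TimeUnit.DAYS.toMillis(7);

  public boolean passes(long eventTimestampMs) {
    long ageMs = System.currentTimeMillis() - eventTimestampMs;
    return ageMs >= 0 && ageMs <= MAX_AGE_MS;
  }
}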
13. Writer: Writes to the destination
Connection to the destination, Serializer of records
Sync / Async
e.g. FsWriter, KafkaWriter, CouchbaseWriter
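A minimal synchronous writer sketched against Gobblin's DataWriter interface (method set approximate); real writers such as KafkaWriter hold a destination connection and may complete writes asynchronously:

import java.io.IOException;

import gobblin.writer.DataWriter;

// Minimal synchronous writer: "serializes" records to stdout. A real
// writer holds a destination connection (file handle, Kafka producer, ...).
public class StdoutWriter implements DataWriter<byte[]> {
  private long records = 0;
  private long bytes = 0;

  @Override
  public void write(byte[] record) {
    System.out.write(record, 0, record.length);
    this.records++;
    this.bytes += record.length;
  }

  @Override
  public void commit() {
    System.out.flush(); // make buffered writes durable
  }

  @Override
  public void cleanup() {
    // remove staging state, if any
  }

  @Override
  public long recordsWritten() {
    return this.records;
  }

  @Override
  public long bytesWritten() {
    return this.bytes;
  }

  @Override
  public void close() throws IOException {
    // release the destination connection
  }
}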
14. Publisher: Finalizes / Commits the data
Used for destinations that support atomicity
(e.g. move tmp staging directory to final output directory on HDFS)
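The atomic-publish pattern the Publisher relies on, sketched with the Hadoop FileSystem API (illustrative, not Gobblin's DataPublisher source): tasks write into a staging directory, and publishing is a single rename, which HDFS performs atomically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of atomic publication on HDFS: all task output lands in a staging
// directory, and the publish step is one rename into the final location.
public class StagingPublisher {
  public void publish(Configuration conf, String staging, String output) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path stagingDir = new Path(staging); // e.g. /data/Login/_staging
    Path outputDir = new Path(output);   // e.g. /data/Login/2017-04-26
    if (!fs.rename(stagingDir, outputDir)) {
      throw new RuntimeException("Publish failed: could not rename " + stagingDir + " to " + outputDir);
    }
  }
}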
21. Gobblin: Pipeline Deployment
One Spec, Multiple Environments: the same Pipeline Specification runs unchanged across deployment targets
One Box: Standalone, Single Instance
Static Cluster: Standalone Cluster, Hadoop (YARN / MR)
Elastic Cluster: AWS (EC2)
Small / Medium / Large, on Bare Metal / AWS / Azure / VM
23. Execution Model: Batch versus Streaming
Streaming
Determine work streams, run continuously, checkpoint periodically
+ Low latency
- Higher cost, because it is harder to provision capacity accurately
- More sophistication needed to deal with change
31. A Streaming Pipeline Spec: Kafka 2 Kafka

# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling

# Pipeline name, description
job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false

# Source, configuration
source.class=gobblin.source….KafkaSimpleStreamingSource
gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Converter, configuration: sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

# Writer, configuration
writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

# Publisher
data.publisher.type=gobblin.publisher.NoopPublisher
task.executionMode=STREAMING
37. Active Workstreams in Gobblin
Gobblin as a Service
Global orchestrator with a REST API for submitting logical flow specifications
Logical flow specifications compile down to physical pipeline specs
Global Throttling
Throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode, etc.)
Generic: can be used outside Gobblin
Metadata Driven
Integration with Metadata Service (cf. WhereHows)
Policy-driven replication, permissions, encryption, etc.
38. Roadmap
Final LinkedIn Gobblin 0.10.0 release
Apache Incubator code donation and release
More streaming runtimes
Integration with Apache Samza, LinkedIn Brooklin
GDPR Compliance: Data purge for Hadoop and other systems
Security improvements
Credential storage, Secure specs