Gobblin is used at LinkedIn to ingest data from three Kafka clusters, spanning over 2,500 topics and more than 50 billion records (over 15 TB) per day, into related systems such as copycat. It supports at-least-once, at-most-once, and exactly-once semantics for record publishing and checkpoint persistence. To balance load across mappers, Gobblin uses both normal and two-level bin packing, and it supports deduplicated and non-deduplicated compaction, with options for handling late-arriving events. More information is available on Gobblin's GitHub page and user forum.
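The bin-packing idea mentioned above can be illustrated with a minimal sketch: given estimated workloads per topic partition, a greedy largest-first strategy assigns each partition to the currently least-loaded mapper. This is an illustrative approximation, not Gobblin's actual implementation; the function name, the per-partition load estimates, and the fixed mapper count are all assumptions for the example.

```python
from typing import Dict, List


def pack_partitions(loads: Dict[str, int], num_mappers: int) -> List[List[str]]:
    """Greedy bin packing (illustrative, not Gobblin's code):
    assign each partition, largest workload first, to the mapper
    with the smallest total load so far."""
    bins: List[List[str]] = [[] for _ in range(num_mappers)]
    totals = [0] * num_mappers
    for partition, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        i = totals.index(min(totals))  # least-loaded mapper
        bins[i].append(partition)
        totals[i] += load
    return bins


# Hypothetical per-partition record counts for two topics
loads = {"topicA-0": 9, "topicA-1": 7, "topicB-0": 5, "topicB-1": 3}
assignment = pack_partitions(loads, num_mappers=2)
```

With these numbers the two mappers end up with equal totals (9 + 3 and 7 + 5), which is the property the packing aims for; a two-level variant would apply a similar split first across containers and then across mappers within each container.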