Streaming Data Integration
with Apache Kafka
About Gwen
Gwen Shapira – System Architect @Confluent
PMC @ Apache Kafka
Moving data around since 2000
Previously:
• Software Engineer @ Cloudera
• Oracle Database Consultant
Find me:
• gwen@confluent.io
• @gwenshap
The Plan
1. What is Data Integration About?
2. How have things changed?
3. What is difficult and important?
4. How do we solve these things with Kafka?
Data Integration
Making sure the right data
gets to the right places
10 years ago…
Informatica
DataStage
Manual Optimizations
5 years ago…
Today…
• Everything streaming
• Everything real-time
• Everything in-memory
• Everything containers
• Everything clouds
These Things Matter
• Reliability – Losing data is (usually) not OK.
• Exactly Once vs At Least Once
• Timeliness
• Push vs Pull
• High throughput, Varying throughput
• Compression, Parallelism, Back Pressure
• Data Formats
• Flexibility, Structure
• Security
• Error Handling
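Many of the concerns above map directly onto client configuration. A minimal sketch of a producer tuned for reliability and throughput, assuming the standard Java client (broker and topic names are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Reliability: wait for all in-sync replicas; idempotence keeps retries from creating duplicates.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        // Throughput: compress and batch.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        props.put(ProducerConfig.LINGER_MS_CONFIG, "20");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("clicks", "user-42", "page-view"));
            producer.flush();
        }
    }
}
```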
After: Stream Data Platform with Kafka
(diagram) Kafka at the center – distributed, fault tolerant, stores messages, processes streams – connecting user tracking, operational logs, operational metrics, Espresso, Cassandra, Oracle, Hadoop, log search, monitoring, the data warehouse, and applications such as search, security, and fraud detection.
Introducing
Kafka Connect
Large-scale streaming data import/export for Kafka
Overview of Connect
1. Install a cluster of Workers
2. Download / Build and install Connector Plugins
3. Use the REST API to start and configure connectors (see the example request below)
4. Connectors start Tasks. Tasks run inside Workers and copy data.
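For step 3, the worker's REST API (port 8083 by default) accepts a JSON connector definition. A sketch using Java's built-in HTTP client and the FileStreamSource connector that ships with Kafka; host, file path, and topic are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition: a name plus connector-specific configuration.
        String body = """
            {
              "name": "local-file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/var/log/app.log",
                "topic": "app-logs"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The same API also exposes endpoints to list connectors, check their status, and update their configuration.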
Questions?
Editor's Notes
  1. More types of data stores with specialized functionality – e.g. rise of NoSQL systems handling document-oriented and columnar stores. A lot more sources of data. Rise of secondary data stores and indexes – e.g. Elasticsearch for efficient text-based queries, graph DBs for graph-oriented queries, time series databases. A lot more destinations for data, and a lot of transformations along the way to those destinations. Real-time: data needs to be moved between these systems continuously and at low latency.
  2. Unfortunately, as you build up large, complex data pipelines in an ad hoc fashion by connecting different data systems that need copies of the same data with one-off connectors for those systems, or build out custom connectors for stream processing frameworks to handle different sources and sinks of streaming data, we end up with a giant, unmaintainable mess. This mess has a huge impact on productivity and agility once you get past just a few systems. Adding any new data storage system or stream processing job requires carefully tracking down all the downstream systems that might be affected, which may require coordinating with dozens of teams and code spread across many repositories. Trying to change one data source’s data format can impact many downstream systems, yet there’s no simple way to discover how these jobs are related. This is a real problem that we’re seeing across a variety of companies today. We need to do something to simplify this picture. While Confluent is working to build out a number of tools to help with these challenges, today I want to focus on how we can standardize and simplify constructing these data pipelines so that, at a minimum, we reduce operational complexity and make it easier to discover and understand the full data pipeline and dependencies.
  3. One step towards getting to a separation of concerns is being able to decouple the E, T, and L steps. Kafka, when used as shown here, can help us do that. The vision of Kafka when originally built at LinkedIn was for it to act as a common hub for real-time data. When streaming data from data stores like RDBMS or K/V store, we produce data into Kafka, making it available to as many downstream consumers as want it. Save data to other systems like secondary indexes and batch storage systems, which are implemented with consumers. Stream processing frameworks and custom consumer apps fit in by being both consumers and producers – reading data from Kafka transforming it, and then possibly publishing derived data back into Kafka. Using this model can simplify the problem as we’re now always interacting with Kafka.
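As an illustration of the consume-transform-publish-back loop described above, here is a minimal sketch using Kafka Streams (one of several ways to do this; topic names and the transformation are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EnrichPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw events from Kafka, transform them, and publish the derived stream back to Kafka.
        KStream<String, String> raw = builder.stream("page-views");
        raw.mapValues(value -> value.toUpperCase())
           .to("page-views-enriched");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```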
  4. To set some context, I want to just quickly list a few of the features that make it possible for Kafka to handle data at this scale. We’ll come back to many of these properties when looking at Kafka Connect. At its core, pub/sub messaging system rethought as distributed commit log. Based on an append-only and sequentially accessed log, which results in very high performance reading and writing data. Extends the model to a *partitioned stream* model for a single logical topic of data, which allows for distribution of data on the brokers and parallelism in both writes and reads. In order to still provide organization and ordering within a single partition, it guarantees ordering within each partition and uses keys to determine which partition to put data in. As part of its append-only approach, it decouples data consumption from data retention policy, e.g. retaining data for 7 days or until we have 1TB in a topic. This both gets rid of individual message acking and allows multiple consumption of the same data, i.e. pub/sub, by simply tracking offsets in the stream. Because data is split across partitions, we can also parallelize consumption and make it elastically scalable with Kafka’s unique automatically balanced consumer groups.
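To make the partitioning and offset model concrete: records that share a key land in the same partition, and each record gets a monotonically increasing offset within that partition. A small sketch with the standard Java producer (broker, topic, and key are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 3; i++) {
                // Same key => same partition => strictly ordered by offset within that partition.
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("user-events", "user-42", "event-" + i))
                        .get();
                System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
            }
        }
    }
}
```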
  5. But what exactly is Kafka? At a high level, it's "just" another pub/sub message queue, but a few key features make it scale to handle the requirements of a stream data platform. Multiple consumers can read the same data, and can be at different offsets in the log. Consuming data doesn't delete it from the log. Instead, Kafka uses time- or size-based retention: your data will stick around for, e.g., 7 days or until you have 100GB. This retention policy is simple and avoids having to keep accounting info for individual messages.
  6. Topics are partitioned so they can scale across multiple servers. Partitions are also replicated for fault tolerance.
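Partition count and replication factor are chosen when a topic is created, and retention is just a topic-level config. A sketch using the Java AdminClient, with illustrative sizing:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions for parallelism, 3 replicas for fault tolerance.
            NewTopic topic = new NewTopic("page-views", 12, (short) 3)
                    // Retention is a policy, not a consequence of consumption: keep data for 7 days.
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```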
  7. As I mentioned before, Kafka is multi-subscriber: the same topic can be consumed by multiple consumer groups, and each group reads its own full copy of the data. Furthermore, every consumer group can have multiple consumer processes distributed over several machines, and Kafka takes care of assigning the partitions of the subscribed topics evenly among the consumer processes in a group, so that at all times every partition of a subscribed topic is being consumed by some consumer process within the group. In addition to being easy to scale, consumption is also fault tolerant: if one consumer fails, the others automatically rebalance to pick up the load of the failed instance. So it is operationally cheap to consume large amounts of data.
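The consumer-group behaviour described here is driven entirely by group.id: every consumer that subscribes with the same group id shares the topic's partitions and participates in rebalancing. A minimal sketch (group and topic names are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detector"); // all members of this group share the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("user-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```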
  8. Today, I want to introduce you to Kafka Connect, Kafka’s new large-scale, streaming data import/export tool that drastically simplifies the construction, maintenance, and monitoring of these data pipelines. Kafka Connect is part of the Apache Kafka project, open source under the Apache license, and ships with Kafka. It’s a framework for building connectors between other data systems and Kafka, and the associated runtime to run these connectors in a distributed, fault tolerant manner at scale.
  9. Goals:
Focus – copying only.
Batteries included – the framework does all the common stuff so connector developers can focus specifically on the details that need to be customized for their system. This covers a lot more than many connector developers realize: beyond managing the producer or consumer, it includes challenges like scalability, recovery from faults and reasoning about delivery guarantees, serialization, connector control, monitoring for ops, and more.
Standardize – configuration, status and connector control, monitoring, etc. Parallelism, scalability, and fault tolerance built in, without a lot of effort from connector developers or users.
Scale – in two ways. First, scale individual connectors to copy as much data as possible – ingest an entire database rather than one table at a time. Second, scale up to organization-wide data pipelines or down to development, testing, or just copying a single log file into Kafka.
With these goals in mind, let's explore the design of Kafka Connect to see how it fulfills them.
  10. At its core, Kafka Connect is pretty simple. It has source connectors which copy data from another system into Kafka, and sink connectors that copy data from Kafka into a destination system. Here I've shown a couple of examples. The source and sink systems don't necessarily have to naturally match Kafka's data model exactly. However, we do need to be able to translate data between the two. For example, we might load data from a database in a source connector. By using a timestamp column associated with each row, we can effectively generate an ordered stream of events that are then produced into Kafka. To store data into HDFS, we might load data from one or more topics in Kafka and then write it in sequence to files in an HDFS directory, rotating files periodically. Although Kafka Connect is designed around streaming data, because Kafka acts as a good buffer between streaming and batch systems, we can use it here to load data into HDFS. Neither of these systems maps directly to Kafka's model, but both can be adapted to the concepts of streams with offsets. More about this in a minute. The most important design point for Kafka Connect is that one half of a connection is always Kafka – the destination for sources, or the source of data for sink connectors. This allows the framework to handle the common functionality of connectors while maintaining the ability to automatically provide scalability, fault tolerance, and delivery guarantees without requiring a lot of effort from connector developers. This key assumption is what makes it possible for Kafka Connect to get a better set of tradeoffs than the systems I mentioned earlier.
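As a concrete illustration of the database-source example, a JDBC source connector is typically pointed at a timestamp (and/or incrementing ID) column so each table becomes an ordered stream. The keys below follow the community (Confluent) kafka-connect-jdbc connector and the values are placeholders; treat this as a sketch, not a tested configuration:

```java
import java.util.Map;

public class JdbcSourceConfigSketch {
    // Submitted as the "config" object of a connector definition via the Connect REST API.
    static final Map<String, String> CONFIG = Map.of(
            "connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url", "jdbc:postgresql://db-host:5432/shop", // placeholder database
            "mode", "timestamp",                    // use a timestamp column as the stream offset
            "timestamp.column.name", "updated_at",  // placeholder column
            "topic.prefix", "shop-",                // one topic per table: shop-<table>
            "tasks.max", "4"                        // upper bound on copy parallelism
    );
}
```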
  11. So now, coming back to the model that connectors need to map to. Just as Kafka's data model enables certain features around scalability, Kafka Connect's data model can as well. Kafka Connect requires every connector to map to a "partitioned stream" model. The basic idea is a generalization of Kafka's data model of topics and partitions. This mapping is defined by the input system for the connector – the source system for source connectors, and Kafka topics for sink connectors – and has the following:
A set of partitions which divide the whole set of data logically. Unlike Kafka, the number of partitions can potentially be very large and may be more dynamic than we would expect with Kafka.
Each partition contains an ordered sequence of events/messages. Under the hood these are key/value pairs with byte[], but Kafka Connect requires that they can be converted into a generic data API.
Each event/message has a unique offset representing its position in the partition. Since the mapping is determined by the input system, these offsets must be meaningful to that system – these may be quite different from the Kafka offsets you're used to.
  12. To give a more concrete example, we can revisit the database example from earlier. Previously I only showed a single table, but if we consider the database as a whole, we can apply this model to copy the entire database. We partition by table, delivering each into its own Kafka topic. Each event represents a row that we’ve inserted into the database. The offsets are IDs or timestamps, or even more complex representations like a combination of ID and timestamp. Although there isn’t *actually* a stream for each table, we can effectively construct one by querying the database and ordering results according to specific rules. As a result of this model, we can see a few properties emerging: First, we have a built-in concept of parallelism, a requirement for automatically providing scalable data copying. We’re going to be able to distribute processing of partitions across multiple hosts. Second, this model encourages making copying broad by default – partitioned streams should cover the largest logical collection of data. Finally, offsets provide an easy way to track which data has been processed and which still needs to be copied. In some cases, mapping from the native data model to streams may not be simple; however, a bit of effort in creating this mapping pays off by providing a common framework and implementation for tracking which data has been copied. Again, we’ll revisit this a bit later, but this allows the framework to handle a lot of the heavy lifting with regards to delivery semantics.
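In the Connect source API this mapping is explicit: every record a task returns from poll() carries a source partition (here, the table) and a source offset (here, a timestamp) that the framework checkpoints on the connector's behalf. A sketch with illustrative values:

```java
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;

public class RowToRecordSketch {
    static List<SourceRecord> example() {
        Map<String, String> sourcePartition = Map.of("table", "orders");     // logical partition
        Map<String, Long> sourceOffset = Map.of("timestamp", 1690000000000L); // position within it
        // Topic per table; a real connector would build a Struct schema from the row.
        return List.of(new SourceRecord(
                sourcePartition, sourceOffset,
                "shop-orders",
                Schema.STRING_SCHEMA, "order 1234, total 99.90"));
    }
}
```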
  13. Partitioned streams are the logical data model, but they don’t directly map to physical parallelism, or threads, in Kafka Connect. In the case of the database connector, a direct mapping might seem reasonable. However, some connectors will have a much larger number of partitions that are much finer-grained. For example, consider a connector for collecting metrics data – each metric might be considered its own partition, resulting in tens of thousands of partitions for even a small set of application servers. However, we do want to exploit the parallelism provided by partitions. Connectors do this by assigning partitions to tasks. Tasks are, simply, threads of control given to the connector code which perform the actual copying of data. Each connector is given a thread it can use to monitor the input system for the active set of partitions. Remember that this set can be dynamic, so continuous monitoring is sometimes needed to detect changes to the set of partitions. When there are changes, the connector notifies the framework so it can reconfigure the current set of tasks. Then, each task is given a dedicated thread for processing. The connector assigns a subset of partitions to each task and the task is the one that actually copies the data for that partition. Given the assignment, the connector implementer handles the reading or writing data from that set of partitions. And how do we decide how many tasks to generate? That’s up to the user, and it’s the primary way to control the total resources used by the connector. Since each task corresponds to a thread, the user can choose to dynamically increase or decrease the maximum number of tasks the connector may create in order to scale resource usage up or down. So now we have some set of threads, but where do they actually execute? Kafka Connect has two modes of execution.
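The partition-to-task assignment described here is what a connector computes in its taskConfigs(maxTasks) method: one configuration map per task, each naming the subset of partitions (tables, in the database example) that task should copy. A simplified sketch; the "tables" config key is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class TableAssignmentSketch {
    // Roughly what a database source connector's taskConfigs(maxTasks) computes:
    // spread the monitored tables round-robin over at most maxTasks task configurations.
    static List<Map<String, String>> taskConfigs(List<String> tables, int maxTasks) {
        int numTasks = Math.min(maxTasks, tables.size());
        List<StringBuilder> groups = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) groups.add(new StringBuilder());
        for (int i = 0; i < tables.size(); i++) {
            StringBuilder group = groups.get(i % numTasks);
            if (group.length() > 0) group.append(',');
            group.append(tables.get(i));
        }
        List<Map<String, String>> configs = new ArrayList<>();
        for (StringBuilder group : groups) {
            configs.add(Map.of("tables", group.toString())); // "tables" is a hypothetical key
        }
        return configs;
    }
}
```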
  14. In contrast, distributed mode can scale up while providing distribution and fault tolerance. Recall that each connector or task is a thread, and we're considering each to be approximately equal in terms of resource usage. Connectors and tasks are auto-balanced across workers. Failures are automatically handled by redistributing work, and you can easily scale the cluster up or down by adding or removing workers. Cool implementation note: this reuses the group membership functionality of consumer groups. Note how if you replace "worker" with "consumer" and "task" with "topic partition", the things it is doing look largely the same: assigning tasks to workers, detecting when a worker is added or fails, and rebalancing the work. Kafka already provides support for doing a lot of this, so by leveraging the existing implementation and coordinating through Kafka's group functionality (with internal data stored in Kafka topics), Kafka Connect can provide this functionality in a relatively small code footprint. All of this functionality can be accessed via the REST API – submit connectors, see their status, update configs, and so on. Finally, note that Kafka Connect does not own process management at all. We don't want to make assumptions about using Mesos, YARN, or any other tool because that would unnecessarily limit Kafka Connect's usage. Kafka Connect will work out of the box in any of these cluster management systems, or with orchestration tools, or if you just manage your processes with your own tooling.
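For example, in addition to submitting connectors, recent Connect versions let you ask a worker for a connector's status over the same REST API (connector name and host are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectorStatus {
    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors/local-file-source/status"))
                .GET()
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        // JSON describing the connector's state and the state and worker of each of its tasks.
        System.out.println(response.body());
    }
}
```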
  17. I want to mention two important features that also simplify both connector developers' and users' lives. The first feature is offset management, which provides for standardized data delivery guarantees. Delivery guarantees are actually rarely provided in many other systems. They generally offer some sort of best effort, but unreliable, delivery. Ironically, stream processing frameworks often do a better job than tools specifically designed for data copying. Kafka Connect handles offset checkpointing for connectors, and this fits in as a natural extension to Kafka's offset commit functionality. For sources this works with offsets that have complex structure (e.g. timestamps + autoincrementing IDs in a database) and requires no implementation support from the connector beyond defining the offsets and being able to start reading from a saved offset. For sinks, we can leverage Kafka's existing offset functionality, but in order to ensure data is completely written, sinks must also support a flush operation. Commits are automatically processed periodically. By default, this mode of managing offsets will provide at-least-once delivery; internally both sources and sinks are simply flushing all data to the output and then committing offsets. Note that some connectors will opt out of this functionality in order to provide even stronger guarantees. For example, the HDFS connector manages its own offsets because (carefully) tracking them in HDFS along with the data allows for exactly-once delivery.
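For sinks, the flush hook looks roughly like this: the framework delivers records through put(), then periodically calls flush() with the offsets it is about to commit, and the task must make everything up to those offsets durable first. A skeletal sketch of a sink task:

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class LoggingSinkTask extends SinkTask {
    @Override public String version() { return "0.1-sketch"; }

    @Override public void start(Map<String, String> config) { /* open a connection to the target system */ }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Buffer or write records to the destination system.
        for (SinkRecord record : records) {
            System.out.printf("%s-%d@%d: %s%n",
                    record.topic(), record.kafkaPartition(), record.kafkaOffset(), record.value());
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Make everything delivered so far durable before Connect commits these offsets:
        // this is what gives the framework its at-least-once guarantee for sinks.
    }

    @Override public void stop() { /* close the connection */ }
}
```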
  18. The second feature I want to mention is converters. Serialization formats may seem like a minor detail, but not separating the details of data serialization in Kafka from the details of source or sink systems results in a lot of inefficiency: a lot of code for doing simple data conversions is duplicated across a large number of ad hoc connector implementations, and each connector ultimately contains its own set of serialization options as it is used in more environments – JSON, Avro, Thrift, protobufs, and more. Much like the serializers in Kafka's producer and consumer, Converters abstract away the details of serialization. Converters are different because they guarantee data is transformed to a common data API defined by Kafka Connect. This API supports both schema and schemaless data, common primitive data types, complex types like structs, and logical type extensions. By sharing this API, connectors write one set of translation code and Converters handle format-specific details. For example, the JDBC connector can easily be used to produce either JSON or Avro to Kafka, without any format-specific code in the connector.
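Converters are chosen in the worker (or per-connector) configuration rather than in connector code. A sketch of typical settings, shown as a map for illustration; the Avro converter mentioned in the comment is part of Confluent's distribution rather than Apache Kafka:

```java
import java.util.Map;

public class ConverterConfigSketch {
    // Typical worker-level settings; every connector on the worker inherits them unless overridden.
    static final Map<String, String> CONVERTERS = Map.of(
            "key.converter", "org.apache.kafka.connect.json.JsonConverter",
            "value.converter", "org.apache.kafka.connect.json.JsonConverter",
            "value.converter.schemas.enable", "false"
            // Alternative (Confluent Platform): value.converter = io.confluent.connect.avro.AvroConverter,
            // plus value.converter.schema.registry.url = http://schema-registry:8081
    );
}
```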
  19. Kafka Connect provides the framework, but I want to spend a few minutes describing the current state of the connector ecosystem. While the framework ships with Apache Kafka, connectors use a federated approach to development. Confluent helped kick off connector development with a few key open source connectors – JDBC, for importing data from any relational database, and HDFS, for exactly-once delivery of data into HDFS and Hive. Confluent will be continuing to add more open source connectors. We've also started tracking connectors that the community has been developing on a page we're calling the Connector Hub. We've already got a dozen or so connectors, and more are popping up every week. We'll be working to make this index as useful to users as possible, offering information about the current state of the connector implementations and feature sets.
  20. With all these pieces you can see how we can tie together Kafka and Kafka Connect with stream processing frameworks and applications to not only simplify building these data pipelines and solve data integration challenges, but also transform how your company manages its data pipelines. Kafka provides the central hub for real-time data and Kafka Connect simplifies operationalization: one service to maintain, common metrics, common monitoring, and agnostic to your choice of process and cluster management. You can run a centrally managed Kafka Connect cluster in distributed mode, accessed via the REST API, allowing your ops team to provide data integration as a service to your entire organization. Developers who want to build a complex data pipeline can submit jobs to copy data into and out of Kafka – it's zero coding (assuming a connector is available). Then they can easily leverage either the traditional clients or stream processing frameworks to transform that data. The output is stored back into another Kafka topic or served up directly. As a side benefit, standardizing on Kafka encourages reuse of existing data (both raw and transformed). Providing this service not only makes it easy to build your *own* complex data pipeline, it encourages other people in the org to build on top of your existing work. Confluent Platform also provides additional tools that make this setup even more powerful. For example, the schema registry controls the format of data in each topic, and besides ensuring data quality and compatibility, it also encourages decoupling of teams by allowing anyone to discover what data is in a topic, grab its schema, and immediately start utilizing that data without ever adding coordination overhead with another team. A stream data platform built around Kafka and Kafka Connect allows you to scale to handle your entire organization's real-time data, while maintaining simple management and easy operationalization of your data pipeline.