SlideShare a Scribd company logo
1 of 26
1
Headline Goes Here
Speaker Name or Subhead Goes Here
DO NOT USE PUBLICLY
PRIOR TO 10/23/12
Real Time Data Ingest into Hadoop
using Flume
Hari Shreedharan
Committer and PMC Member, Apache Flume
Software Engineer, Cloudera
What is Flume
• Collection, Aggregation of streaming Event Data
• Typically used for log data
• Significant advantages over ad-hoc solutions
• Reliable, Scalable, Manageable, Customizable and High
Performance
• Declarative, Dynamic Configuration
• Contextual Routing
• Feature rich
• Fully extensible
2
Core Concepts: Event
An Event is the fundamental unit of data transported by Flume
from its point of origination to its final destination. Event is a
byte array payload accompanied by optional headers.
• Payload is opaque to Flume
• Headers are specified as an unordered collection of string key-
value pairs, with keys being unique across the collection
• Headers can be used for contextual routing
3
Core Concepts: Client
An entity that generates events and sends them to one or more Agents.
• Example
• Flume log4j Appender
• Custom Client using Client SDK (org.apache.flume.api)
• Embedded Agent – An agent embedded within your application
• Decouples Flume from the system where event data is consumed
from
• Not needed in all cases
4
Core Concepts: Agent
A container for hosting Sources, Channels, Sinks and other
components that enable the transportation of events from one
place to another.
• Fundamental part of a Flume flow
• Provides Configuration, Life-Cycle Management, and
Monitoring Support for hosted components
5
Typical Aggregation Flow
6
[Client]+  Agent [ Agent]*  Destination
Core Concepts: Source
An active component that receives events from a specialized
location or mechanism and places it on one or Channels.
• Different Source types:
• Specialized sources for integrating with well-known systems.
Example: Syslog, Netcat
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro
• Require at least one channel to function
7
Sources
• Different Source types:
• Specialized sources for integrating with well-known systems.
Example: Spooling Files, Syslog, Netcat, JMS
• Auto-Generating Sources: Exec, SEQ
• IPC sources for Agent-to-Agent communication: Avro, Thrift
• Require at least one channel to function
8
Core Concepts: Channel
A passive component that buffers the incoming events until they are
drained by Sinks.
• Different Channels offer different levels of persistence:
• Memory Channel: volatile
• Data lost if JVM or machine restarts
• File Channel: backed by WAL implementation
• Data not lost unless the disk dies.
• Eventually, when the agent comes back data can be accessed.
• Channels are fully transactional
• Provide weak ordering guarantees
• Can work with any number of Sources and Sinks.
9
Core Concepts: Sink
An active component that removes events from a Channel and
transmits them to their next hop destination.
• Different types of Sinks:
• Terminal sinks that deposit events to their final destination. For
example: HDFS, HBase, Morphline-Solr, Elastic Search
• Sinks support serialization to user’s preferred formats.
• HDFS sink supports time-based and arbitrary bucketing of data while
writing to HDFS.
• IPC sink for Agent-to-Agent communication: Avro, Thrift
• Require exactly one channel to function
10
Flow Reliability
Reliability based on:
• Transactional Exchange between Agents
• Persistence Characteristics of Channels in the Flow
Also Available:
• Built-in Load balancing Support
• Built-in Failover Support
11
Flow Reliability
Normal Flow
Communication Failure between Agents
Communication Restored, Flow back to Normal
12
Flow Handling
Channels decouple impedance of upstream and downstream
• Upstream burstiness is damped by channels
• Downstream failures are transparently absorbed by channels
 Sizing of channel capacity is key in realizing these benefits
13
Configuration
• Java Properties File Format
# Comment line
key1 = value
key2 = multi-line 
value
• Hierarchical, Name Based Configuration
agent1.channels.myChannel.type = FILE
agent1.channels.myChannel.capacity = 1000
• Uses soft references for establishing associations
agent1.sources.mySource.type = HTTP
agent1.sources.mySource.channels = myChannel
14
Configuration
Typical Deployment
• All agents in a specific
tier could be given the
same name
• One configuration file
with entries for three
agents can be used
throughout
15
Contextual Routing
Achieved using Interceptors and Channel Selectors
16
Interceptors
Interceptor
An Interceptor is a component applied to a source in pre-specified order
to enable decorating and filtering of events where necessary.
• Built-in Interceptors allow adding headers such as timestamps,
hostname, static markers etc.
• Custom interceptors can introspect event payload to create specific
headers where necessary
17
Contextual Routing
Channel Selector
A Channel Selector allows a Source to select one or more
Channels from all the Channels that the Source is configured with
based on preset criteria.
• Built-in Channel Selectors:
• Replicating: for duplicating the events
• Multiplexing: for routing based on headers
18
Contextual Routing
• Terminal Sinks can directly use Headers to make destination
selections
• HDFS Sink can use headers values to create dynamic path for files
that the event will be added to.
• Some headers such as timestamps can be used in a more
sophisticated manner
• Custom Channel Selector can be used for doing specialized
routing where necessary
19
Load Balancing and Failover
Sink Processor
A Sink Processor is responsible for invoking one sink from an
assigned group of sinks.
• Built-in Sink Processors:
• Load Balancing Sink Processor – using RANDOM, ROUND_ROBIN
or Custom selection algorithm
• Failover Sink Processor
• Default Sink Processor
20
Sink Processor
• Invoked by Sink Runner
• Acts as a proxy for a Sink
21
Sink Processor Configuration
• A Sink can exist in at most one group
• A Sink that is not in any group is handled via Default Sink
Processor
• Caution:
Removing a Sink Group does not make the sinks inactive!
22
Client API
• Simple API that can be used to send data to Flume agents.
• Simplest form – send a batch of events to one agent.
• Can be used to send data to multiple agents in a round-robin,
random or failover fashion (send data to one till it fails).
• Java only.
• flume.thrift can be used to generate code for other
languages.
• Use with Thrift source.
23
Embedded Agent
• Client API throws exceptions if data could not be sent.
• Applications may not be able to tolerate this.
• Embedded Agent – A (limited) Flume agent within your application
• Has a channel – so buffers data in-memory or on-disk till the data is
sent or the channel is full.
• Throws exceptions only if the channel is full (or error writing to
channel).
• Cushion for application if something causes data to be stuck within
the application
• Supports sending data to other Flume agents only, no HDFS, HBase
etc.
24
Summary
• Clients send Events to Agents
• Agents hosts number Flume components – Source, Interceptors, Channel
Selectors, Channels, Sink Processors, and Sinks.
• Sources and Sinks are active components, where as Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not
filtered, puts them on channel(s) selected by the configured Channel
Selector
• Sink Processor identifies a sink to invoke, that can take Events from a
Channel and send it to its next hop destination
• Channel operations are transactional to guarantee one-hop delivery
semantics
• Channel persistence allows for ensuring end-to-end reliability
25
26
Questions?

More Related Content

What's hot

Scalable and Available, Patterns for Success
Scalable and Available, Patterns for SuccessScalable and Available, Patterns for Success
Scalable and Available, Patterns for SuccessDerek Collison
 
Clock synchronization in distributed system
Clock synchronization in distributed systemClock synchronization in distributed system
Clock synchronization in distributed systemSunita Sahu
 
Java Collections | Collections Framework in Java | Java Tutorial For Beginner...
Java Collections | Collections Framework in Java | Java Tutorial For Beginner...Java Collections | Collections Framework in Java | Java Tutorial For Beginner...
Java Collections | Collections Framework in Java | Java Tutorial For Beginner...Edureka!
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Content provider in_android
Content provider in_androidContent provider in_android
Content provider in_androidPRITI TELMORE
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesLorenzo Alberton
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsBryan Bende
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Data Warehouse Testing: It’s All about the Planning
Data Warehouse Testing: It’s All about the PlanningData Warehouse Testing: It’s All about the Planning
Data Warehouse Testing: It’s All about the PlanningTechWell
 
Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)
Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)
Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)Alexandre Dutra
 

What's hot (20)

Scalable and Available, Patterns for Success
Scalable and Available, Patterns for SuccessScalable and Available, Patterns for Success
Scalable and Available, Patterns for Success
 
Obiee training
Obiee trainingObiee training
Obiee training
 
Java adapter
Java adapterJava adapter
Java adapter
 
Clock synchronization in distributed system
Clock synchronization in distributed systemClock synchronization in distributed system
Clock synchronization in distributed system
 
Java Collections | Collections Framework in Java | Java Tutorial For Beginner...
Java Collections | Collections Framework in Java | Java Tutorial For Beginner...Java Collections | Collections Framework in Java | Java Tutorial For Beginner...
Java Collections | Collections Framework in Java | Java Tutorial For Beginner...
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
CS3391 -OOP -UNIT – V NOTES FINAL.pdf
CS3391 -OOP -UNIT – V NOTES FINAL.pdfCS3391 -OOP -UNIT – V NOTES FINAL.pdf
CS3391 -OOP -UNIT – V NOTES FINAL.pdf
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Content provider in_android
Content provider in_androidContent provider in_android
Content provider in_android
 
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle TreesModern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
Modern Algorithms and Data Structures - 1. Bloom Filters, Merkle Trees
 
Unit 3
Unit 3Unit 3
Unit 3
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Java Collections Framework
Java Collections FrameworkJava Collections Framework
Java Collections Framework
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
loaders and linkers
 loaders and linkers loaders and linkers
loaders and linkers
 
Data Warehouse Testing: It’s All about the Planning
Data Warehouse Testing: It’s All about the PlanningData Warehouse Testing: It’s All about the Planning
Data Warehouse Testing: It’s All about the Planning
 
Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)
Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)
Staying Ahead of the Curve with Spring and Cassandra 4 (SpringOne 2020)
 

Viewers also liked

2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5Samuel Rash
 
Best practice instagram
Best practice   instagramBest practice   instagram
Best practice instagramWooseung Kim
 
줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승Wooseung Kim
 
2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료Wooseung Kim
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래Wooseung Kim
 
The Future of Everything
The Future of EverythingThe Future of Everything
The Future of EverythingMichael Ducy
 
Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Wooseung Kim
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

Viewers also liked (8)

2011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v52011 06-30-hadoop-summit v5
2011 06-30-hadoop-summit v5
 
Best practice instagram
Best practice   instagramBest practice   instagram
Best practice instagram
 
줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승줌인터넷 빅데이터 활용사례 김우승
줌인터넷 빅데이터 활용사례 김우승
 
2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료2012 빅데이터 big data 발표자료
2012 빅데이터 big data 발표자료
 
빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래빅데이터 플랫폼 새로운 미래
빅데이터 플랫폼 새로운 미래
 
The Future of Everything
The Future of EverythingThe Future of Everything
The Future of Everything
 
Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)Bitcoin 2.0(blockchain technology 2)
Bitcoin 2.0(blockchain technology 2)
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similar to Cloudera's Flume

Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeYahoo Developer Network
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil DubeySwapnil Dubey
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an IntroductionErik Schmiegelow
 
Flume DS -JSP.pptx
Flume DS -JSP.pptxFlume DS -JSP.pptx
Flume DS -JSP.pptxJayesh Patil
 
AdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopJeyamariappan Guru
 
Module notes artificial intelligence and
Module notes artificial intelligence andModule notes artificial intelligence and
Module notes artificial intelligence andbhagyavantrajapur88
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application Apache Apex
 
Ch 13: Attacking Other Users: Other Techniques (Part 1)
Ch 13: Attacking Other Users:  Other Techniques (Part 1)Ch 13: Attacking Other Users:  Other Techniques (Part 1)
Ch 13: Attacking Other Users: Other Techniques (Part 1)Sam Bowne
 
Ch 10: Attacking Back-End Components
Ch 10: Attacking Back-End ComponentsCh 10: Attacking Back-End Components
Ch 10: Attacking Back-End ComponentsSam Bowne
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...apidays
 
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...StreamNative
 
Session 09 - Flume
Session 09 - FlumeSession 09 - Flume
Session 09 - FlumeAnandMHadoop
 

Similar to Cloudera's Flume (20)

Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Flume basic
Flume basicFlume basic
Flume basic
 
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache FlumeFeb 2013 HUG: Large Scale Data Ingest Using Apache Flume
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Flume
FlumeFlume
Flume
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
 
Flume DS -JSP.pptx
Flume DS -JSP.pptxFlume DS -JSP.pptx
Flume DS -JSP.pptx
 
AdroitLogic UltraESB
AdroitLogic UltraESBAdroitLogic UltraESB
AdroitLogic UltraESB
 
Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
Module notes artificial intelligence and
Module notes artificial intelligence andModule notes artificial intelligence and
Module notes artificial intelligence and
 
Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application  Introduction to Apache Apex and writing a big data streaming application
Introduction to Apache Apex and writing a big data streaming application
 
Ch 13: Attacking Other Users: Other Techniques (Part 1)
Ch 13: Attacking Other Users:  Other Techniques (Part 1)Ch 13: Attacking Other Users:  Other Techniques (Part 1)
Ch 13: Attacking Other Users: Other Techniques (Part 1)
 
Ch 10: Attacking Back-End Components
Ch 10: Attacking Back-End ComponentsCh 10: Attacking Back-End Components
Ch 10: Attacking Back-End Components
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
INTERFACE by apidays 2023 - Data Collection Basics, Anais Dotis-Georgiou, Inf...
 
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
How Tencent Applies Apache Pulsar to Apache InLong —— A Streaming Data Integr...
 
Session 09 - Flume
Session 09 - FlumeSession 09 - Flume
Session 09 - Flume
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 

Recently uploaded (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Cloudera's Flume

  • 1. 1 Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Real Time Data Ingest into Hadoop using Flume Hari Shreedharan Committer and PMC Member, Apache Flume Software Engineer, Cloudera
  • 2. What is Flume • Collection, Aggregation of streaming Event Data • Typically used for log data • Significant advantages over ad-hoc solutions • Reliable, Scalable, Manageable, Customizable and High Performance • Declarative, Dynamic Configuration • Contextual Routing • Feature rich • Fully extensible 2
  • 3. Core Concepts: Event An Event is the fundamental unit of data transported by Flume from its point of origination to its final destination. Event is a byte array payload accompanied by optional headers. • Payload is opaque to Flume • Headers are specified as an unordered collection of string key- value pairs, with keys being unique across the collection • Headers can be used for contextual routing 3
  • 4. Core Concepts: Client An entity that generates events and sends them to one or more Agents. • Example • Flume log4j Appender • Custom Client using Client SDK (org.apache.flume.api) • Embedded Agent – An agent embedded within your application • Decouples Flume from the system where event data is consumed from • Not needed in all cases 4
  • 5. Core Concepts: Agent A container for hosting Sources, Channels, Sinks and other components that enable the transportation of events from one place to another. • Fundamental part of a Flume flow • Provides Configuration, Life-Cycle Management, and Monitoring Support for hosted components 5
  • 6. Typical Aggregation Flow 6 [Client]+  Agent [ Agent]*  Destination
  • 7. Core Concepts: Source An active component that receives events from a specialized location or mechanism and places it on one or Channels. • Different Source types: • Specialized sources for integrating with well-known systems. Example: Syslog, Netcat • Auto-Generating Sources: Exec, SEQ • IPC sources for Agent-to-Agent communication: Avro • Require at least one channel to function 7
  • 8. Sources • Different Source types: • Specialized sources for integrating with well-known systems. Example: Spooling Files, Syslog, Netcat, JMS • Auto-Generating Sources: Exec, SEQ • IPC sources for Agent-to-Agent communication: Avro, Thrift • Require at least one channel to function 8
  • 9. Core Concepts: Channel A passive component that buffers the incoming events until they are drained by Sinks. • Different Channels offer different levels of persistence: • Memory Channel: volatile • Data lost if JVM or machine restarts • File Channel: backed by WAL implementation • Data not lost unless the disk dies. • Eventually, when the agent comes back data can be accessed. • Channels are fully transactional • Provide weak ordering guarantees • Can work with any number of Sources and Sinks. 9
  • 10. Core Concepts: Sink An active component that removes events from a Channel and transmits them to their next hop destination. • Different types of Sinks: • Terminal sinks that deposit events to their final destination. For example: HDFS, HBase, Morphline-Solr, Elastic Search • Sinks support serialization to user’s preferred formats. • HDFS sink supports time-based and arbitrary bucketing of data while writing to HDFS. • IPC sink for Agent-to-Agent communication: Avro, Thrift • Require exactly one channel to function 10
  • 11. Flow Reliability Reliability based on: • Transactional Exchange between Agents • Persistence Characteristics of Channels in the Flow Also Available: • Built-in Load balancing Support • Built-in Failover Support 11
  • 12. Flow Reliability Normal Flow Communication Failure between Agents Communication Restored, Flow back to Normal 12
  • 13. Flow Handling Channels decouple impedance of upstream and downstream • Upstream burstiness is damped by channels • Downstream failures are transparently absorbed by channels  Sizing of channel capacity is key in realizing these benefits 13
  • 14. Configuration • Java Properties File Format # Comment line key1 = value key2 = multi-line value • Hierarchical, Name Based Configuration agent1.channels.myChannel.type = FILE agent1.channels.myChannel.capacity = 1000 • Uses soft references for establishing associations agent1.sources.mySource.type = HTTP agent1.sources.mySource.channels = myChannel 14
  • 15. Configuration Typical Deployment • All agents in a specific tier could be given the same name • One configuration file with entries for three agents can be used throughout 15
  • 16. Contextual Routing Achieved using Interceptors and Channel Selectors 16
  • 17. Interceptors Interceptor An Interceptor is a component applied to a source in pre-specified order to enable decorating and filtering of events where necessary. • Built-in Interceptors allow adding headers such as timestamps, hostname, static markers etc. • Custom interceptors can introspect event payload to create specific headers where necessary 17
  • 18. Contextual Routing Channel Selector A Channel Selector allows a Source to select one or more Channels from all the Channels that the Source is configured with based on preset criteria. • Built-in Channel Selectors: • Replicating: for duplicating the events • Multiplexing: for routing based on headers 18
  • 19. Contextual Routing • Terminal Sinks can directly use Headers to make destination selections • HDFS Sink can use headers values to create dynamic path for files that the event will be added to. • Some headers such as timestamps can be used in a more sophisticated manner • Custom Channel Selector can be used for doing specialized routing where necessary 19
  • 20. Load Balancing and Failover Sink Processor A Sink Processor is responsible for invoking one sink from an assigned group of sinks. • Built-in Sink Processors: • Load Balancing Sink Processor – using RANDOM, ROUND_ROBIN or Custom selection algorithm • Failover Sink Processor • Default Sink Processor 20
  • 21. Sink Processor • Invoked by Sink Runner • Acts as a proxy for a Sink 21
  • 22. Sink Processor Configuration • A Sink can exist in at most one group • A Sink that is not in any group is handled via Default Sink Processor • Caution: Removing a Sink Group does not make the sinks inactive! 22
  • 23. Client API • Simple API that can be used to send data to Flume agents. • Simplest form – send a batch of events to one agent. • Can be used to send data to multiple agents in a round-robin, random or failover fashion (send data to one till it fails). • Java only. • flume.thrift can be used to generate code for other languages. • Use with Thrift source. 23
  • 24. Embedded Agent • Client API throws exceptions if data could not be sent. • Applications may not be able to tolerate this. • Embedded Agent – A (limited) Flume agent within your application • Has a channel – so buffers data in-memory or on-disk till the data is sent or the channel is full. • Throws exceptions only if the channel is full (or error writing to channel). • Cushion for application if something causes data to be stuck within the application • Supports sending data to other Flume agents only, no HDFS, HBase etc. 24
  • 25. Summary • Clients send Events to Agents • Agents hosts number Flume components – Source, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks. • Sources and Sinks are active components, where as Channels are passive • Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on channel(s) selected by the configured Channel Selector • Sink Processor identifies a sink to invoke, that can take Events from a Channel and send it to its next hop destination • Channel operations are transactional to guarantee one-hop delivery semantics • Channel persistence allows for ensuring end-to-end reliability 25