First slide
1) Apache Flume is a distributed, highly available service that can collect and move large amounts of streaming data from one location to another.
2) Most frequently it is used to deliver log data into HDFS.
Second slide
1) Event and Client are the logical components of Flume.
2) An Event is a singular unit of data that can be transported by Flume NG from its source to its destination.
3) Typically an Event is composed of zero or more headers and a body. The headers are used for contextual routing: based on the header values, an event can be routed to the next eligible destination.
4) A Client is an event generator. It generates events and sends them to one or more agents.
Eg: Apache web servers, which continuously generate a huge amount of log data.
Third slide
1) A Flume agent is a JVM daemon process that hosts all Flume-NG components: Sources, Channels, Sinks, etc.
2) The Source puts events on a Channel, which stores them until a Sink takes them and sends them on.
Fourth slide
1) Source is an active component, which receives data from different locations and places it on one or more Channels.
2) The declaration of the source component in the “.conf” file of agent “a1” is listed here; s1 is the Source name and a1 is the agent name.
a1.sources=s1
# netcat is one of the built-in Source types
a1.sources.s1.type=netcat
3) Different Source types are available, such as pollable sources (self-driven, e.g. an exec source running “tail -F” or a sequence-generator source), event-driven sources, and the netcat source.
4) We can even write our own Source type and specify its custom class name as the source's type parameter.
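As a sketch, a custom source is wired in by giving its fully qualified class name as the type; the class name below is hypothetical:

```properties
a1.sources=s1
# hypothetical custom class; any class implementing a Flume source can be named here
a1.sources.s1.type=com.example.flume.MyCustomSource
```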
Fifth slide
1) A channel is a bridge between the Source and the Sink.
2) The channel stores events from the Source and later hands them to the Sink.
3) There are three different types of channels: the memory channel, which is very fast but offers no guarantee against data loss; the file channel, which stores events on the file system before sending them to the sink; and the database channel, which stores events in a database.
4) A single Channel can be connected to any number of Sources and Sinks.
Sixth slide
1) A sink receives events from one channel only.
3. Agenda
• What is Flume?
• Core flume-ng Concepts.
• Flow Reliability in Flume.
• Starting an Agent.
4. What is Flume?
• Apache Flume is a distributed and reliable service
for efficiently collecting, aggregating, and moving
large amounts of log data from one place to
another.
• Its main goal is to deliver streaming data from
applications into Apache Hadoop's HDFS, the
most common destination.
5. Core Concepts: Event, Client
Event:- An Event is a singular unit of data that can
be transported by Flume NG from origin to its final
destination.
An event is composed of zero or more headers
and a body; the headers are used for contextual routing.
Client:- An entity that generates events and sends
them to one or more agents.
Apache web servers - which generate huge
amounts of log files on a daily basis.
A logging package, like a log4j appender, that
sends events directly to Flume NG's source.
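As a sketch, Flume ships a log4j appender that can forward application log events to an agent's Avro source; a minimal log4j.properties fragment (hostname and port are illustrative) might look like:

```properties
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=localhost
log4j.appender.flume.Port=41414
log4j.rootLogger=INFO, flume
```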
6. Core Concepts: agent
• A Flume agent is a JVM daemon process
that hosts all Flume-NG components:
Sources, Channels, Sinks, etc.
7. Core Components: Source
• Source is an active component, which receives data
from different locations and places it on one or more
Channels.
Different Source types:-
1) Pollable source (self-driven): Exec, SEQ.
2) Event-driven source: e.g. the Avro source, which accepts Avro
RPC calls and converts the RPC payload into a Flume
event.
3) Netcat source: listens on a port, like the ‘nc’ command-line tool
running in server mode; syslog sources are similar.
a1.sources=s1
a1.sources.s1.type=netcat
8. Core Components: Channel
• A channel is the glue between a Source and a Sink.
A channel may be in memory, which is fast but
makes no guarantee against data loss, or it may be
file- or database-backed (fully durable), where every
event is guaranteed to be delivered to the
connected sink even in failure cases like power
loss.
A single Channel can be connected to any number
of Sources and Sinks.
a1.channels=c1
a1.channels.c1.type=memory
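For durability, the same channel could instead be declared as a file channel; a minimal sketch (the directory paths are illustrative):

```properties
a1.channels=c1
a1.channels.c1.type=file
# optional: where the channel keeps its checkpoint and event data
a1.channels.c1.checkpointDir=/var/flume/checkpoint
a1.channels.c1.dataDirs=/var/flume/data
```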
9. Core Components: Sink
• For a flat Flume NG agent, the sink is the destination for
data. The sink removes events from the
channel and transmits them to the next eligible
destination (if one exists).
Built-in Sinks:-
1) hdfs, which writes events to HDFS.
2) logger, which simply logs all events received.
3) null, an auto-consuming sink that discards events. … etc.
a1.sinks=k1
a1.sinks.k1.type=logger
10. Interceptors
• Interceptors: An interceptor is a point in your
data flow where you can inspect and modify Flume
events (for example, adding headers that later drive
routing). You can chain zero or more interceptors;
they run after the source creates an event and
before the event is written to the channel.
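As a minimal sketch, the built-in timestamp interceptor could be chained onto source s1 like this:

```properties
a1.sources.s1.interceptors=i1
# adds a timestamp header to every event as it leaves the source
a1.sources.s1.interceptors.i1.type=timestamp
```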
11. channel selectors
• Channel selectors are responsible for how data
moves from a source to one or more channels.
There are two built-in channel selectors.
1) A replicating channel selector (the default)
simply puts a copy of the event into each channel,
assuming you have configured more than one.
2) A multiplexing channel selector can write to
different channels depending on certain header
information (contextual routing).
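A multiplexing selector might be configured like this sketch (the “state” header and its values are illustrative):

```properties
a1.sources.s1.channels=c1 c2
a1.sources.s1.selector.type=multiplexing
a1.sources.s1.selector.header=state
# events whose "state" header is CA go to c1, NY to c2; others to c1
a1.sources.s1.selector.mapping.CA=c1
a1.sources.s1.selector.mapping.NY=c2
a1.sources.s1.selector.default=c1
```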
12. sink processor
• A Sink Processor is responsible for invoking one
sink from an assigned group of sinks. The
Sink Processor is invoked by the sink runner.
Built-in Sink Processors:-
1) Load Balancing Sink Processor
2) Failover Sink Processor
3) Default Sink Processor
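For example, a failover sink processor over a two-sink group could be sketched as follows (sink names and priorities are illustrative):

```properties
a1.sinkgroups=g1
a1.sinkgroups.g1.sinks=k1 k2
a1.sinkgroups.g1.processor.type=failover
# the higher-priority sink is used until it fails, then the next one takes over
a1.sinkgroups.g1.processor.priority.k1=10
a1.sinkgroups.g1.processor.priority.k2=5
```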
13. Flow Reliability in Flume
An event is removed from the channel (a passive
component) only when the sink commits/ends its
transaction; until then, the channel retains the
event so it can be redelivered after a failure.
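Putting the per-slide snippets together, a minimal single-agent configuration (netcat source → memory channel → logger sink) might look like this sketch:

```properties
# name the components of agent a1
a1.sources=s1
a1.channels=c1
a1.sinks=k1

# netcat source listening on an illustrative address/port
a1.sources.s1.type=netcat
a1.sources.s1.bind=localhost
a1.sources.s1.port=44444
a1.sources.s1.channels=c1

# in-memory channel (fast, but not durable)
a1.channels.c1.type=memory

# logger sink attached to exactly one channel
a1.sinks.k1.type=logger
a1.sinks.k1.channel=c1
```

Such an agent is typically started with the flume-ng launcher, e.g. `bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console`.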