Apache Flume

Arinto Murdopo
Josep Subirats
Group 4
EEDC 2012

Outline
● Current problem
● What is Apache Flume?
● The Flume Model
○ Flows and Nodes
○ Agent, Processor and Collector Nodes
○ Data and Control Path
● Flume goals
○ Reliability
○ Scalability
○ Extensibility
○ Manageability
● Use case: Near Realtime Aggregator

Current Problem
● Situation:
You have hundreds of services running on different servers
that produce many large logs, which should be analyzed
together. You have Hadoop to process them.

● Problem:
How do I send all my logs to a place that has Hadoop? I
need a reliable, scalable, extensible and manageable way
to do it!

What is Apache Flume?
● It is a distributed data collection service that collects
flows of data (such as logs) from their sources and
aggregates them where they will be processed.
● Goals: reliability, scalability, extensibility,
manageability.

Exactly what I needed!

The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server
logs, machine monitoring metrics...).
● Flows are composed of nodes chained together (see
slide 7, "Agent, Processor and Collector Nodes").

The Flume Model: Flows and Nodes
● In a node, data comes in through a source...
○ Examples: Console, Exec, Syslog, IRC,
Twitter, other nodes...
● ...is optionally processed by one or more decorators...
○ Examples: wire batching, compression,
sampling, projection, extraction...
● ...and is then transmitted out via a sink.
○ Examples: Console, local files, HDFS, S3,
other nodes...
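
The source, decorator and sink chain of a node can be sketched in a few lines of Python. This is only an illustration of the model, not Flume's API: list_source, sample_every, to_upper and run_node are hypothetical names invented for this example.

```python
from typing import Callable, Iterable, Iterator, List

Event = str
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[Event]) -> Iterator[Event]:
    """Hypothetical source: yields events from an in-memory list."""
    yield from lines


def sample_every(n: int) -> Decorator:
    """Hypothetical sampling decorator: keeps every n-th event."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for index, event in enumerate(events):
            if index % n == 0:
                yield event
    return decorate


def to_upper(events: Iterator[Event]) -> Iterator[Event]:
    """Hypothetical transformation decorator: upper-cases each event."""
    for event in events:
        yield event.upper()


def run_node(source: Iterator[Event],
             decorators: List[Decorator],
             sink: Callable[[Event], None]) -> None:
    """Pull events from the source, apply decorators in order, push to the sink."""
    events = source
    for decorate in decorators:
        events = decorate(events)
    for event in events:
        sink(event)


collected: List[Event] = []
run_node(list_source(["a", "b", "c", "d"]),
         [sample_every(2), to_upper],
         collected.append)
print(collected)  # ['A', 'C']
```

Because the sink is just a callable, the sink of one node could feed the source of the next, which is how nodes chain into a flow.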

The Flume Model: Agent, Processor and
Collector Nodes

● Agent:
receives data from an
application.

● Processor (optional):
intermediate processing.

● Collector:
writes data to permanent
storage.
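
A minimal sketch of the three tiers, assuming each node simply forwards events to the next one. The class names mirror the roles above, but the receive method, the event dictionaries and the host names are assumptions made for illustration.

```python
class Collector:
    """Last tier: writes events to permanent storage (a list stands in for HDFS)."""

    def __init__(self):
        self.storage = []

    def receive(self, event):
        self.storage.append(event)


class Processor:
    """Optional middle tier: transforms each event, then forwards it."""

    def __init__(self, downstream, transform):
        self.downstream = downstream
        self.transform = transform

    def receive(self, event):
        self.downstream.receive(self.transform(event))


class Agent:
    """First tier: receives raw lines from an application and tags them."""

    def __init__(self, host, downstream):
        self.host = host
        self.downstream = downstream

    def receive(self, line):
        self.downstream.receive({"host": self.host, "body": line})


def strip_body(event):
    """Hypothetical intermediate processing: trim whitespace from the body."""
    return {**event, "body": event["body"].strip()}


collector = Collector()
processor = Processor(collector, strip_body)
web1 = Agent("web1", processor)
web2 = Agent("web2", processor)

web1.receive("GET /index \n")
web2.receive("GET /login")
print(collector.storage)
```

Many agents share one processor and one collector here, which is the aggregation pattern the slide describes.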

The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.

The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple masters: coordinated via ZooKeeper (ZK).
● Specify sources, sinks and control data flows.

Flume Goals: Reliability
Tunable Failure Recovery Modes

● Best Effort

● Store on Failure and Retry

● End to End Reliability
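
The difference between the first two modes can be sketched as follows. This is a toy model, not Flume code: FlakySink, best_effort and store_on_failure are hypothetical names, and the in-memory backlog stands in for local storage. End-to-end reliability, where the sender waits for an acknowledgement from the final destination, is not modeled here.

```python
class FlakySink:
    """Hypothetical downstream node that fails for the first `failures` sends."""

    def __init__(self, failures):
        self.failures = failures
        self.delivered = []

    def send(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("downstream unavailable")
        self.delivered.append(event)


def best_effort(sink, events):
    """Send each event once; events that fail are dropped."""
    for event in events:
        try:
            sink.send(event)
        except ConnectionError:
            pass  # event is lost


def store_on_failure(sink, events, max_retries=3):
    """Send each event; keep failed events in a backlog and retry them later."""
    backlog = []
    for event in events:
        try:
            sink.send(event)
        except ConnectionError:
            backlog.append(event)
    for _ in range(max_retries):
        if not backlog:
            break
        still_failed = []
        for event in backlog:
            try:
                sink.send(event)
            except ConnectionError:
                still_failed.append(event)
        backlog = still_failed
    return backlog  # events that could not be delivered after all retries


lossy = FlakySink(failures=2)
best_effort(lossy, ["e1", "e2", "e3"])
print(lossy.delivered)  # ['e3']

durable = FlakySink(failures=2)
undelivered = store_on_failure(durable, ["e1", "e2", "e3"])
print(durable.delivered, undelivered)  # ['e3', 'e1', 'e2'] []
```

Note that the retried events arrive after later ones, so stronger delivery guarantees come at the cost of ordering and latency in this sketch.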

Flume Goals: Scalability
Horizontally Scalable Data Path

Load Balancing
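
One way to picture load balancing on the data path is to spread agents over the available collectors. The sketch below assumes a simple round-robin policy. The policy, the assign_agents function and the node names are assumptions for illustration, not a description of how Flume assigns agents.

```python
from itertools import cycle


def assign_agents(agents, collectors):
    """Assign each agent to a collector in round-robin order (assumed policy)."""
    if not collectors:
        raise ValueError("at least one collector is required")
    ring = cycle(collectors)
    return {agent: next(ring) for agent in agents}


agents = ["a1", "a2", "a3", "a4", "a5"]
print(assign_agents(agents, ["c1", "c2"]))
# Adding a collector spreads the same agents over more nodes.
print(assign_agents(agents, ["c1", "c2", "c3"]))
```

Scaling horizontally then amounts to adding a collector to the list and recomputing the assignment.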

Flume Goals: Scalability
Horizontally Scalable Control Path

Flume Goals: Extensibility
● Simple Source and Sink API
○ Event streaming and composition of simple
operations

● Plug-in Architecture
○ Add your own sources, sinks, decorators
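
A plug-in architecture of this kind can be sketched as a registry that maps names to factories. Flume itself is written in Java and has its own plug-in mechanism; the SINKS dictionary, the register_sink decorator and the two example sinks below are hypothetical and only illustrate the idea.

```python
SINKS = {}


def register_sink(name):
    """Register a sink factory under a name so configurations can refer to it."""
    def wrap(factory):
        SINKS[name] = factory
        return factory
    return wrap


@register_sink("memory")
def memory_sink(buffer):
    """Hypothetical sink: appends every event to a list."""
    return buffer.append


@register_sink("tagged")
def tagged_sink(buffer, tag="web"):
    """Hypothetical user-added sink: stores events with a tag prefix."""
    def sink(event):
        buffer.append(f"{tag}:{event}")
    return sink


plain, tagged = [], []
SINKS["memory"](plain)("event-1")
SINKS["tagged"](tagged, tag="db")("event-2")
print(plain, tagged)  # ['event-1'] ['db:event-2']
```

New sinks are added by registering another factory; the code that looks up SINKS by name does not change.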

Flume Goals: Manageability
Centralized Data Flow Management Interface

Flume Goals: Manageability
Configuring Flume

Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;

Output Bucketing
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxy.txt
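
The sample paths suggest that output is bucketed by date and hour. A minimal Python sketch of that mapping follows, assuming the layout year/monthday/hour00 read off the paths above. The bucket_dir function and its root parameter are hypothetical, and the data-xxx file names are left out because the slide does not say how they are generated.

```python
from datetime import datetime


def bucket_dir(event_time, root="/logs/web"):
    """Map an event timestamp to an hourly bucket directory."""
    return root + event_time.strftime("/%Y/%m%d/%H00")


print(bucket_dir(datetime(2010, 7, 15, 12, 34)))  # /logs/web/2010/0715/1200
print(bucket_dir(datetime(2010, 7, 15, 13, 5)))   # /logs/web/2010/0715/1300
```

Events from the same hour land in the same directory, so a downstream Hadoop job can process one hour of logs by reading one path.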

Use Case: Near Realtime Aggregator

Conclusion
Flume is:
● A distributed data collection service

● Suitable for enterprise settings

● Designed for large amounts of log data
