This document provides an overview of hooking Flume up with HBase. It discusses the Flume event model and HBase data model. It describes compelling reasons to connect Flume and HBase, such as near real-time aggregation of streaming data and low latency access to aggregated data. Configuration examples are provided for the hbase() and attr2hbase() HBase sinks. The document also discusses how to write a new Flume sink plugin and lists upcoming Flume sink types. Contact information is provided at the end.
2. Who am I?
• Big Data Ninja at Riot Games
• Flume Contributor
• Cloudera Intern Alum
• Graduated with a Master's in CS
from Georgia Tech.
3. What am I presenting here?
• Flume event model
• HBase data model
• Compelling reasons to hook ‘em up
• Configuration examples
• What are the new upcoming sinks?
• How to write a new Flume sink
4. What is needed before we start...
• Understanding of Flume’s architecture
• Usage of Flume’s abstractions such as
Plugins, Events, Sources, Sinks, Escape Sequences
and Decorators*
• Understanding of HBase and Hadoop
• Regex
• That’s it!
*http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html
6. Flume Event Model
• A Flume event has these six main fields: Unix
timestamp, Nanosecond timestamp, Priority,
Source host, Body and a Metadata table with
an arbitrary number of attribute value pairs.
• The body is the raw log entry body. The
default is to truncate the body to a maximum
of 32KB per event. This limit is configurable.
• Events can be bucketed on custom attributes with the help of
escape sequences.
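The six fields above can be pictured with a small Python sketch. This is only an illustrative stand-in, not Flume's actual (Java) event class: the field names, the millisecond unit of the timestamp, and the default values are assumptions made for the example. Only the list of fields and the 32KB default cap come from the slide.

```python
from dataclasses import dataclass, field
import time

MAX_BODY_BYTES = 32 * 1024  # default truncation limit named on the slide


@dataclass
class Event:
    """Illustrative stand-in for a Flume event (names and defaults are assumptions)."""
    body: bytes
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))
    nanos: int = field(default_factory=time.monotonic_ns)
    priority: str = "INFO"
    host: str = "localhost"
    attrs: dict = field(default_factory=dict)  # arbitrary attribute/value pairs

    def __post_init__(self):
        # Keep only the first 32KB of an oversized body.
        self.body = self.body[:MAX_BODY_BYTES]


event = Event(body=b"x" * 40_000, attrs={"service": b"web"})
print(len(event.body))  # 32768
```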
9. Reasons For HBase Sink
• Near Real-Time aggregation of Streaming Data
• Low Latency access to the aggregated data
• Offline Big Data Analytics
10. Types of Flume HBase Sink
1. hbase(): Highly expressive
hbase("table", "rowkey", "cf1", "c1", "val1"[,"cf2", "c2", "val2", ....] {,
writeBufferSize=int, writeToWal=true|false})
2. attr2hbase(): Flexible and powerful semantics
but could be confusing (at first glance)
attr2hbase("table"[,"sysFamily"[,"writeBody"[,"attrPrefix"[,"writeBufferSize"
[,"writeToWal"]]]]])
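One way to read the hbase() signature: a table, a row key, then one or more (column family, qualifier, value) triples, plus two optional settings. The Python sketch below only groups such an argument list into a Put-like dictionary. The function name, the dictionary layout, and the defaults shown for the two options are assumptions for illustration, not the sink's real implementation.

```python
def parse_hbase_args(table, rowkey, *cells, write_buffer_size=None, write_to_wal=True):
    """Group hbase(...)-style arguments into one Put-like dict (illustrative only)."""
    if len(cells) == 0 or len(cells) % 3 != 0:
        raise ValueError("expected one or more (family, qualifier, value) triples")
    columns = {}
    for i in range(0, len(cells), 3):
        family, qualifier, value = cells[i:i + 3]
        columns[f"{family}:{qualifier}"] = value
    return {
        "table": table,
        "row": rowkey,
        "columns": columns,
        "write_buffer_size": write_buffer_size,  # default here is an assumption
        "write_to_wal": write_to_wal,            # default here is an assumption
    }


put = parse_hbase_args("table", "rowkey", "cf1", "c1", "val1", "cf2", "c2", "val2")
print(put["columns"])  # {'cf1:c1': 'val1', 'cf2:c2': 'val2'}
```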
11. How to Use a Plugin?
• Compile. Add the jar with the new plugin
classes to Flume’s classpath.
• In flume-site.xml, add the class names of the
new sources, sinks, and/or decorators to the
flume.plugin.classes property
• Restart the Flume nodes (including the Master)
• To verify that your plugin is loaded, check that
it is displayed on this page:
http://flume-master:35871/masterext.jsp
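As a rough illustration of the flume-site.xml step, the Python sketch below prints a property element for flume.plugin.classes. The class names are hypothetical placeholders, and the comma-separated value format is an assumption worth checking against the Flume User Guide for your version.

```python
import xml.etree.ElementTree as ET


def plugin_property(class_names):
    """Build a <property> element listing plugin classes (comma-separated: an assumption)."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "flume.plugin.classes"
    ET.SubElement(prop, "value").text = ",".join(class_names)
    return ET.tostring(prop, encoding="unicode")


# com.example.* are hypothetical class names used only for this example.
snippet = plugin_property(["com.example.MySink", "com.example.MyDecorator"])
print(snippet)
```

The printed element would go inside the configuration element of flume-site.xml.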
16. attr2hbase()
• You don’t have to list all the possible event attributes
you want to store in HBase along with their
destination column families and qualifiers
• Sources and/or decorators can produce any
(reasonable) number of attributes with
dynamic names (e.g. depending on the values),
and they will all be written into HBase
17. attr2hbase
• attr2hbase("table"[,"sysFamily"[,"writeBody"[,
"attrPrefix"[,"writeBufferSize"
[,"writeToWal"]]]]])
• sysFamily holds the name of the column
family that is used to store “system” data
(event timestamp, host, priority).
• In case this parameter is absent or equals “”,
the sink doesn’t write “system” data
18. attr2hbase
• writeBody indicates whether the event body
should be written along with the other “system” data.
By default (when this parameter is absent or
equals “”), the event body is not written.
• This parameter should have the
“column-family:qualifier” format in order for the sink to
write the body to that specific
column-family:qualifier.
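The sysFamily and writeBody rules can be sketched in Python as follows. This is a simulation of the behaviour described on the slides, not the sink's code. The qualifier names "timestamp", "host" and "priority" and the dict-based event are assumptions, as is treating writeBody independently of sysFamily.

```python
def system_cells(event, sys_family="", write_body=""):
    """Return the 'system' cells that would be written for one event (illustrative)."""
    cells = {}
    if sys_family:
        # Qualifier names below are assumed for the example.
        for name in ("timestamp", "host", "priority"):
            cells[f"{sys_family}:{name}"] = event[name]
    if write_body:
        family, sep, qualifier = write_body.partition(":")
        if not (family and sep and qualifier):
            raise ValueError("writeBody must look like 'column-family:qualifier'")
        cells[f"{family}:{qualifier}"] = event["body"]
    return cells


event = {"timestamp": 1313244007, "host": "node1", "priority": "INFO", "body": b"pgpgin 985543"}
print(system_cells(event))                               # {} : nothing written by default
print(system_cells(event, "system", "body:contents"))
```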
19. attr2hbase
• attrPrefix defines which attributes will be written to HBase:
every attribute with the name prefixed with attrPrefix
parameter’s value is written. The attribute key should be in
the following format to be properly written into HBase:
“<attrPrefix><colfam>:<qual>”
• The default value of attrPrefix is “2hb_”. This means that all
attributes with names “2hb_<colfam>:<qual>” are
written to HBase.
• The attribute with key “<attrPrefix>” must contain the row key for
the Put. If no row key can be extracted, the event is
skipped and no record is written to the HBase table.
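The attrPrefix rules above can be sketched in Python like this. It is a simulation under stated assumptions, not the sink's implementation: the function name and return shape are made up for the example, and silently ignoring prefixed keys without a ":" is an assumption.

```python
def attrs_to_put(attrs, attr_prefix="2hb_"):
    """Map event attributes to a Put-like dict, or None if the event is skipped."""
    row = attrs.get(attr_prefix)
    if not row:
        return None  # no row key: the event is skipped
    columns = {}
    for key, value in attrs.items():
        if key == attr_prefix or not key.startswith(attr_prefix):
            continue  # the row-key attribute itself, or an unrelated attribute
        family, sep, qualifier = key[len(attr_prefix):].partition(":")
        if family and sep:  # malformed keys are ignored here (assumption)
            columns[(family, qualifier)] = value
    return {"row": row, "columns": columns}


attrs = {"2hb_": "pgpgin1313244007", "2hb_stat:value": "985543", "other": "x"}
print(attrs_to_put(attrs))
# {'row': 'pgpgin1313244007', 'columns': {('stat', 'value'): '985543'}}
```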
20. attr2hbase example
• node: tail("/proc/vmstat") | regexAll("(\\w+)\\s+(\\w+)",
"colname","value") value("2hb_","%{colname}%s", escape=true)
value("2hb_stat:value", "%{value}", escape=true)
attr2hbase("table-attr2hbase","system","body:contents")]
Rowkey            Timestamp  Column Family: stat
pgpgin1313244007  t1         value=985543
pgpgin1313244008  t2         value=985543
pgpgin1313244009  t3         value=985543
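To see how the row keys in this example arise, here is a Python simulation of the decorator chain for a single /proc/vmstat line. It is a sketch, not Flume code: the helper names are made up, and the escape expansion covers only %{attr} and %s (read here as Unix seconds, which matches the row keys shown).

```python
import re


def expand_escapes(template, attrs, epoch_seconds):
    """Expand %{name} attribute references and %s (Unix seconds) in one pass."""
    def repl(match):
        name = match.group(1)
        return attrs[name] if name else str(epoch_seconds)
    return re.sub(r"%\{(\w+)\}|%s", repl, template)


def vmstat_line_to_attrs(line, epoch_seconds):
    """Mimic regexAll(...) followed by the two value(...) decorators."""
    match = re.search(r"(\w+)\s+(\w+)", line)
    if match is None:
        return None
    attrs = {"colname": match.group(1), "value": match.group(2)}
    attrs["2hb_"] = expand_escapes("%{colname}%s", attrs, epoch_seconds)       # row key
    attrs["2hb_stat:value"] = expand_escapes("%{value}", attrs, epoch_seconds)  # stat:value cell
    return attrs


attrs = vmstat_line_to_attrs("pgpgin 985543", 1313244007)
print(attrs["2hb_"], attrs["2hb_stat:value"])  # pgpgin1313244007 985543
```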
22. What are the New Plugins?
• https://cwiki.apache.org/FLUME/flume-plugins.html
• I pushed the OpenTSDB sink just a few weeks back
23. How to Contribute a New Plugin?
• Extend EventSink.Base
• Override open(): set up your connections
to the store
• Override append(): every new event gets
processed here, doing the “Puts” into the store
• Override close(): Yay! Clean up the
connections and flush pending writes to the store.
• Implement a SinkBuilder builder()
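The open/append/close lifecycle above can be sketched in Python. A real plugin is a Java class extending EventSink.Base, so this is only an analogue of the control flow: the class name, the list used as a "store", and the buffer size are all hypothetical.

```python
class BufferedStoreSink:
    """Python analogue of a sink's open/append/close lifecycle (not Flume's Java API)."""

    def __init__(self, store, buffer_size=2):
        self.store = store            # hypothetical store: a plain list
        self.buffer_size = buffer_size
        self.buffer = []
        self.is_open = False

    def open(self):
        # A real sink would set up its connection to the store here.
        self.is_open = True

    def append(self, event):
        if not self.is_open:
            raise RuntimeError("append() called before open()")
        self.buffer.append(event)
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def close(self):
        # Flush whatever is still buffered, then release the connection.
        self._flush()
        self.is_open = False

    def _flush(self):
        self.store.extend(self.buffer)
        self.buffer.clear()


store = []
sink = BufferedStoreSink(store, buffer_size=2)
sink.open()
for e in ("e1", "e2", "e3"):
    sink.append(e)
pending = list(store)  # only the first full buffer has reached the store
sink.close()
print(pending, store)  # ['e1', 'e2'] ['e1', 'e2', 'e3']
```

Buffering in append() and flushing in close() mirrors the role that writeBufferSize plays in the HBase sinks: fewer round trips to the store, at the cost of holding events in memory until the flush.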
25. GOOD LUCK,
HAVE FUN!
Play Free!
http://www.leagueoflegends.com/
Editor's notes
----- Meeting Notes (8/17/11 16:51) ----- Good evening, gentlemen. I'm Dani.
----- Meeting Notes (8/17/11 17:01) ----- Let's see how to hook up these guys, Flume and HBase.
----- Meeting Notes (8/17/11 16:51) ----- Just a brief background. Several patches to Flume: 1. Flogger 2. A few things in the HBase sink 3. The recently contributed OpenTSDB sink
----- Meeting Notes (8/17/11 16:51) ----- My assumption is that folks here know what Flume does and what HBase does. So focusing on
----- Meeting Notes (8/17/11 17:08) ----- If anyone hasn't used Flume or HBase, let me know.
----- Meeting Notes (8/17/11 17:08) ----- I can take more questions at the end of the presentation.
----- Meeting Notes (8/17/11 17:11) ----- Check out the Flume User Guide.
----- Meeting Notes (8/17/11 17:14) ----- HBase is integrated with Hive and MR.
----- Meeting Notes (8/17/11 17:14) ----- For those who haven't used it: just think about it as "which of the overloaded functions" Flume has to use. You can change the parameters at run time.