Flume HBase

Hooking up Flume with HBase
LA-HUG Aug’11

-Dani Abel Rayan

Who am I?
• Big Data Ninja at Riot Games
• Flume Contributor
• Cloudera Intern Alum
• Graduated with a Master’s in CS from Georgia Tech

What am I presenting here?
• Flume event model
• HBase data model
• Compelling reasons to hook ‘em up
• Configuration examples
• What are the upcoming new sinks?
• How to write a new Flume sink

What is needed before we start ..
• Understanding of Flume’s architecture
• Usage of Flume’s abstractions such as
Plugins, Events, Sources, Sinks, Escape Sequences
and Decorators*
• Understanding of HBase and Hadoop
• Regex
• That’s it!
*http://archive.cloudera.com/cdh/3/flume/UserGuide/index.html

Flume Event Model
• A Flume event has six main fields: Unix
timestamp, nanosecond timestamp, priority,
source host, body, and a metadata table with
an arbitrary number of attribute-value pairs.
• The body is the raw log entry. By default it is
truncated to a maximum of 32KB per event;
this limit is configurable.
• Events can be bucketed on custom attributes
with the help of escape sequences.
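
To make the event model concrete, here is a minimal Python sketch of the six fields, the 32KB body truncation, and attribute-based bucketing. It is only an illustration: the `Event` class, `make_event`, and `expand` are hypothetical names, not Flume's Java classes, and only the `%{attr}` escape form is handled. Replacing an unknown attribute with an empty string is an assumption.

```python
import re
from dataclasses import dataclass, field

MAX_BODY = 32 * 1024  # default truncation limit mentioned on the slide


@dataclass
class Event:
    """Illustrative stand-in for a Flume event (not Flume's Java class)."""
    timestamp: int   # Unix timestamp
    nanos: int       # nanosecond timestamp
    priority: str
    host: str        # source host
    body: bytes      # raw log entry
    attrs: dict = field(default_factory=dict)  # metadata table


def make_event(timestamp, nanos, priority, host, body, attrs=None):
    # Truncate the body to the configured maximum before building the event.
    return Event(timestamp, nanos, priority, host, body[:MAX_BODY], dict(attrs or {}))


def expand(template, event):
    """Replace each %{name} with the matching attribute value.

    Assumption: unknown attributes expand to an empty string.
    """
    return re.sub(r"%\{(\w+)\}",
                  lambda m: str(event.attrs.get(m.group(1), "")),
                  template)


event = make_event(1313244007, 0, "INFO", "web01", b"x" * 40000,
                   {"colname": "pgpgin"})
bucket = expand("vmstat/%{colname}", event)
```

Here `bucket` is the kind of per-attribute destination name that escape sequences make possible.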

Reasons For HBase Sink
• Near Real-Time aggregation of Streaming Data
• Low Latency access to the aggregated data
• Offline Big Data Analytics

Types of Flume HBase Sink
1. hbase(): Highly expressive
hbase("table", "rowkey", "cf1", "c1", "val1"[,"cf2", "c2", "val2", ....] {,
writeBufferSize=int, writeToWal=true|false})

2. attr2hbase(): Flexible and powerful semantics
but could be confusing (at first glance)
attr2hbase("table"[,"sysFamily"[,"writeBody"[,"attrPrefix"[,"writeBufferSize"
[,"writeToWal"]]]]])

How to Use a Plugin?
• Compile the plugin and add the jar with the new
plugin classes to Flume’s classpath.
• In flume-site.xml, add the class names of the
new sources, sinks, and/or decorators to the
flume.plugin.classes property.
• Restart the Flume nodes (including the Master).
• To verify that your plugin is loaded, check that
it is listed on this page:
http://flume-master:35871/masterext.jsp

hbase()
Source: tail("/proc/vmstat")

nr_free_pages 594693
nr_inactive_anon 1392
nr_active_anon 45259
nr_inactive_file 107132
nr_active_file 141458

Sink:
regexAll("(\\w+)\\s+(\\w+)", "colname", "value")

Flume Events

timestamp   colname            value
24353457    nr_active_anon     45259
24353456    nr_inactive_anon   1392
24353455    nr_free_pages      594693
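
The extraction step can be imitated outside Flume with a few lines of Python. This sketch only mimics what the regexAll decorator is described as doing here (capture groups become named attributes); `regex_all` is a hypothetical helper, and using a search rather than a full match is an assumption.

```python
import re

VMSTAT_LINES = [
    "nr_free_pages 594693",
    "nr_inactive_anon 1392",
    "nr_active_anon 45259",
]


def regex_all(pattern, names, body):
    """Map the capture groups of the first match onto attribute names.

    Returns an empty dict when the line does not match.
    """
    match = re.search(pattern, body)
    if match is None:
        return {}
    return dict(zip(names, match.groups()))


# One attribute dict per tailed line, as in the "Flume Events" table.
attrs_per_event = [
    regex_all(r"(\w+)\s+(\w+)", ["colname", "value"], line)
    for line in VMSTAT_LINES
]
```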

hbase()
• hbase("tablename", "%s", "stats", "%{colname}", "%{value}")
Use %{nanos} instead of %s if you want a nanosecond timestamp.

Rowkey     Timestamp   Column Family: stats
24353455   T1          nr_free_pages = 594693
24353456   T2          nr_inactive_anon = 1392
24353457   T3          nr_active_anon = 45259
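
The mapping from events to the rows above can be sketched in Python. This is a simulation under stated assumptions, not the sink's code: `expand` and `to_put` are hypothetical helpers, `%s` is treated as the event timestamp exactly as the row keys in the table suggest, and the column family is taken literally.

```python
import re


def expand(template, timestamp, attrs):
    """Expand %s (event timestamp) and %{name} (event attribute)."""
    text = template.replace("%s", str(timestamp))
    return re.sub(r"%\{(\w+)\}", lambda m: attrs.get(m.group(1), ""), text)


def to_put(timestamp, attrs, rowkey, family, qualifier, value):
    """Build a dict describing one cell to write (stand-in for an HBase Put)."""
    return {
        "row": expand(rowkey, timestamp, attrs),
        "family": family,
        "qualifier": expand(qualifier, timestamp, attrs),
        "value": expand(value, timestamp, attrs),
    }


events = [
    (24353455, {"colname": "nr_free_pages", "value": "594693"}),
    (24353456, {"colname": "nr_inactive_anon", "value": "1392"}),
    (24353457, {"colname": "nr_active_anon", "value": "45259"}),
]

# Same arguments as hbase("tablename", "%s", "stats", "%{colname}", "%{value}")
puts = [to_put(ts, attrs, "%s", "stats", "%{colname}", "%{value}")
        for ts, attrs in events]
```

Each entry in `puts` corresponds to one row of the table: the timestamp becomes the row key and the extracted name/value pair becomes a column in the stats family.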

hbase()
• Thus the FDL syntax would be:

• node: tail("/proc/vmstat") |
regexAll("(\\w+)\\s+(\\w+)", "colname", "value")
collector(300000) { hbase("table", "%s", "stats",
"%{colname}", "%{value}") }

attr2hbase()
• You don’t have to list every event attribute
you want to store in HBase along with its
destination column family and qualifier

• Sources and/or decorators can produce any
(reasonable) number of attributes, with
dynamic names (e.g. depending on the values),
and they will all be written into HBase

attr2hbase
• attr2hbase("table"[,"sysFamily"[,"writeBody"[,
"attrPrefix"[,"writeBufferSize"
[,"writeToWal"]]]]])
• sysFamily holds the name of the column
family that is used to store “system” data
(event timestamp, host, priority).
• In case this parameter is absent or equals “”,
the sink doesn’t write “system” data

attr2hbase
• writeBody indicates whether the event body
should be written along with the other “system” data.
By default (when this parameter is absent or
equals "") the body is not written.
• This parameter should have the
“column-family:qualifier” format so that the sink
writes the body to that specific column family
and qualifier.

attr2hbase
• attrPrefix defines which attributes will be written to HBase:
every attribute with the name prefixed with attrPrefix
parameter’s value is written. The attribute key should be in
the following format to be properly written into HBase:
“<attrPrefix><colfam>:<qual>”
• The default value of attrPrefix is “2hb_”. This means that all
attributes with names “2hb_<colfam>:<qual>” should be
written to HBase.
• The attribute with key “<attrPrefix>” must contain the row key
for the Put. If no row key can be extracted, the event is
skipped and no record is written to the HBase table.
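
The attribute-selection rules above can be summarized in a short Python sketch. It follows only what these slides state (prefix match, row key under the bare prefix, `<colfam>:<qual>` naming, skip when there is no row key). `attrs_to_put` is a hypothetical helper, not the sink's code, and silently ignoring attributes whose name lacks a `:` is an assumption.

```python
def attrs_to_put(attrs, attr_prefix="2hb_"):
    """Turn event attributes into a row key plus cells, or None if skipped."""
    row = attrs.get(attr_prefix)
    if not row:
        return None  # no row key: the event is skipped

    cells = {}
    for key, value in attrs.items():
        if key == attr_prefix or not key.startswith(attr_prefix):
            continue  # the row-key attribute itself, or an unrelated attribute
        colfam, sep, qual = key[len(attr_prefix):].partition(":")
        if not sep or not colfam:
            continue  # assumption: malformed names are ignored
        cells[(colfam, qual)] = value
    return {"row": row, "cells": cells}


put = attrs_to_put({
    "2hb_": "pgpgin1313244007",
    "2hb_stat:value": "985543",
    "colname": "pgpgin",   # no prefix, so not written
})
skipped = attrs_to_put({"2hb_stat:value": "985543"})
```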

attr2hbase example
• node: tail("/proc/vmstat") | regexAll("(\\w+)\\s+(\\w+)",
"colname", "value") value("2hb_", "%{colname}%s", escape=true)
value("2hb_stat:value", "%{value}", escape=true)
attr2hbase("table-attr2hbase", "system", "body:contents")

Rowkey             Timestamp   Column Family: stat
pgpgin1313244007   t1          value=985543
pgpgin1313244008   t2          value=985543
pgpgin1313244009   t3          value=985543

What are the New Plugins?
• https://cwiki.apache.org/FLUME/flume-plugins.html

• I pushed an OpenTSDB sink just a few weeks back

How to Contribute a New Plugin?
• Extend EventSink.Base
• Override open(): set up your connections
to the store
• Override append(): every new event gets
processed here; this is where the “Puts” into the store happen
• Override close(): Yay! Flush pending writes and
clean up the connections to the store.
• Implement a SinkBuilder builder()
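
A real plugin is a Java class extending EventSink.Base, but the open/append/close lifecycle itself can be shown with a small language-neutral sketch in Python. Everything here is hypothetical: `BufferingSink` writes to a plain list standing in for the store, and the write buffer is an assumption loosely modeled on the writeBufferSize option mentioned earlier.

```python
class BufferingSink:
    """Toy sink showing the open / append / close lifecycle."""

    def __init__(self, store, buffer_size=2):
        self.store = store              # stand-in for the HBase table
        self.buffer_size = buffer_size
        self._buffer = []
        self._is_open = False

    def open(self):
        # A real sink would set up its connection to the store here.
        self._buffer = []
        self._is_open = True

    def append(self, event):
        # Every new event passes through here; writes are batched.
        if not self._is_open:
            raise RuntimeError("append() called before open()")
        self._buffer.append(event)
        if len(self._buffer) >= self.buffer_size:
            self._flush()

    def close(self):
        # Flush whatever is still buffered, then release resources.
        self._flush()
        self._is_open = False

    def _flush(self):
        self.store.extend(self._buffer)
        self._buffer = []


store = []
sink = BufferingSink(store, buffer_size=2)
sink.open()
for event in ["e1", "e2", "e3"]:
    sink.append(event)
written_before_close = len(store)
sink.close()
written_after_close = len(store)
```

The point of the sketch is the ordering: two events are written when the buffer fills, and the third only reaches the store when close() flushes it.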

My Contacts
• drayan@riotgames.com
• dr@verticalengine.com
• Twitter: rayanandi

P.S. We are Hiring!

GOOD LUCK,
HAVE FUN!
Play Free!
http://www.leagueoflegends.com/
