1. Building Hadoop Data Applications with Kite
Tom White @tom_e_white
Hadoop Users Group UK, London
17 June 2014
2. About me
• Engineer at Cloudera working on Core Hadoop and Kite
• Apache Hadoop Committer, PMC Member, Apache Member
• Author of “Hadoop: The Definitive Guide”
7. Glossary
• Apache Avro – cross-language data serialization library
• Apache Parquet (incubating) – column-oriented storage format for nested data
• Apache Hive – data warehouse (SQL and metastore)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Impala – interactive SQL on Hadoop
8. Outline
• A Typical Application
• Kite SDK
• An Example
• Advanced Kite
14. Kite
• A client-side library for writing Hadoop Data Applications
• First release was in April 2013 as CDK
• 0.14.1 last month
• Open source, Apache 2 license, kitesdk.org
• Modular
  • Data module (HDFS, Flume, Crunch, Hive, HBase)
  • Morphlines transformation module
  • Maven plugin
16. Kite Data Module
• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset
• http://kitesdk.org/docs/current/apidocs/index.html
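The write/read pattern these interfaces imply can be sketched in plain Java. The `InMemoryDataset` class below is a hypothetical in-memory stand-in (not a Kite class) so the streaming flow is runnable without a cluster; real code would obtain a `DatasetWriter`/`DatasetReader` from a Kite `Dataset`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal in-memory stand-in for Kite's streaming write/read pattern.
// Illustrative only: real code uses DatasetWriter/DatasetReader from a
// Dataset backed by HDFS or HBase.
class InMemoryDataset<E> {
  private final List<E> entities = new ArrayList<>();

  /** Streaming write: entities are appended one at a time. */
  void write(E entity) {
    entities.add(entity);
  }

  /** Streaming read: entities come back in write order. */
  Iterator<E> read() {
    return entities.iterator();
  }
}
```

The point of the abstraction is that application code only ever sees this write/read contract; the storage format and location live in the DatasetDescriptor.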
17. 1. Define the Event Entity

public class Event {
  private long id;
  private long timestamp;
  private String source;
  // getters and setters
}
18. 2. Create the Events Dataset

DatasetRepository repo =
    DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor =
    new DatasetDescriptor.Builder()
        .schema(Event.class).build();
repo.create("events", descriptor);
19. (2. or with the Maven plugin)

$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
29. Unified Storage Interface
• Dataset – streaming access, HDFS storage
• RandomAccessDataset – random access, HBase storage
• PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
30. Filesystem Partitions

PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp").build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796
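The year/month/day strategy above derives each entity's partition directory from its `timestamp` field. A plain-Java sketch of that mapping (the `PartitionPath` helper is hypothetical; the directory layout is the one shown on the slide, assuming UTC):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper showing how a year/month/day PartitionStrategy
// maps an entity's epoch-millis timestamp to its HDFS partition path.
class PartitionPath {
  static String forTimestamp(long epochMillis) {
    ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
    return String.format("/year=%04d/month=%02d/day=%02d",
        t.getYear(), t.getMonthValue(), t.getDayOfMonth());
  }
}
```

Because the partition values are derived from the entity itself, writers need no out-of-band routing logic, and Hive/Impala can prune partitions on `year`, `month`, and `day` predicates.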
35. Parallel Processing
• Goal is for Hadoop processing frameworks to “just work”
• Support Formats, Partitions, Views
• Native Kite components, e.g. DatasetOutputFormat for MR

            HDFS Dataset   HBase Dataset
Crunch      Yes            Yes
MapReduce   Yes            Yes
Hive        Yes            Planned
Impala      Yes            Planned
36. Schema Evolution

public class Event {
  private long id;
  private long timestamp;
  private String source;
  @Nullable private String ipAddress;
}

$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event
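The new field is marked @Nullable because records written before the update carry no ipAddress value, so readers using the new schema must resolve it to null rather than fail. A plain-Java sketch of that resolution rule (this mimics Avro's schema-resolution behavior; it is not Avro code, and `EventRecord` is illustrative only):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates why the added field must be nullable: old records simply
// lack it, and new-schema readers resolve the missing value to null.
class EventRecord {
  private final Map<String, Object> fields = new HashMap<>();

  void set(String name, Object value) {
    fields.put(name, value);
  }

  /** Fields absent from records written with the old schema read as null. */
  String getIpAddress() {
    return (String) fields.getOrDefault("ipAddress", null);
  }
}
```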
37. Searchable Datasets
• Use Flume Solr Sink (in addition to HDFS Sink)
• Morphlines library to define fields to index
• SolrCloud runs on the cluster, serving indexes stored in HDFS
• Future support in Kite to index selected fields automatically
39. Kite makes it easy to get data into Hadoop, with a flexible schema model that is storage agnostic, in a format that can be processed with a wide range of Hadoop tools
40. Getting Started With Kite
• Examples at github.com/kite-sdk/kite-examples
  • Working with streaming and random-access datasets
  • Logging events to datasets from a webapp
  • Running a periodic job
  • Migrating data from CSV to a Kite dataset
  • Converting an Avro dataset to a Parquet dataset
  • Writing and configuring Morphlines
  • Using Morphlines to write JSON records to a dataset
43. Applications
• [Batch] Analyze an archive of songs [1]
• [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
• [Search] Searching email traffic in near-real-time [3]
• [ML] Detecting fraudulent transactions using clustering [4]

[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
44. … or use JDBC

Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");
45. Apps
• App – a packaged Java program that runs on a Hadoop cluster
• cdk:package-app – create a package on the local filesystem
  • like an exploded WAR
  • Oozie format
• cdk:deploy-app – copy packaged app to HDFS
• cdk:run-app – execute the app
• Workflow app – runs once
• Coordinator app – runs other apps (like cron)
46. Morphlines Example

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri: 164
syslog_timestamp: Feb 4 10:46:14
syslog_hostname: syslog
syslog_program: sshd
syslog_pid: 607
syslog_message: listening on 0.0.0.0 port 22
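Under the hood, the grok expression is a regular expression with named captures: each %{PATTERN:name} macro expands to a regex fragment capturing into that name. A plain-Java sketch of the same syslog extraction, with the macros expanded by hand (the expansions are simplified assumptions, not the official grok dictionary):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-expanded equivalent of the grok expression above, using Java's
// named capturing groups in place of %{PATTERN:name} macros.
class SyslogParser {
  private static final Pattern SYSLOG = Pattern.compile(
      "<(?<pri>\\d+)>"                                  // %{POSINT:syslog_pri}
      + "(?<timestamp>\\w{3} +\\d+ \\d{2}:\\d{2}:\\d{2}) " // %{SYSLOGTIMESTAMP}
      + "(?<hostname>\\S+) "                            // %{SYSLOGHOST}
      + "(?<program>[^\\[:]+)(?:\\[(?<pid>\\d+)\\])?: " // %{DATA} + optional pid
      + "(?<message>.*)");                              // %{GREEDYDATA}

  static Matcher parse(String line) {
    Matcher m = SYSLOG.matcher(line);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a syslog line: " + line);
    }
    return m;
  }
}
```

The grok command produces one output field per named capture, which is what the loadSolr command then indexes.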