Building Hadoop Data Applications with Kite

Tom White
@tom_e_white

Hadoop Users Group UK, London
17 June 2014

About me

• Engineer at Cloudera working on Core Hadoop and Kite
• Apache Hadoop Committer, PMC Member, Apache Member
• Author of “Hadoop: The Definitive Guide”

Hadoop 0.1

% cat bigdata.txt | hadoop fs -put - in
% hadoop MyJob in out
% hadoop fs -get out

Characteristics

• Batch applications only
• Low-level coding
• File format
• Serialization
• Partitioning scheme

A Hadoop stack

Common Data, Many Tools

# tools >> # file formats >> # file systems

Glossary

• Apache Avro – cross-language data serialization library
• Apache Parquet (incubating) – column-oriented storage format for nested data
• Apache Hive – data warehouse (SQL and metastore)
• Apache Flume – streaming log capture and delivery system
• Apache Oozie – workflow scheduler system
• Apache Crunch – Java API for writing data pipelines
• Impala – interactive SQL on Hadoop

Outline

• A Typical Application
• Kite SDK
• An Example
• Advanced Kite

A typical application (zoom 100:1)

A typical application (zoom 10:1)

A typical pipeline (zoom 5:1)

Kite SDK

Kite Codifies Best Practice as APIs, Tools, Docs and Examples

Kite

• A client-side library for writing Hadoop Data Applications
• First release was in April 2013 as CDK
• 0.14.1 last month
• Open source, Apache 2 license, kitesdk.org
• Modular
  • Data module (HDFS, Flume, Crunch, Hive, HBase)
  • Morphlines transformation module
  • Maven plugin

An Example

Kite Data Module

• Dataset – a collection of entities
• DatasetRepository – physical storage location for datasets
• DatasetDescriptor – holds dataset metadata (schema, format)
• DatasetWriter – write entities to a dataset in a stream
• DatasetReader – read entities from a dataset
• http://kitesdk.org/docs/current/apidocs/index.html

1. Define the Event Entity

public class Event {
    private long id;
    private long timestamp;
    private String source;
    // getters and setters
}

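The getters and setters elided on the slide could be filled in as plain JavaBean accessors. The following is only a sketch: the accessor names are assumed from the setId, setTimestamp and setSource calls used when events are written in step 3, and nothing here depends on Kite itself.

```java
// Sketch of the Event entity with the elided accessors written out.
// Accessor names are an assumption based on the setter calls in step 3.
public class Event {
    private long id;
    private long timestamp;
    private String source;

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }

    public long getTimestamp() { return timestamp; }
    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }

    public String getSource() { return source; }
    public void setSource(String source) { this.source = source; }
}
```

A public no-argument constructor (here the implicit default one) and plain fields keep the class friendly to reflection-based schema generation.
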
2. Create the Events Dataset

DatasetRepository repo =
    DatasetRepositories.open("repo:hive");
DatasetDescriptor descriptor =
    new DatasetDescriptor.Builder()
        .schema(Event.class).build();
repo.create("events", descriptor);

(2. or with the Maven plugin)

$ mvn kite:create-dataset \
    -Dkite.repositoryUri='repo:hive' \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event

A peek at the Avro schema

$ hive -e "DESCRIBE EXTENDED events"
...
{
  "type" : "record",
  "name" : "Event",
  "namespace" : "com.example",
  "fields" : [
    { "name" : "id", "type" : "long" },
    { "name" : "timestamp", "type" : "long" },
    { "name" : "source", "type" : "string" }
  ]
}

3. Write Events

Logger logger = Logger.getLogger(...);
Event event = new Event();
event.setId(id);
event.setTimestamp(System.currentTimeMillis());
event.setSource(source);
logger.info(event);

Log4j configuration

log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41415
log4j.appender.flume.DatasetRepositoryUri = repo:hive
log4j.appender.flume.DatasetName = events

The resulting file layout

/user
  /hive
    /warehouse
      /events
        /FlumeData.1375659013795
        /FlumeData.1375659013796

(The FlumeData files are Avro files.)

4. Generate Summaries with Crunch

PCollection<Event> events =
    read(asSource(repo.load("events"), Event.class));
PCollection<Summary> summaries = events
    .by(new GetTimeBucket(), // minute of day, source
        Avros.pairs(Avros.longs(), Avros.strings()))
    .groupByKey()
    .parallelDo(new MakeSummary(),
        Avros.reflects(Summary.class));
write(summaries, asTarget(repo.load("summaries")));

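To make the shape of this pipeline concrete, here is a small in-memory sketch of the same idea in plain Java, without Crunch. The slide does not show GetTimeBucket or MakeSummary, so the details are assumptions: the bucket is taken to be the UTC minute of day plus the source, and the summary is taken to be a simple count per bucket. SummarySketch and its nested Event stand-in are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of "key by (minute of day, source), group, summarize".
// Assumption: the summary is a count per bucket; the real MakeSummary is not shown.
public class SummarySketch {
    /** Minimal stand-in for the Event entity (only the fields used here). */
    static final class Event {
        final long timestamp;
        final String source;

        Event(long timestamp, String source) {
            this.timestamp = timestamp;
            this.source = source;
        }
    }

    /** Minute of the day (0..1439) in UTC for an epoch-millisecond timestamp. */
    static long minuteOfDay(long timestampMillis) {
        return Math.floorMod(timestampMillis / 60_000L, 1_440L);
    }

    /** Groups events by "minute|source" and counts the events in each group. */
    static Map<String, Long> summarize(List<Event> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event e : events) {
            String key = minuteOfDay(e.timestamp) + "|" + e.source;
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event(0L, "a"));
        events.add(new Event(30_000L, "a"));
        events.add(new Event(60_000L, "a"));
        events.add(new Event(60_000L, "b"));
        System.out.println(summarize(events)); // {0|a=2, 1|a=1, 1|b=1}
    }
}
```

In the Crunch version the grouping runs in parallel across the cluster; the sketch only shows what each bucket ends up containing.
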
… and run using Maven

$ mvn kite:create-dataset -Dkite.datasetName=summaries ...

<plugin>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-maven-plugin</artifactId>
  <configuration>
    <toolClass>com.example.GenerateSummaries</toolClass>
  </configuration>
</plugin>

$ mvn kite:run-tool

5. Query with Impala

$ impala-shell -q 'DESCRIBE events'
+-----------+--------+-------------------+
| name      | type   | comment           |
+-----------+--------+-------------------+
| id        | bigint | from deserializer |
| timestamp | bigint | from deserializer |
| source    | string | from deserializer |
+-----------+--------+-------------------+

… Ad Hoc Queries

$ impala-shell -q 'SELECT source, COUNT(1) AS cnt
FROM events GROUP BY source'
+--------------------------------------+-----+
| source                               | cnt |
+--------------------------------------+-----+
| 018dc1b6-e6b0-489e-bce3-115917e00632 | 38  |
| bc80040e-09d1-4ad2-8bd8-82afd1b8431a | 85  |
+--------------------------------------+-----+
Returned 2 row(s) in 0.56s

Advanced Kite

Unified Storage Interface

• Dataset – streaming access, HDFS storage
• RandomAccessDataset – random access, HBase storage
• PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase

Filesystem Partitions

PartitionStrategy p = new PartitionStrategy.Builder()
    .year("timestamp")
    .month("timestamp")
    .day("timestamp").build();

/user/hive/warehouse/events
  /year=2014/month=02/day=08
    /FlumeData.1375659013795
    /FlumeData.1375659013796

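As a rough illustration of how a year/month/day strategy maps an entity's timestamp field to a directory, the sketch below derives the partition path in plain Java. It is not Kite's implementation: PartitionPathSketch and partitionPath are hypothetical names, and the use of UTC and of zero-padded two-digit months and days is an assumption based on the layout shown on the slide.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Derives a year=/month=/day= partition path from an epoch-millisecond timestamp.
// Assumption: partitions are computed in UTC and zero-padded as on the slide.
public class PartitionPathSketch {
    static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        // 1391817600000 ms is 2014-02-08T00:00:00Z
        System.out.println(partitionPath(1391817600000L)); // year=2014/month=02/day=08
    }
}
```

Because the path is a pure function of the entity's field, query engines can skip whole directories when a filter constrains the date.
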
HBase Keys: Defined in Avro

{
  "name": "username",
  "type": "string",
  "mapping": { "type": "key", "value": "0" }
},
{
  "name": "favoriteColor",
  "type": "string",
  "mapping": { "type": "column", "value": "meta:fc" }
}

Random Access Dataset: Creation

RandomAccessDatasetRepository repo =
    DatasetRepositories.openRandomAccess(
        "repo:hbase:localhost");
RandomAccessDataset<User> users = repo.load("users");
users.put(new User("bill", "green"));
users.put(new User("alice", "blue"));

Random Access Dataset: Retrieval

Key key = new Key.Builder(users)
    .add("username", "bill").build();
User bill = users.get(key);

Views

View<User> view = users.from("username", "bill");
DatasetReader<User> reader = view.newReader();
reader.open();
for (User user : reader) {
    System.out.println(user);
}
reader.close();

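One way to picture a view is as a range over a sorted key space. The sketch below is an in-memory analogy using a TreeMap, not Kite code, and it rests on an assumption: that `from("username", "bill")` denotes an inclusive lower bound on the key field. ViewSketch and the "carol" entry are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory analogy for a key-range view over a sorted dataset.
// Assumption: from(field, value) is treated as an inclusive lower bound.
public class ViewSketch {
    /** Returns the values whose key is greater than or equal to the given username. */
    static List<String> from(NavigableMap<String, String> users, String username) {
        return new ArrayList<>(users.tailMap(username, true).values());
    }

    public static void main(String[] args) {
        NavigableMap<String, String> users = new TreeMap<>();
        users.put("bill", "green");
        users.put("alice", "blue");
        users.put("carol", "red"); // hypothetical extra user

        // Keys are sorted, so the range starting at "bill" skips "alice".
        System.out.println(from(users, "bill")); // [green, red]
    }
}
```
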
Parallel Processing

• Goal is for Hadoop processing frameworks to “just work”
• Support Formats, Partitions, Views
• Native Kite components, e.g. DatasetOutputFormat for MR

            HDFS Dataset   HBase Dataset
Crunch      Yes            Yes
MapReduce   Yes            Yes
Hive        Yes            Planned
Impala      Yes            Planned

Schema Evolution

public class Event {
    private long id;
    private long timestamp;
    private String source;
    @Nullable private String ipAddress;
}

$ mvn kite:update-dataset \
    -Dkite.datasetName=events \
    -Dkite.avroSchemaReflectClass=com.example.Event

Searchable Datasets

• Use Flume Solr Sink (in addition to HDFS Sink)
• Morphlines library to define fields to index
• SolrCloud runs on cluster from indexes in HDFS
• Future support in Kite to index selected fields automatically

Conclusion

Kite makes it easy to get data into Hadoop with a flexible schema model that is storage agnostic in a format that can be processed with a wide range of Hadoop tools

Getting Started With Kite

• Examples at github.com/kite-sdk/kite-examples
  • Working with streaming and random-access datasets
  • Logging events to datasets from a webapp
  • Running a periodic job
  • Migrating data from CSV to a Kite dataset
  • Converting an Avro dataset to a Parquet dataset
  • Writing and configuring Morphlines
  • Using Morphlines to write JSON records to a dataset

Questions?

kitesdk.org
@tom_e_white
tom@cloudera.com

Applications

• [Batch] Analyze an archive of songs [1]
• [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
• [Search] Searching email traffic in near-realtime [3]
• [ML] Detecting fraudulent transactions using clustering [4]

[1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
[2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
[3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
[4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/

… or use JDBC

Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection connection = DriverManager.getConnection(
    "jdbc:hive2://localhost:21050/;auth=noSasl");
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(
    "SELECT * FROM summaries");

Apps

• App – a packaged Java program that runs on a Hadoop cluster
• cdk:package-app – create a package on the local filesystem
  • like an exploded WAR
  • Oozie format
• cdk:deploy-app – copy packaged app to HDFS
• cdk:run-app – execute the app
• Workflow app – runs once
• Coordinator app – runs other apps (like cron)

Morphlines Example

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input	

<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22	

Output Record	

syslog_pri:164	

syslog_timestamp:Feb  4 10:46:14	

syslog_hostname:syslog	

syslog_program:sshd	

syslog_pid:607	

syslog_message:listening on 0.0.0.0 port 22.
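
To show what the grok expression extracts, the sketch below does the equivalent parsing with java.util.regex only. It is an approximation, not the Morphlines implementation: the character classes stand in for the grok dictionary patterns (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), and SyslogGrokSketch is a hypothetical class name. Java group names cannot contain underscores, so the groups are mapped back to the syslog_* field names when the record is built.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Approximates the grok expression with a plain regular expression.
// The sub-patterns are simplified stand-ins for the grok dictionary entries.
public class SyslogGrokSketch {
    private static final Pattern SYSLOG = Pattern.compile(
            "<(?<pri>\\d+)>"
            + "(?<timestamp>\\w{3}\\s+\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
            + "(?<hostname>\\S+) "
            + "(?<program>[^\\[:\\s]+)"
            + "(?:\\[(?<pid>\\d+)\\])?: "
            + "(?<message>.*)");

    /** Parses one syslog line into an ordered field map; pid may be null. */
    static Map<String, String> parse(String line) {
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a syslog line: " + line);
        }
        Map<String, String> record = new LinkedHashMap<>();
        record.put("syslog_pri", m.group("pri"));
        record.put("syslog_timestamp", m.group("timestamp"));
        record.put("syslog_hostname", m.group("hostname"));
        record.put("syslog_program", m.group("program"));
        record.put("syslog_pid", m.group("pid"));
        record.put("syslog_message", m.group("message"));
        return record;
    }

    public static void main(String[] args) {
        String line = "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22";
        parse(line).forEach((k, v) -> System.out.println(k + ":" + v));
    }
}
```
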

Weitere ähnliche Inhalte

Was ist angesagt?

Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Brian O'Neill
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezDataWorks Summit
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...Amazon Web Services
 
Lambda Architecture Using SQL
Lambda Architecture Using SQLLambda Architecture Using SQL
Lambda Architecture Using SQLSATOSHI TAGOMORI
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaYaroslav Tkachenko
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Alexey Kharlamov
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Amazon Web Services
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Zekeriya Besiroglu
 
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012Amazon Web Services
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014Amazon Web Services
 
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...Amazon Web Services
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Anton Kirillov
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScaleDataWorks Summit
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1Joe Stein
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkDataWorks Summit
 

Was ist angesagt? (20)

Processing and Analytics
Processing and AnalyticsProcessing and Analytics
Processing and Analytics
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics ...
 
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
Lambda Architecture Using SQL
Lambda Architecture Using SQLLambda Architecture Using SQL
Lambda Architecture Using SQL
 
Querying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS AthenaQuerying Data Pipeline with AWS Athena
Querying Data Pipeline with AWS Athena
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
Running Fast, Interactive Queries on Petabyte Datasets using Presto - AWS Jul...
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
BDT303 Data Science with Elastic MapReduce - AWS re: Invent 2012
 
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
(BDT403) Netflix's Next Generation Big Data Platform | AWS re:Invent 2014
 
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...
(SDD424) Simplifying Scalable Distributed Applications Using DynamoDB Streams...
 
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an...
 
Presto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte ScalePresto @ Netflix: Interactive Queries at Petabyte Scale
Presto @ Netflix: Interactive Queries at Petabyte Scale
 
SMACK Stack 1.1
SMACK Stack 1.1SMACK Stack 1.1
SMACK Stack 1.1
 
Building a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache SparkBuilding a unified data pipeline in Apache Spark
Building a unified data pipeline in Apache Spark
 

Ähnlich wie Building Hadoop Data Applications with Kite

Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kiteJoey Echeverria
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data CertificationAdam Doyle
 
drupal 7 amfserver presentation: integrating flash and drupal
drupal 7 amfserver presentation: integrating flash and drupaldrupal 7 amfserver presentation: integrating flash and drupal
drupal 7 amfserver presentation: integrating flash and drupalrolf vreijdenberger
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopEvans Ye
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseHao Chen
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014NoSQLmatters
 
Dok Talks #124 - Intro to Druid on Kubernetes
Dok Talks #124 - Intro to Druid on KubernetesDok Talks #124 - Intro to Druid on Kubernetes
Dok Talks #124 - Intro to Druid on KubernetesDoKC
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Startedabramsm
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauCodemotion
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWAREFernando Lopez Aguilar
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationFIWARE
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillDatabricks
 

Ähnlich wie Building Hadoop Data Applications with Kite (20)

Building data pipelines with kite
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Spark etl
Spark etlSpark etl
Spark etl
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Big Data Certification
Big Data CertificationBig Data Certification
Big Data Certification
 
drupal 7 amfserver presentation: integrating flash and drupal
drupal 7 amfserver presentation: integrating flash and drupaldrupal 7 amfserver presentation: integrating flash and drupal
drupal 7 amfserver presentation: integrating flash and drupal
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Dok Talks #124 - Intro to Druid on Kubernetes
Dok Talks #124 - Intro to Druid on KubernetesDok Talks #124 - Intro to Druid on Kubernetes
Dok Talks #124 - Intro to Druid on Kubernetes
 
Hydra - Getting Started
Hydra - Getting StartedHydra - Getting Started
Hydra - Getting Started
 
How to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin LeauHow to develop Big Data Pipelines for Hadoop, by Costin Leau
How to develop Big Data Pipelines for Hadoop, by Costin Leau
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 
Cosmos, Big Data GE Implementation
Cosmos, Big Data GE ImplementationCosmos, Big Data GE Implementation
Cosmos, Big Data GE Implementation
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 

Mehr von huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifactahuguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introhuguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoophuguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...huguk
 

Building Hadoop Data Applications with Kite

  • 1. Building Hadoop Data Applications with Kite
      Tom White, @tom_e_white
      Hadoop Users Group UK, London
      17 June 2014
  • 2. About me
      • Engineer at Cloudera working on Core Hadoop and Kite
      • Apache Hadoop Committer, PMC Member, Apache Member
      • Author of “Hadoop: The Definitive Guide”
  • 3. Hadoop 0.1
      % cat bigdata.txt | hadoop fs -put - in
      % hadoop MyJob in out
      % hadoop fs -get out
  • 4. Characteristics
      • Batch applications only
      • Low-level coding
      • File format
      • Serialization
      • Partitioning scheme
  • 6. Common Data, Many Tools
      # tools >> # file formats >> # file systems
  • 7. Glossary
      • Apache Avro – cross-language data serialization library
      • Apache Parquet (incubating) – column-oriented storage format for nested data
      • Apache Hive – data warehouse (SQL and metastore)
      • Apache Flume – streaming log capture and delivery system
      • Apache Oozie – workflow scheduler system
      • Apache Crunch – Java API for writing data pipelines
      • Impala – interactive SQL on Hadoop
  • 8. Outline
      • A Typical Application
      • Kite SDK
      • An Example
      • Advanced Kite
  • 9. A typical application (zoom 100:1)
  • 10. A typical application (zoom 10:1)
  • 11. A typical pipeline (zoom 5:1)
  • 13. Kite Codifies Best Practice as APIs, Tools, Docs and Examples
  • 14. Kite
      • A client-side library for writing Hadoop Data Applications
      • First release was in April 2013 as CDK
      • 0.14.1 last month
      • Open source, Apache 2 license, kitesdk.org
      • Modular
          • Data module (HDFS, Flume, Crunch, Hive, HBase)
          • Morphlines transformation module
          • Maven plugin
  • 16. Kite Data Module
      • Dataset – a collection of entities
      • DatasetRepository – physical storage location for datasets
      • DatasetDescriptor – holds dataset metadata (schema, format)
      • DatasetWriter – write entities to a dataset in a stream
      • DatasetReader – read entities from a dataset
      • http://kitesdk.org/docs/current/apidocs/index.html
  • 17. 1. Define the Event Entity
      public class Event {
        private long id;
        private long timestamp;
        private String source;
        // getters and setters
      }
  • 18. 2. Create the Events Dataset
      DatasetRepository repo = DatasetRepositories.open("repo:hive");
      DatasetDescriptor descriptor =
          new DatasetDescriptor.Builder()
              .schema(Event.class).build();
      repo.create("events", descriptor);
  • 19. (2. or with the Maven plugin)
      $ mvn kite:create-dataset \
          -Dkite.repositoryUri='repo:hive' \
          -Dkite.datasetName=events \
          -Dkite.avroSchemaReflectClass=com.example.Event
  • 20. A peek at the Avro schema
      $ hive -e "DESCRIBE EXTENDED events"
      ...
      {
        "type" : "record",
        "name" : "Event",
        "namespace" : "com.example",
        "fields" : [
          { "name" : "id", "type" : "long" },
          { "name" : "timestamp", "type" : "long" },
          { "name" : "source", "type" : "string" }
        ]
      }
  • 21. 3. Write Events
      Logger logger = Logger.getLogger(...);
      Event event = new Event();
      event.setId(id);
      event.setTimestamp(System.currentTimeMillis());
      event.setSource(source);
      logger.info(event);
  • 22. Log4j configuration
      log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
      log4j.appender.flume.Hostname = localhost
      log4j.appender.flume.Port = 41415
      log4j.appender.flume.DatasetRepositoryUri = repo:hive
      log4j.appender.flume.DatasetName = events
  • 23. The resulting file layout
      /user
        /hive
          /warehouse
            /events
              /FlumeData.1375659013795    (Avro files)
              /FlumeData.1375659013796
  • 24. 4. Generate Summaries with Crunch
      PCollection<Event> events = read(asSource(repo.load("events"), Event.class));
      PCollection<Summary> summaries = events
          .by(new GetTimeBucket(), // minute of day, source
              Avros.pairs(Avros.longs(), Avros.strings()))
          .groupByKey()
          .parallelDo(new MakeSummary(),
              Avros.reflects(Summary.class));
      write(summaries, asTarget(repo.load("summaries")));
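To make the "minute of day, source" bucketing in slide 24 concrete, here is a minimal plain-Java sketch of the same grouping idea, with no Crunch or Kite dependency. The slides do not show `GetTimeBucket` or `MakeSummary`, so everything here is an assumption for illustration: the class `TimeBucketSketch`, its nested `Event`, the use of UTC, and a simple count as the "summary".

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.temporal.ChronoField;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the GetTimeBucket/MakeSummary code from the talk.
final class TimeBucketSketch {

    // Stand-in for the Event entity: only the two fields the bucketing needs.
    static final class Event {
        final long timestamp; // epoch milliseconds
        final String source;

        Event(long timestamp, String source) {
            this.timestamp = timestamp;
            this.source = source;
        }
    }

    // Minute of the day (0..1439) for an epoch-millisecond timestamp, assuming UTC.
    static int minuteOfDay(long timestampMillis) {
        return Instant.ofEpochMilli(timestampMillis)
                .atZone(ZoneOffset.UTC)
                .get(ChronoField.MINUTE_OF_DAY);
    }

    // Groups events by (minute of day, source) and counts each group.
    // Keys have the form "<minute>|<source>".
    static Map<String, Integer> countByBucket(List<Event> events) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            String key = minuteOfDay(e.timestamp) + "|" + e.source;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(Instant.parse("2014-06-17T10:46:14Z").toEpochMilli(), "a"),
                new Event(Instant.parse("2014-06-17T10:46:50Z").toEpochMilli(), "a"),
                new Event(Instant.parse("2014-06-17T10:47:00Z").toEpochMilli(), "a"));
        System.out.println(countByBucket(events));
    }
}
```

In the Crunch pipeline the grouping runs in parallel across the cluster; the in-memory map above only shows what each key and count would look like.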
  • 25. … and run using Maven
      $ mvn kite:create-dataset -Dkite.datasetName=summaries ...
      <plugin>
        <groupId>org.kitesdk</groupId>
        <artifactId>kite-maven-plugin</artifactId>
        <configuration>
          <toolClass>com.example.GenerateSummaries</toolClass>
        </configuration>
      </plugin>
      $ mvn kite:run-tool
  • 26. 5. Query with Impala
      $ impala-shell -q 'DESCRIBE events'
      +-----------+--------+-------------------+
      | name      | type   | comment           |
      +-----------+--------+-------------------+
      | id        | bigint | from deserializer |
      | timestamp | bigint | from deserializer |
      | source    | string | from deserializer |
      +-----------+--------+-------------------+
  • 27. … Ad Hoc Queries
      $ impala-shell -q 'SELECT source, COUNT(1) AS cnt FROM events GROUP BY source'
      +--------------------------------------+-----+
      | source                               | cnt |
      +--------------------------------------+-----+
      | 018dc1b6-e6b0-489e-bce3-115917e00632 | 38  |
      | bc80040e-09d1-4ad2-8bd8-82afd1b8431a | 85  |
      +--------------------------------------+-----+
      Returned 2 row(s) in 0.56s
  • 29. Unified Storage Interface
      • Dataset – streaming access, HDFS storage
      • RandomAccessDataset – random access, HBase storage
      • PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase
  • 30. Filesystem Partitions
      PartitionStrategy p = new PartitionStrategy.Builder()
          .year("timestamp")
          .month("timestamp")
          .day("timestamp").build();

      /user/hive/warehouse/events
        /year=2014/month=02/day=08
          /FlumeData.1375659013795
          /FlumeData.1375659013796
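The year/month/day layout in slide 30 can be read as a function from an entity's timestamp to a directory path. The plain-Java sketch below illustrates that mapping. It is not Kite's implementation: the class and method names are hypothetical, and the use of UTC and zero-padded two-digit month and day values are assumptions taken from the example path on the slide.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical illustration of a year/month/day partition path; not Kite code.
final class PartitionPathSketch {

    // Maps an epoch-millisecond timestamp to a relative partition directory,
    // e.g. "year=2014/month=02/day=08". UTC is an assumption of this sketch.
    static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%04d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        long ts = Instant.parse("2014-02-08T12:00:00Z").toEpochMilli();
        System.out.println("/user/hive/warehouse/events/" + partitionPath(ts));
    }
}
```

Because the path is derived from the entity itself, a query that filters on a date range only needs to read the matching directories.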
  • 31. HBase Keys: Defined in Avro
      {
        "name": "username",
        "type": "string",
        "mapping": { "type": "key", "value": "0" }
      },
      {
        "name": "favoriteColor",
        "type": "string",
        "mapping": { "type": "column", "value": "meta:fc" }
      }
  • 32. Random Access Dataset: Creation
      RandomAccessDatasetRepository repo = DatasetRepositories.openRandomAccess(
          "repo:hbase:localhost");
      RandomAccessDataset<User> users = repo.load("users");
      users.put(new User("bill", "green"));
      users.put(new User("alice", "blue"));
  • 33. Random Access Dataset: Retrieval
      Key key = new Key.Builder(users)
          .add("username", "bill").build();
      User bill = users.get(key);
  • 34. Views
      View<User> view = users.from("username", "bill");
      DatasetReader<User> reader = view.newReader();
      reader.open();
      for (User user : reader) {
        System.out.println(user);
      }
      reader.close();
  • 35. Parallel Processing
      • Goal is for Hadoop processing frameworks to “just work”
      • Support Formats, Partitions, Views
      • Native Kite components, e.g. DatasetOutputFormat for MR

                   HDFS Dataset   HBase Dataset
      Crunch       Yes            Yes
      MapReduce    Yes            Yes
      Hive         Yes            Planned
      Impala       Yes            Planned
  • 36. Schema Evolution
      public class Event {
        private long id;
        private long timestamp;
        private String source;
        @Nullable private String ipAddress;
      }

      $ mvn kite:update-dataset \
          -Dkite.datasetName=events \
          -Dkite.avroSchemaReflectClass=com.example.Event
  • 37. Searchable Datasets
      • Use Flume Solr Sink (in addition to HDFS Sink)
      • Morphlines library to define fields to index
      • SolrCloud runs on cluster from indexes in HDFS
      • Future support in Kite to index selected fields automatically
  • 39. Kite makes it easy to get data into Hadoop with a flexible schema model that is storage agnostic in a format that can be processed with a wide range of Hadoop tools
  • 40. Getting Started With Kite
      • Examples at github.com/kite-sdk/kite-examples
          • Working with streaming and random-access datasets
          • Logging events to datasets from a webapp
          • Running a periodic job
          • Migrating data from CSV to a Kite dataset
          • Converting an Avro dataset to a Parquet dataset
          • Writing and configuring Morphlines
          • Using Morphlines to write JSON records to a dataset
  • 41. Questions?
      kitesdk.org
      @tom_e_white
      tom@cloudera.com
  • 43. Applications
      • [Batch] Analyze an archive of songs [1]
      • [Interactive SQL] Ad hoc queries on recommendations from social media applications [2]
      • [Search] Searching email traffic in near-realtime [3]
      • [ML] Detecting fraudulent transactions using clustering [4]

      [1] http://blog.cloudera.com/blog/2012/08/process-a-million-songs-with-apache-pig/
      [2] http://blog.cloudera.com/blog/2014/01/how-wajam-answers-business-questions-faster-with-hadoop/
      [3] http://blog.cloudera.com/blog/2013/09/email-indexing-using-cloudera-search/
      [4] http://blog.cloudera.com/blog/2013/03/cloudera_ml_data_science_tools/
  • 44. … or use JDBC
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection connection = DriverManager.getConnection(
          "jdbc:hive2://localhost:21050/;auth=noSasl");
      Statement statement = connection.createStatement();
      ResultSet resultSet = statement.executeQuery(
          "SELECT * FROM summaries");
  • 45. Apps
      • App – a packaged Java program that runs on a Hadoop cluster
      • cdk:package-app – create a package on the local filesystem
          • like an exploded WAR
          • Oozie format
      • cdk:deploy-app – copy packaged app to HDFS
      • cdk:run-app – execute the app
          • Workflow app – runs once
          • Coordinator app – runs other apps (like cron)
  • 46. Morphlines Example
      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]

      Example Input
      <164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

      Output Record
      syslog_pri:164
      syslog_timestamp:Feb  4 10:46:14
      syslog_hostname:syslog
      syslog_program:sshd
      syslog_pid:607
      syslog_message:listening on 0.0.0.0 port 22.
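To show what the grok expression in slide 46 extracts, the sketch below approximates it with a plain `java.util.regex` pattern. This is an assumption-laden stand-in, not Morphlines: the class name `SyslogGrokSketch` is hypothetical, and the hand-written sub-patterns are much simpler than the real grok dictionary entries (`SYSLOGTIMESTAMP`, `SYSLOGHOST`, and so on). It only aims to reproduce the field split shown for the example input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of the slide's grok expression; not Morphlines code.
final class SyslogGrokSketch {

    // <pri>timestamp hostname program[pid]: message, where "[pid]" is optional.
    private static final Pattern SYSLOG = Pattern.compile(
            "^<(?<pri>\\d+)>"
            + "(?<timestamp>\\w{3} +\\d{1,2} \\d{2}:\\d{2}:\\d{2}) "
            + "(?<hostname>\\S+) "
            + "(?<program>[^\\[:]+)"
            + "(?:\\[(?<pid>\\d+)\\])?: "
            + "(?<message>.*)$");

    // Returns the extracted fields in order, or an empty map if the line does not match.
    static Map<String, String> parse(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            return record;
        }
        record.put("syslog_pri", m.group("pri"));
        record.put("syslog_timestamp", m.group("timestamp"));
        record.put("syslog_hostname", m.group("hostname"));
        record.put("syslog_program", m.group("program"));
        if (m.group("pid") != null) {
            record.put("syslog_pid", m.group("pid"));
        }
        record.put("syslog_message", m.group("message"));
        return record;
    }

    public static void main(String[] args) {
        String line = "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22";
        parse(line).forEach((k, v) -> System.out.println(k + ":" + v));
    }
}
```

In a real morphline the grok command keeps these patterns in dictionary files, so the same named pieces can be reused across expressions instead of being spelled out inline.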