SlideShare ist ein Scribd-Unternehmen logo
1 von 41
Downloaden Sie, um offline zu lesen
© Cloudera, Inc. All rights reserved.
Lab #1 - VM setup
http://tiny.cloudera.com/StrataLab1
Lab #2 - Create a movies dataset
http://tiny.cloudera.com/StrataLab2
© Cloudera, Inc. All rights reserved.
Strata London 2015
Building an Apache Hadoop
Data Application
Ryan Blue, Joey Echeverria, Tom White
© Cloudera, Inc. All rights reserved.
Content for today’s tutorial
●The Hadoop Ecosystem
●Storage on Hadoop
●Movie ratings app: Data ingest
●Movie ratings app: Data analysis
© Cloudera, Inc. All rights reserved.
The Hadoop Ecosystem
© Cloudera, Inc. All rights reserved.
A Hadoop Stack
© Cloudera, Inc. All rights reserved.
Processing frameworks
●Code: MapReduce, Crunch, Spark, Tez
●SQL: Hive, Impala, Phoenix, Trafodian, Drill, Presto
●Tuples: Cascading, Pig
●Streaming: Spark streaming (micro-batch), Storm, Samza
© Cloudera, Inc. All rights reserved.
Coding frameworks
●Crunch
● A layer around MR (or Spark) that simplifies writing pipelines
●Spark
● A completely new framework for processing pipelines
● Takes advantage of memory, runs a DAG without extra map phases
●Tez
● DAG-based, like Spark’s execution engine without user-level API
© Cloudera, Inc. All rights reserved.
SQL on Hadoop
●Hive for batch processing
●Impala for low-latency queries
●Phoenix and Trafodion for transactional queries on HBase
© Cloudera, Inc. All rights reserved.
Ingest tools
●Relational: Sqoop, Sqoop2
●Record channel: Kafka, Flume
●Files: NiFi
●Numerous commercial options
© Cloudera, Inc. All rights reserved.
Ingest tools
●Relational: Sqoop, Sqoop2
●Record channel: Kafka, Flume
●Files: NiFi
Database
App
HDFS
Hadoop
FlumeFlume
© Cloudera, Inc. All rights reserved.
Relational DB to Hadoop
●Sqoop
● CLI to run MR-based import jobs
●Sqoop2
● Fixes configuration problems with Sqoop with credentials service
● More flexible to run on non-MR frameworks
● New and under active development
© Cloudera, Inc. All rights reserved.
Ingest tools
●Relational: Sqoop, Sqoop2
●Record channel: Kafka, Flume
●Files: NiFi
Database
App
HDFS
Hadoop
FlumeFlumeChannels
© Cloudera, Inc. All rights reserved.
Record streams to Hadoop
●Flume - source, channel, sink architecture
● Well-established and integrated with other tools
● No order guarantee, duplicates are possible
●Kafka - pub-sub model for low latencies
● Partitioned, provides ordering guarantees, easier to eliminate duplicates
● More resilient to node failure with consumer groups
© Cloudera, Inc. All rights reserved.
Files to Hadoop
●NiFi
● Web GUI for drag & drop configuration of a data flow
● Enterprise features: back-pressure, monitoring, lineage, etc.
● Integration to and from spool directory, HTTP, FTP, SFTP, and HDFS
● Originally for files, capable of handling record-based streams
● Currently in the Apache Incubator, and widely deployed privately
© Cloudera, Inc. All rights reserved.
NiFi
© Cloudera, Inc. All rights reserved.
Data storage in Hadoop
© Cloudera, Inc. All rights reserved.
data1.avro
...
HDFS Blocks
●Blocks
● Increase parallelism
● Balance work
● Replicated
●Configured by dfs.blocksize
● Client-side setting
data2.avro
© Cloudera, Inc. All rights reserved.
Splittable File Formats
●Splittable: Able to process part of a file
● Process blocks in parallel
●Avro is splittable
●Gzipped content is not splittable
●CSV is effectively not splittable
© Cloudera, Inc. All rights reserved.
File formats
●Existing formats: XML, JSON, Protobuf, Thrift
●Designed for Hadoop: SequenceFile, RCFile, ORC
●Makes me sad: Delimited text
●Recommended: Avro or Parquet
© Cloudera, Inc. All rights reserved.
Avro
●Recommended row-oriented format
● Broken into blocks with sync markers for splitting
● Binary encoding with block-level compression
●Avro schema
● Required to read any binary-encoded data!
● Written in the file header
●Flexible object models
© Cloudera, Inc. All rights reserved.
Avro in-memory object models
●generic
● Object model that can be used with any schema
●specific - compile schema to java object
● Generates type-safe runtime objects
●reflect - java object to schema
● Uses existing classes and objects
© Cloudera, Inc. All rights reserved.
Lab #3 - Using avro-tools
http://tiny.cloudera.com/StrataLab3
© Cloudera, Inc. All rights reserved.
Row- and column-oriented formats
●Able to reduce I/O when projecting columns
●Better encoding and compression
Images © Twitter, Inc.
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
© Cloudera, Inc. All rights reserved.
Parquet
●Recommended column-oriented format
● Splittable by organizing into row groups
● Efficient binary encoding, supports compression
●Uses other object models
● Record construction API rather than object model
● parquet-avro - Use Avro schemas with generic or specific records
● parquet-protobuf, parquet-thrift, parquet-hive, etc.
© Cloudera, Inc. All rights reserved.
Parquet trade-offs
●Rows are buffered into groups that target a final size
●Row group size
● Memory consumption grows with row group size
● Larger groups get more I/O benefit and better encoding
●Memory consumption grows for each open file
© Cloudera, Inc. All rights reserved.
Lab #4 - Using parquet-tools
http://tiny.cloudera.com/StrataLab4
© Cloudera, Inc. All rights reserved.
Partitioning
●Splittable file formats aren’t enough
●Not processing data is better than processing in parallel
●Organize data to avoid processing: Partitioning
●Use HDFS paths for a coarse index: data/y=2015/m=03/d=14/
© Cloudera, Inc. All rights reserved.
Partitioning Caution
●Partitioning in HDFS is the primary index to data
● Should reflect the most common access pattern
● Test partition strategies for multiple workloads
●Should balance file size with workload
● Lots of small files are bad for HDFS - partitioning should be more coarse
● Larger files take longer to find data - partitioning should be more specific
© Cloudera, Inc. All rights reserved.
Implementing partitioning
●Build your own - not recommended
●Hive and Impala managed
● Partitions are treated as data columns
● Insert statements must include partition calculations
●Kite managed
● Partition strategy configuration file
● Compatible with Hive and Impala
© Cloudera, Inc. All rights reserved.
Kite
●High-level data API for Hadoop
● Built around datasets, not files
● Tasks like partitioning are done internally
●Tools built around the data API
● Command-line
● Integration in Flume, Sqoop, NiFi, etc.
© Cloudera, Inc. All rights reserved.
Lab #5 - Create a partitioned
dataset
http://tiny.cloudera.com/StrataLab5
© Cloudera, Inc. All rights reserved.
Movie ratings app:
Data ingest pipeline
© Cloudera, Inc. All rights reserved.
Movie ratings scenario
●Your company runs a web application where users can rate movies
●You want to use Hadoop to analyze ratings over time
● Avoid scraping the production database for changes
● Instead, you want to log every rating submitted
Database
Ratings App
© Cloudera, Inc. All rights reserved.
Movie ratings app
●Log ratings to Flume
●Otherwise unchanged
Database
Ratings App
HDFSRatings
dataset
Hadoop
FlumeFlumeFlume
© Cloudera, Inc. All rights reserved.
Lab #6 - Create a Flume pipeline
http://tiny.cloudera.com/StrataLab6
© Cloudera, Inc. All rights reserved.
Movie ratings app:
Analyzing ratings data
© Cloudera, Inc. All rights reserved.
●Now you have several months of data
●You can query it in Hive and Impala for most cases
●Some questions are difficult to formulate as SQL
● Are there any movies that people either love or hate?
Movie ratings analysis
© Cloudera, Inc. All rights reserved.
Analyzing ratings
●Map
● Extract key, movie_id, and value, rating
●Reduce:
● Reduce groups all of the ratings by movie_id
● Count the number of ratings for each movie
● If there are two peaks, output the movie_id and counts
● Peak detection: difference between counts goes from negative to positive
© Cloudera, Inc. All rights reserved.
Crunch background
●Stack up functions until a group-by operation to make a map phase
●Similarly, stack up functions after a group-by to make a reduce phase
●Additional group-by operations set up more MR rounds automatically
PTable<Long, Double> table = collection
.by(new GetMovieID(), Avros.longs())
.mapValues(new GetRating(), Avros.ints())
.groupByKey()
.mapValues(new AverageRating(), Avros.doubles());
© Cloudera, Inc. All rights reserved.
Lab #7 - Analyze ratings with Crunch
http://tiny.cloudera.com/StrataLab7
© Cloudera, Inc. All rights reserved.
Thank you
blue@cloudera.com
joey@rocana.com
tom@cloudera.com
http://ingest.tips/

Weitere ähnliche Inhalte

Was ist angesagt?

Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...SpringPeople
 
Distributing Data The Aerospike Way
Distributing Data The Aerospike WayDistributing Data The Aerospike Way
Distributing Data The Aerospike WayAerospike, Inc.
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentDataWorks Summit
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike ArchitecturePeter Milne
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valleymarkgrover
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangSpark Summit
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014spinningmatt
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetupWei Ting Chen
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsAlluxio, Inc.
 
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...In-Memory Computing Summit
 
Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...Wei Gong
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraDataWorks Summit
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila finalWei Ting Chen
 

Was ist angesagt? (20)

Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
Best Practices for Administering Hadoop with Hortonworks Data Platform (HDP) ...
 
Distributing Data The Aerospike Way
Distributing Data The Aerospike WayDistributing Data The Aerospike Way
Distributing Data The Aerospike Way
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake Environment
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Aerospike Architecture
Aerospike ArchitectureAerospike Architecture
Aerospike Architecture
 
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon ValleyIntro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup20150314 sahara intro and the future plan for open stack meetup
20150314 sahara intro and the future plan for open stack meetup
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast AnalyticsGetting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
Getting Started with Apache Spark and Alluxio for Blazingly Fast Analytics
 
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
IMCSummit 2015 - Day 2 Developer Track - Anatomy of an In-Memory Data Fabric:...
 
Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...Spectrum Scale - Diversified analytic solution based on various storage servi...
Spectrum Scale - Diversified analytic solution based on various storage servi...
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 
Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...Dancing elephants - efficiently working with object stores from Apache Spark ...
Dancing elephants - efficiently working with object stores from Apache Spark ...
 
20151027 sahara + manila final
20151027 sahara + manila final20151027 sahara + manila final
20151027 sahara + manila final
 

Andere mochten auch

Building Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom WhiteBuilding Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom WhiteThe Hive
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016Jayesh Thakrar
 
What is UX and how can it help your organisation?
What is UX and how can it help your organisation?What is UX and how can it help your organisation?
What is UX and how can it help your organisation?Ned Potter
 
Mobile UX for Academic Libraries
Mobile UX for Academic LibrariesMobile UX for Academic Libraries
Mobile UX for Academic LibrariesKevin Rundblad
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestionVinod Nayal
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesNed Potter
 

Andere mochten auch (7)

Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 
Building Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom WhiteBuilding Hadoop Data Applications with Kite by Tom White
Building Hadoop Data Applications with Kite by Tom White
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
 
What is UX and how can it help your organisation?
What is UX and how can it help your organisation?What is UX and how can it help your organisation?
What is UX and how can it help your organisation?
 
Mobile UX for Academic Libraries
Mobile UX for Academic LibrariesMobile UX for Academic Libraries
Mobile UX for Academic Libraries
 
Hadoop data ingestion
Hadoop data ingestionHadoop data ingestion
Hadoop data ingestion
 
UX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and ArchivesUX, ethnography and possibilities: for Libraries, Museums and Archives
UX, ethnography and possibilities: for Libraries, Museums and Archives
 

Ähnlich wie Building an Apache Hadoop data application

Kite (Big Data Applications Meetup @ Cask)
Kite (Big Data Applications Meetup @ Cask)Kite (Big Data Applications Meetup @ Cask)
Kite (Big Data Applications Meetup @ Cask)_blue
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Stefan Lipp
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
How to integrate OpenStack Swift to your "legacy" system
How to integrate OpenStack Swift to your "legacy" systemHow to integrate OpenStack Swift to your "legacy" system
How to integrate OpenStack Swift to your "legacy" systemMasaaki Nakagawa
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Julien Le Dem
 
Webinar: The Future of Hadoop
Webinar: The Future of HadoopWebinar: The Future of Hadoop
Webinar: The Future of HadoopCloudera, Inc.
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Cloudera, Inc.
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo OverviewBill Havanki
 
Webinar: Three Reasons Why NAS is No Good for AI and Machine Learning
Webinar: Three Reasons Why NAS is No Good for AI and Machine LearningWebinar: Three Reasons Why NAS is No Good for AI and Machine Learning
Webinar: Three Reasons Why NAS is No Good for AI and Machine LearningStorage Switzerland
 
OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017
OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017
OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017Cloud Native Day Tel Aviv
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityDinesh Chitlangia
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 

Ähnlich wie Building an Apache Hadoop data application (20)

Kite (Big Data Applications Meetup @ Cask)
Kite (Big Data Applications Meetup @ Cask)Kite (Big Data Applications Meetup @ Cask)
Kite (Big Data Applications Meetup @ Cask)
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
How to integrate OpenStack Swift to your "legacy" system
How to integrate OpenStack Swift to your "legacy" systemHow to integrate OpenStack Swift to your "legacy" system
How to integrate OpenStack Swift to your "legacy" system
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Webinar: The Future of Hadoop
Webinar: The Future of HadoopWebinar: The Future of Hadoop
Webinar: The Future of Hadoop
 
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
The great 8 of ODA
The great 8 of ODAThe great 8 of ODA
The great 8 of ODA
 
Apache Accumulo Overview
Apache Accumulo OverviewApache Accumulo Overview
Apache Accumulo Overview
 
Webinar: Three Reasons Why NAS is No Good for AI and Machine Learning
Webinar: Three Reasons Why NAS is No Good for AI and Machine LearningWebinar: Three Reasons Why NAS is No Good for AI and Machine Learning
Webinar: Three Reasons Why NAS is No Good for AI and Machine Learning
 
OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017
OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017
OpenStack and NetApp - Chen Reuven - OpenStack Day Israel 2017
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalability
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 

Kürzlich hochgeladen

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 

Kürzlich hochgeladen (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

Building an Apache Hadoop data application

  • 1. © Cloudera, Inc. All rights reserved. Lab #1 - VM setup http://tiny.cloudera.com/StrataLab1 Lab #2 - Create a movies dataset http://tiny.cloudera.com/StrataLab2
  • 2. © Cloudera, Inc. All rights reserved. Strata London 2015 Building an Apache Hadoop Data Application Ryan Blue, Joey Echeverria, Tom White
  • 3. © Cloudera, Inc. All rights reserved. Content for today’s tutorial ●The Hadoop Ecosystem ●Storage on Hadoop ●Movie ratings app: Data ingest ●Movie ratings app: Data analysis
  • 4. © Cloudera, Inc. All rights reserved. The Hadoop Ecosystem
  • 5. © Cloudera, Inc. All rights reserved. A Hadoop Stack
  • 6. © Cloudera, Inc. All rights reserved. Processing frameworks ●Code: MapReduce, Crunch, Spark, Tez ●SQL: Hive, Impala, Phoenix, Trafodian, Drill, Presto ●Tuples: Cascading, Pig ●Streaming: Spark streaming (micro-batch), Storm, Samza
  • 7. © Cloudera, Inc. All rights reserved. Coding frameworks ●Crunch ● A layer around MR (or Spark) that simplifies writing pipelines ●Spark ● A completely new framework for processing pipelines ● Takes advantage of memory, runs a DAG without extra map phases ●Tez ● DAG-based, like Spark’s execution engine without user-level API
  • 8. © Cloudera, Inc. All rights reserved. SQL on Hadoop ●Hive for batch processing ●Impala for low-latency queries ●Phoenix and Trafodion for transactional queries on HBase
  • 9. © Cloudera, Inc. All rights reserved. Ingest tools ●Relational: Sqoop, Sqoop2 ●Record channel: Kafka, Flume ●Files: NiFi ●Numerous commercial options
  • 10. © Cloudera, Inc. All rights reserved. Ingest tools ●Relational: Sqoop, Sqoop2 ●Record channel: Kafka, Flume ●Files: NiFi Database App HDFS Hadoop FlumeFlume
  • 11. © Cloudera, Inc. All rights reserved. Relational DB to Hadoop ●Sqoop ● CLI to run MR-based import jobs ●Sqoop2 ● Fixes configuration problems with Sqoop with credentials service ● More flexible to run on non-MR frameworks ● New and under active development
  • 12. © Cloudera, Inc. All rights reserved. Ingest tools ●Relational: Sqoop, Sqoop2 ●Record channel: Kafka, Flume ●Files: NiFi Database App HDFS Hadoop FlumeFlumeChannels
  • 13. © Cloudera, Inc. All rights reserved. Record streams to Hadoop ●Flume - source, channel, sink architecture ● Well-established and integrated with other tools ● No order guarantee, duplicates are possible ●Kafka - pub-sub model for low latencies ● Partitioned, provides ordering guarantees, easier to eliminate duplicates ● More resilient to node failure with consumer groups
  • 14. © Cloudera, Inc. All rights reserved. Files to Hadoop ●NiFi ● Web GUI for drag & drop configuration of a data flow ● Enterprise features: back-pressure, monitoring, lineage, etc. ● Integration to and from spool directory, HTTP, FTP, SFTP, and HDFS ● Originally for files, capable of handling record-based streams ● Currently in the Apache Incubator, and widely deployed privately
  • 15. © Cloudera, Inc. All rights reserved. NiFi
  • 16. © Cloudera, Inc. All rights reserved. Data storage in Hadoop
  • 17. © Cloudera, Inc. All rights reserved. data1.avro ... HDFS Blocks ●Blocks ● Increase parallelism ● Balance work ● Replicated ●Configured by dfs.blocksize ● Client-side setting data2.avro
  • 18. © Cloudera, Inc. All rights reserved. Splittable File Formats ●Splittable: Able to process part of a file ● Process blocks in parallel ●Avro is splittable ●Gzipped content is not splittable ●CSV is effectively not splittable
  • 19. © Cloudera, Inc. All rights reserved. File formats ●Existing formats: XML, JSON, Protobuf, Thrift ●Designed for Hadoop: SequenceFile, RCFile, ORC ●Makes me sad: Delimited text ●Recommended: Avro or Parquet
  • 20. © Cloudera, Inc. All rights reserved. Avro ●Recommended row-oriented format ● Broken into blocks with sync markers for splitting ● Binary encoding with block-level compression ●Avro schema ● Required to read any binary-encoded data! ● Written in the file header ●Flexible object models
  • 21. © Cloudera, Inc. All rights reserved. Avro in-memory object models ●generic ● Object model that can be used with any schema ●specific - compile schema to java object ● Generates type-safe runtime objects ●reflect - java object to schema ● Uses existing classes and objects
  • 22. © Cloudera, Inc. All rights reserved. Lab #3 - Using avro-tools http://tiny.cloudera.com/StrataLab3
  • 23. © Cloudera, Inc. All rights reserved. Row- and column-oriented formats ●Able to reduce I/O when projecting columns ●Better encoding and compression Images © Twitter, Inc. https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  • 24. © Cloudera, Inc. All rights reserved. Parquet ●Recommended column-oriented format ● Splittable by organizing into row groups ● Efficient binary encoding, supports compression ●Uses other object models ● Record construction API rather than object model ● parquet-avro - Use Avro schemas with generic or specific records ● parquet-protobuf, parquet-thrift, parquet-hive, etc.
  • 25. © Cloudera, Inc. All rights reserved. Parquet trade-offs ●Rows are buffered into groups that target a final size ●Row group size ● Memory consumption grows with row group size ● Larger groups get more I/O benefit and better encoding ●Memory consumption grows for each open file
  • 26. © Cloudera, Inc. All rights reserved. Lab #4 - Using parquet-tools http://tiny.cloudera.com/StrataLab4
  • 27. © Cloudera, Inc. All rights reserved. Partitioning ●Splittable file formats aren’t enough ●Not processing data is better than processing in parallel ●Organize data to avoid processing: Partitioning ●Use HDFS paths for a coarse index: data/y=2015/m=03/d=14/
  • 28. © Cloudera, Inc. All rights reserved. Partitioning Caution ●Partitioning in HDFS is the primary index to data ● Should reflect the most common access pattern ● Test partition strategies for multiple workloads ●Should balance file size with workload ● Lots of small files are bad for HDFS - partitioning should be more coarse ● Larger files take longer to find data - partitioning should be more specific
  • 29. © Cloudera, Inc. All rights reserved. Implementing partitioning ●Build your own - not recommended ●Hive and Impala managed ● Partitions are treated as data columns ● Insert statements must include partition calculations ●Kite managed ● Partition strategy configuration file ● Compatible with Hive and Impala
  • 30. © Cloudera, Inc. All rights reserved. Kite ●High-level data API for Hadoop ● Built around datasets, not files ● Tasks like partitioning are done internally ●Tools built around the data API ● Command-line ● Integration in Flume, Sqoop, NiFi, etc.
  • 31. © Cloudera, Inc. All rights reserved. Lab #5 - Create a partitioned dataset http://tiny.cloudera.com/StrataLab5
  • 32. © Cloudera, Inc. All rights reserved. Movie ratings app: Data ingest pipeline
  • 33. © Cloudera, Inc. All rights reserved. Movie ratings scenario ●Your company runs a web application where users can rate movies ●You want to use Hadoop to analyze ratings over time ● Avoid scraping the production database for changes ● Instead, you want to log every rating submitted Database Ratings App
  • 34. © Cloudera, Inc. All rights reserved. Movie ratings app ●Log ratings to Flume ●Otherwise unchanged Database Ratings App HDFSRatings dataset Hadoop FlumeFlumeFlume
  • 35. © Cloudera, Inc. All rights reserved. Lab #6 - Create a Flume pipeline http://tiny.cloudera.com/StrataLab6
  • 36. © Cloudera, Inc. All rights reserved. Movie ratings app: Analyzing ratings data
  • 37. © Cloudera, Inc. All rights reserved. ●Now you have several months of data ●You can query it in Hive and Impala for most cases ●Some questions are difficult to formulate as SQL ● Are there any movies that people either love or hate? Movie ratings analysis
  • 38. © Cloudera, Inc. All rights reserved. Analyzing ratings ●Map ● Extract key, movie_id, and value, rating ●Reduce: ● Reduce groups all of the ratings by movie_id ● Count the number of ratings for each movie ● If there are two peaks, output the movie_id and counts ● Peak detection: difference between counts goes from negative to positive
  • 39. © Cloudera, Inc. All rights reserved. Crunch background ●Stack up functions until a group-by operation to make a map phase ●Similarly, stack up functions after a group-by to make a reduce phase ●Additional group-by operations set up more MR rounds automatically PTable<Long, Double> table = collection .by(new GetMovieID(), Avros.longs()) .mapValues(new GetRating(), Avros.ints()) .groupByKey() .mapValues(new AverageRating(), Avros.doubles());
  • 40. © Cloudera, Inc. All rights reserved. Lab #7 - Analyze ratings with Crunch http://tiny.cloudera.com/StrataLab7
  • 41. © Cloudera, Inc. All rights reserved. Thank you blue@cloudera.com joey@rocana.com tom@cloudera.com http://ingest.tips/