SlideShare a Scribd company logo
1 of 39
Download to read offline
An Introduction to Hadoop, Mahout & HBase

          Lukáš Vlček, JBUG Brno, May 2012
Hadoop

• Open Source (ASL2) implementation of
  Google's MapReduce[1] and Google
  DFS (Distributed File System) [2]
  (~from 2006 Lucene subproject)
 [1] MapReduce: Simplified Data Processing on Large Scale Clusters by
 Jeffrey Dean and Sanjay Ghemawat, Google labs, 2004

 [2] The Google File System by Sanjay Ghemawat, Howard Gobioff, Shun-Tak
 Leung, 2003
Hadoop

• MapReduce
      Simple programming model for data processing
 putting parallel large-scale data analysis into hands of
 masses.
• HDFS
      A filesystem designed for storing large files with
 streaming data access patterns, running on clusters of
 commodity hardware.
• (Common, + other related projects...)
MapReduce
      programming model



map      (k1,v1)         →   list(k2,v2)
reduce   (k2,list(v2))   →   list(v3)
MapReduce
           “Hello World”


• Counting the number of occurrences of
  each word in collection of documents.
MapReduce
        “Hello World”
      map(k1,v1) → list(k2,v2)

map(key, value){
     // key: document name
     // value: document content
     for each word w in value {
            emitIntermediate(w,1)
     }
}
MapReduce
         “Hello World”
   reduce(k2,list(v2)) → list(v3)
reduce(key, values){
     // key: a word
     // value: a list of counts
     int result = 0;
     for each word v in values {
            result += v;
     }
     emit(result);
}
MapReduce – benefits

• The model is easy to use
    Steep learning curve.

• Many problems are expressible as
  MapReduce computations

• MapReduce scales to large clusters
    Such model makes it easy to parallelize and distribute
    your computation to thousands of machines!
MapReduce – downsides

• The model is easy to use
    People tend to try the simplest approach first.

• Many problems are expressible as
  MapReduce computations
    MapReduce may not be the best model for you.

• MapReduce scales to large clusters
    It is so easy to overload the cluster with simple code.
More elaborated examples

• Distributed PageRank
• Distributed Dijkstra's algorithm (almost)


 Lectures to Google software engineering interns,
 Summer 2007
 http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
Google PageRank today?

• Aug, 2009: Google moved away from
  MapReduce back-end indexing system
  onto a new search architecture, a.k.a.
  Caffeine.

 http://en.wikipedia.org/wiki/Google_Search#Google_Caffeine
Google Maps?

• “In particular, for large road networks it
  would be prohibitive to precompute and
  store shortest paths between all pairs of
  nodes.”

  Engineering Fast Route Planning Algorithms, by Peter
  Sanders and Dominik Schultes, 2007
  http://algo2.iti.kit.edu/documents/routeplanning/weaOverview.pdf
Demand for Real-Time data

• MapReduce batch oriented processing
  of (large) data does not fit well into
  growing demand for real-time data.

• Hybrid approach?
 Ted Dunning on Twitter's Storm:
 http://info.mapr.com/ted-storm-2012-03.html
 http://www.youtube.com/channel/UCDbTR_Z_k-
 EZ4e3JpG9zmhg
Mahout

• The goal is to build open source
  scalable machine learning libraries.

 Started by: Isabel Drost, Grant Ingersoll, Karl Wettin

 Map-Reduce for Machine Learning on Multicore, 2006
 http://www.cs.stanford.edu/people/ang//papers/nips06-
 mapreducemulticore.pdf
Implemented Algorithms
• Classification
• Clustering
• Pattern Mining
• Regression
• Dimension Reduction
• Evolutionary Algorithms
• Recommenders / Collaborative Filtering
• Vector Similarity
• ...
Back to Hadoop

    HDFS
HDFS

• Very large files
     Up to GB and PT.
• Streaming data access
     Write once, read many times.
     Optimized for high data throughput.
     Read involves large portion of the data.
• Commodity HW
     Clusters made of cheap and low reliable machines.
     Chance of failure of individual machine is high.
HDFS – don't!

• Low-latency data access
     HBase is better for low-latency access.
• Lots of small files
     NameNode memory limit.
• Multiple writes and file modifications
     Single file writer.
     Write at the end of file.
HDFS: High Availability

• Currently being added to the trunk

 http://www.cloudera.com/blog/2012/03/high-availability-for-
 the-hadoop-distributed-file-system-hdfs/
HDFS: Security & File Appends

• Finally available as well, but probably in
  different branches.

  http://www.cloudera.com/blog/2012/01/an-update-on-
  apache-hadoop-1-0/

  http://www.cloudera.com/blog/2009/07/file-appends-in-
  hdfs/
Improved POSIX support

• Available from third party vendors
  (for example MapR M3 or M5 edition)
HBase

• Non-relational, auto re-balancing, fault
  tolerant distributed database
• Modeled after Google BigTable
• Initial prototype in 2007
• Canonical use case webtable

 Bigtable: A Distributed Storage System for Structured Data, many authors,
 Google labs (2006)
HBase Conceptual view




            Copyright © 2011, Lars George. All rights reserved.
HBase

• Basic operations:
      Get, Put, Scan, Delete
• A {row, column, version} identify a cell
• Allows run Hadoop's MapReduce jobs
• Optimized for high throughput
Use Case: Real-time HBase
          Analytics
• Nice use case for real-time analysis by
  Sematext

 http://blog.sematext.com/2012/04/22/hbase-real-time-
 analytics-rollbacks-via-append-based-updates/

 http://blog.sematext.com/2012/04/27/hbase-real-time-
 analytics-rollbacks-via-append-based-updates-part-2/
Use Case: Messaging Platform

• Facebook implemented messaging
  system using HBase

 http://www.facebook.com/notes/facebook-engineering/the-
 underlying-technology-of-messages/454991608919
Hadoop, Mahout and/or HBase
        is used by...
• Amazon (A9), Adobe, Ebay, Facebook,
  Google (university program), IBM,
  Infochimps, Krugle, Last.fm, LinkedIn,
  Microsoft, Rackspace, RapLeaf, Spotify,
  StumbleUpon, Twitter, Yahoo!

 … many more!
More Resources

• Hadoop: http://hadoop.apache.org/
• Mahout: http://mahout.apache.org/
• HBase:
  http://hbase.apache.org/book/book.html
Thank you!
Photo Sources
•   http://www.flickr.com/photos/renwest/4909849477/ By renwest, CC BY-NC-SA 2.0
•   http://www.flickr.com/photos/asianartsandiego/4838273718/ By Asian Curator at The
    San Diego Museum of Art, CC BY-NC-ND 2.0
•   http://www.flickr.com/photos/zeepack/2932405424/ By ZeePack, CC BY-ND 2.0
•   http://www.flickr.com/photos/16516252@N00/3132303565/ By blueboy1478, CC BY-
    ND 2.0
Backup Slides: Anatomy of
    MapReduce Execution
• http://code.google.com/edu/parallel/map
  reduce-tutorial.html#MRExec
Backup Slides: HDFS
          Architecture
• http://hadoop.apache.org/common/docs
  /current/hdfs_design.html

More Related Content

What's hot

Recommendation Engine Powered by Hadoop
Recommendation Engine Powered by HadoopRecommendation Engine Powered by Hadoop
Recommendation Engine Powered by HadoopPranab Ghosh
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 
Quick Understanding of NoSQL
Quick Understanding of NoSQLQuick Understanding of NoSQL
Quick Understanding of NoSQLEdward Yoon
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesAnalyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesSrivatsan Ramanujam
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learningMehdi Shibahara
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MDDonald Miner
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming DataGeoffrey Fox
 
Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Edward Yoon
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 

What's hot (20)

All thingspython@pivotal
All thingspython@pivotalAll thingspython@pivotal
All thingspython@pivotal
 
Recommendation Engine Powered by Hadoop
Recommendation Engine Powered by HadoopRecommendation Engine Powered by Hadoop
Recommendation Engine Powered by Hadoop
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 
Quick Understanding of NoSQL
Quick Understanding of NoSQLQuick Understanding of NoSQL
Quick Understanding of NoSQL
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Hadoop
HadoopHadoop
Hadoop
 
Analyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity FuturesAnalyzing Power of Tweets in Predicting Commodity Futures
Analyzing Power of Tweets in Predicting Commodity Futures
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Distributed deep learning
Distributed deep learningDistributed deep learning
Distributed deep learning
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
EDHREC @ Data Science MD
EDHREC @ Data Science MDEDHREC @ Data Science MD
EDHREC @ Data Science MD
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
High Performance Processing of Streaming Data
High Performance Processing of Streaming DataHigh Performance Processing of Streaming Data
High Performance Processing of Streaming Data
 
Introduction to apache horn (incubating)
Introduction to apache horn (incubating)Introduction to apache horn (incubating)
Introduction to apache horn (incubating)
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 

Viewers also liked

OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningRobin Anil
 
HBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large MatricesHBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large MatricesCloudera, Inc.
 
China construction quality testing industry market forecast and competition s...
China construction quality testing industry market forecast and competition s...China construction quality testing industry market forecast and competition s...
China construction quality testing industry market forecast and competition s...Qianzhan Intelligence
 
"A Single Man": Choosing Life in a Nietzschean Context
"A Single Man": Choosing Life in a Nietzschean Context"A Single Man": Choosing Life in a Nietzschean Context
"A Single Man": Choosing Life in a Nietzschean ContextYoav Francis
 
Few words about happiness (Polish talk) / O szczęściu słów kilka
Few words about happiness (Polish talk) / O szczęściu słów kilkaFew words about happiness (Polish talk) / O szczęściu słów kilka
Few words about happiness (Polish talk) / O szczęściu słów kilkaTomek Borek
 
"Школа дошколят" МБОУ «ЦО №23»
"Школа дошколят" МБОУ «ЦО №23»"Школа дошколят" МБОУ «ЦО №23»
"Школа дошколят" МБОУ «ЦО №23»shlyop
 
Backup als Dienstleistung verkaufen Henning Meyer
Backup als Dienstleistung verkaufen   Henning MeyerBackup als Dienstleistung verkaufen   Henning Meyer
Backup als Dienstleistung verkaufen Henning MeyerMAX2014DACH
 
China luxury industry market demand and investment forecast report, 2013 2017
China luxury industry market demand and investment forecast report, 2013 2017China luxury industry market demand and investment forecast report, 2013 2017
China luxury industry market demand and investment forecast report, 2013 2017Qianzhan Intelligence
 
China animal husbandry indepth research and investment forecast report
China animal husbandry indepth research and investment forecast reportChina animal husbandry indepth research and investment forecast report
China animal husbandry indepth research and investment forecast reportQianzhan Intelligence
 
Angular 2 Crash Course with TypeScript
Angular 2 Crash Course with TypeScriptAngular 2 Crash Course with TypeScript
Angular 2 Crash Course with TypeScriptayman diab
 
Les cahiers de l’ant Créer et/ou animer votre page Facebook
Les cahiers de l’ant Créer et/ou animer votre page FacebookLes cahiers de l’ant Créer et/ou animer votre page Facebook
Les cahiers de l’ant Créer et/ou animer votre page FacebookEmilie Rochat
 

Viewers also liked (16)

mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
 
HBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large MatricesHBaseCon 2013: Using Apache HBase for Large Matrices
HBaseCon 2013: Using Apache HBase for Large Matrices
 
China construction quality testing industry market forecast and competition s...
China construction quality testing industry market forecast and competition s...China construction quality testing industry market forecast and competition s...
China construction quality testing industry market forecast and competition s...
 
"A Single Man": Choosing Life in a Nietzschean Context
"A Single Man": Choosing Life in a Nietzschean Context"A Single Man": Choosing Life in a Nietzschean Context
"A Single Man": Choosing Life in a Nietzschean Context
 
Few words about happiness (Polish talk) / O szczęściu słów kilka
Few words about happiness (Polish talk) / O szczęściu słów kilkaFew words about happiness (Polish talk) / O szczęściu słów kilka
Few words about happiness (Polish talk) / O szczęściu słów kilka
 
THN presents13016
THN presents13016THN presents13016
THN presents13016
 
"Школа дошколят" МБОУ «ЦО №23»
"Школа дошколят" МБОУ «ЦО №23»"Школа дошколят" МБОУ «ЦО №23»
"Школа дошколят" МБОУ «ЦО №23»
 
ara
araara
ara
 
Curso Antena3 TV
Curso Antena3 TVCurso Antena3 TV
Curso Antena3 TV
 
Backup als Dienstleistung verkaufen Henning Meyer
Backup als Dienstleistung verkaufen   Henning MeyerBackup als Dienstleistung verkaufen   Henning Meyer
Backup als Dienstleistung verkaufen Henning Meyer
 
China luxury industry market demand and investment forecast report, 2013 2017
China luxury industry market demand and investment forecast report, 2013 2017China luxury industry market demand and investment forecast report, 2013 2017
China luxury industry market demand and investment forecast report, 2013 2017
 
China animal husbandry indepth research and investment forecast report
China animal husbandry indepth research and investment forecast reportChina animal husbandry indepth research and investment forecast report
China animal husbandry indepth research and investment forecast report
 
Angular 2 Crash Course with TypeScript
Angular 2 Crash Course with TypeScriptAngular 2 Crash Course with TypeScript
Angular 2 Crash Course with TypeScript
 
Les cahiers de l’ant Créer et/ou animer votre page Facebook
Les cahiers de l’ant Créer et/ou animer votre page FacebookLes cahiers de l’ant Créer et/ou animer votre page Facebook
Les cahiers de l’ant Créer et/ou animer votre page Facebook
 
Population
PopulationPopulation
Population
 

Similar to Introduction to Hadoop, Mahout & HBase

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDYVenneladonthireddy1
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveSharjeel Imtiaz
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - HadoopTalentica Software
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 

Similar to Introduction to Hadoop, Mahout & HBase (20)

Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Hadoop
HadoopHadoop
Hadoop
 
Anju
AnjuAnju
Anju
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
02 Hadoop.pptx HADOOP VENNELA DONTHIREDDY
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Hadoop seminar
Hadoop seminarHadoop seminar
Hadoop seminar
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
hadoop
hadoophadoop
hadoop
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 

More from Lukas Vlcek

Elasticsearch Monitoring in Openshift
Elasticsearch Monitoring in OpenshiftElasticsearch Monitoring in Openshift
Elasticsearch Monitoring in OpenshiftLukas Vlcek
 
JBug_React_and_Flux_2015
JBug_React_and_Flux_2015JBug_React_and_Flux_2015
JBug_React_and_Flux_2015Lukas Vlcek
 
Elasticsearch @JBoss.org, 2014
Elasticsearch @JBoss.org, 2014Elasticsearch @JBoss.org, 2014
Elasticsearch @JBoss.org, 2014Lukas Vlcek
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearchLukas Vlcek
 
Compass Framework
Compass FrameworkCompass Framework
Compass FrameworkLukas Vlcek
 

More from Lukas Vlcek (7)

Elasticsearch Monitoring in Openshift
Elasticsearch Monitoring in OpenshiftElasticsearch Monitoring in Openshift
Elasticsearch Monitoring in Openshift
 
JBug_React_and_Flux_2015
JBug_React_and_Flux_2015JBug_React_and_Flux_2015
JBug_React_and_Flux_2015
 
Elasticsearch @JBoss.org, 2014
Elasticsearch @JBoss.org, 2014Elasticsearch @JBoss.org, 2014
Elasticsearch @JBoss.org, 2014
 
Building search app with ElasticSearch
Building search app with ElasticSearchBuilding search app with ElasticSearch
Building search app with ElasticSearch
 
Elastic Search
Elastic SearchElastic Search
Elastic Search
 
JBoss Snowdrop
JBoss SnowdropJBoss Snowdrop
JBoss Snowdrop
 
Compass Framework
Compass FrameworkCompass Framework
Compass Framework
 

Recently uploaded

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 

Recently uploaded (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 

Introduction to Hadoop, Mahout & HBase

  • 1. An Introduction to Hadoop, Mahout & HBase Lukáš Vlček, JBUG Brno, May 2012
  • 2.
  • 3.
  • 4. Hadoop • Open Source (ASL2) implementation of Google's MapReduce[1] and Google DFS (Distributed File System) [2] (~from 2006 Lucene subproject) [1] MapReduce: Simplified Data Processing on Large Scale Clusters by Jeffrey Dean and Sanjay Ghemawat, Google labs, 2004 [2] The Google File System by Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, 2003
  • 5. Hadoop • MapReduce Simple programming model for data processing putting parallel large-scale data analysis into hands of masses. • HDFS A filesystem designed for storing large files with streaming data access patterns, running on clusters of commodity hardware. • (Common, + other related projects...)
  • 6. MapReduce programming model map (k1,v1) → list(k2,v2) reduce (k2,list(v2)) → list(v3)
  • 7. MapReduce “Hello World” • Counting the number of occurrences of each word in collection of documents.
  • 8. MapReduce “Hello World” map(k1,v1) → list(k2,v2) map(key, value){ // key: document name // value: document content for each word w in value { emitIntermediate(w,1) } }
  • 9. MapReduce “Hello World” reduce(k2,list(v2)) → list(v3) reduce(key, values){ // key: a word // value: a list of counts int result = 0; for each word v in values { result += v; } emit(result); }
  • 10. MapReduce – benefits • The model is easy to use Steep learning curve. • Many problems are expressible as MapReduce computations • MapReduce scales to large clusters Such model makes it easy to parallelize and distribute your computation to thousands of machines!
  • 11. MapReduce – downsides • The model is easy to use People tend to try the simplest approach first. • Many problems are expressible as MapReduce computations MapReduce may not be the best model for you. • MapReduce scales to large clusters It is so easy to overload the cluster with simple code.
  • 12. More elaborated examples • Distributed PageRank • Distributed Dijkstra's algorithm (almost) Lectures to Google software engineering interns, Summer 2007 http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
  • 13. Google PageRank today? • Aug, 2009: Google moved away from MapReduce back-end indexing system onto a new search architecture, a.k.a. Caffeine. http://en.wikipedia.org/wiki/Google_Search#Google_Caffeine
  • 14. Google Maps? • “In particular, for large road networks it would be prohibitive to precompute and store shortest paths between all pairs of nodes.” Engineering Fast Route Planning Algorithms, by Peter Sanders and Dominik Schultes, 2007 http://algo2.iti.kit.edu/documents/routeplanning/weaOverview.pdf
  • 15. Demand for Real-Time data • MapReduce batch oriented processing of (large) data does not fit well into growing demand for real-time data. • Hybrid approach? Ted Dunning on Twitter's Storm: http://info.mapr.com/ted-storm-2012-03.html http://www.youtube.com/channel/UCDbTR_Z_k- EZ4e3JpG9zmhg
  • 16.
  • 17.
  • 18.
  • 19. Mahout • The goal is to build open source scalable machine learning libraries. Started by: Isabel Drost, Grant Ingersoll, Karl Wettin Map-Reduce for Machine Learning on Multicore, 2006 http://www.cs.stanford.edu/people/ang//papers/nips06- mapreducemulticore.pdf
  • 20. Implemented Algorithms • Classification • Clustering • Pattern Mining • Regression • Dimension Reduction • Evolutionary Algorithms • Recommenders / Collaborative Filtering • Vector Similarity • ...
  • 22. HDFS • Very large files Up to GB and PT. • Streaming data access Write once, read many times. Optimized for high data throughput. Read involves large portion of the data. • Commodity HW Clusters made of cheap and low reliable machines. Chance of failure of individual machine is high.
  • 23. HDFS – don't! • Low-latency data access HBase is better for low-latency access. • Lots of small files NameNode memory limit. • Multiple writes and file modifications Single file writer. Write at the end of file.
  • 24. HDFS: High Availability • Currently being added to the trunk http://www.cloudera.com/blog/2012/03/high-availability-for- the-hadoop-distributed-file-system-hdfs/
  • 25. HDFS: Security & File Appends • Finally available as well, but probably in different branches. http://www.cloudera.com/blog/2012/01/an-update-on- apache-hadoop-1-0/ http://www.cloudera.com/blog/2009/07/file-appends-in- hdfs/
  • 26. Improved POSIX support • Available from third party vendors (for example MapR M3 or M5 edition)
  • 27.
  • 28.
  • 29. HBase • Non-relational, auto re-balancing, fault tolerant distributed database • Modeled after Google BigTable • Initial prototype in 2007 • Canonical use case webtable Bigtable: A Distributed Storage System for Structured Data, many authors, Google labs (2006)
  • 30. HBase Conceptual view Copyright © 2011, Lars George. All rights reserved.
  • 31. HBase • Basic operations: Get, Put, Scan, Delete • A {row, column, version} identify a cell • Allows run Hadoop's MapReduce jobs • Optimized for high throughput
  • 32. Use Case: Real-time HBase Analytics • Nice use case for real-time analysis by Sematext http://blog.sematext.com/2012/04/22/hbase-real-time- analytics-rollbacks-via-append-based-updates/ http://blog.sematext.com/2012/04/27/hbase-real-time- analytics-rollbacks-via-append-based-updates-part-2/
  • 33. Use Case: Messaging Platform • Facebook implemented messaging system using HBase http://www.facebook.com/notes/facebook-engineering/the- underlying-technology-of-messages/454991608919
  • 34. Hadoop, Mahout and/or HBase is used by... • Amazon (A9), Adobe, Ebay, Facebook, Google (university program), IBM, Infochimps, Krugle, Last.fm, LinkedIn, Microsoft, Rackspace, RapLeaf, Spotify, StumbleUpon, Twitter, Yahoo! … many more!
  • 35. More Resources • Hadoop: http://hadoop.apache.org/ • Mahout: http://mahout.apache.org/ • HBase: http://hbase.apache.org/book/book.html
  • 37. Photo Sources • http://www.flickr.com/photos/renwest/4909849477/ By renwest, CC BY-NC-SA 2.0 • http://www.flickr.com/photos/asianartsandiego/4838273718/ By Asian Curator at The San Diego Museum of Art, CC BY-NC-ND 2.0 • http://www.flickr.com/photos/zeepack/2932405424/ By ZeePack, CC BY-ND 2.0 • http://www.flickr.com/photos/16516252@N00/3132303565/ By blueboy1478, CC BY- ND 2.0
  • 38. Backup Slides: Anatomy of MapReduce Execution • http://code.google.com/edu/parallel/map reduce-tutorial.html#MRExec
  • 39. Backup Slides: HDFS Architecture • http://hadoop.apache.org/common/docs /current/hdfs_design.html