R and Hadoop

    Ram Venkat
   Dawn Analytics
What is Hadoop?
•   Hadoop is open-source Apache software for
    running distributed applications on 'big data'
•   It contains a distributed file system (HDFS) and a
    parallel 'batch' processing framework
•   Hadoop is written in Java and runs on Unix/Linux for
    development and production
•   Windows and Mac can be used as development
    platforms
•   Yahoo runs Hadoop on more than 43,000 nodes, and
    Facebook has over 100 PB (1 PB = 1 million GB) of data in
    Hadoop clusters
Hadoop overview (1/2)
     Central idea: move the computation to the data and
    compute across nodes in parallel
•   Data Loading
Hadoop overview (2/2)
Parallel Computation: MapReduce
MapReduce: Example 'Hello World'
•   Mathematically, this is what MapReduce is about:
    – map(k1, v1) ➔ list(<k2, v2>)
    – reduce(k2, list(v2)) ➔ list(<k3, v3>)
•   Implementation of the 'Hello World' example (word count):
    Mapper: k1 -> file name, v1 -> text of the file
            k2 -> word, v2 -> 1
    Reducer: sums the 1s for each word and produces a list of
    words and their counts
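The map and reduce signatures above can be tried out on a single machine before touching a cluster. The following is a minimal sketch in base R, with no Hadoop involved; the function names and the in-memory "shuffle" step are illustrative assumptions, not part of the original demo.

```r
# Local illustration of map -> shuffle -> reduce for word count.
# Base R only; all names below are hypothetical.

# map: (file name, text) -> list of (word, 1) pairs
map_wordcount <- function(k1, v1) {
  words <- strsplit(v1, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, value = 1L))
}

# reduce: (word, list of 1s) -> (word, total)
reduce_wordcount <- function(k2, values) {
  list(key = k2, value = sum(unlist(values)))
}

run_wordcount <- function(files) {
  # Map phase: apply the mapper to every (file name, text) pair
  # and flatten the per-file results into one list of pairs.
  pairs <- unlist(
    Map(map_wordcount, names(files), files),
    recursive = FALSE, use.names = FALSE
  )
  # Shuffle phase: group the emitted values by key.
  keys <- vapply(pairs, function(p) p$key, character(1))
  values <- lapply(pairs, function(p) p$value)
  grouped <- split(values, keys)
  # Reduce phase: one reducer call per distinct word.
  reduced <- Map(reduce_wordcount, names(grouped), grouped)
  vapply(reduced, function(r) r$value, numeric(1))
}

counts <- run_wordcount(list(
  a.txt = "hello world hello",
  b.txt = "world of hadoop"
))
print(counts)
```

On a real cluster the shuffle is done by Hadoop between the two phases, and the mapper calls run in parallel on the nodes that hold the data.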
Word Count (slide from Yahoo)
R libraries to work with Hadoop
•    'Hadoop Streaming' is an alternative to the Java
     MapReduce API
•    Hadoop Streaming allows you to write jobs in any
     language supporting stdin/stdout.
•    R offers several ways to work with
     Hadoop:
      – Write your own mapper.R and reducer.R and run them
        from a shell script
      – 'rmr' and 'rhadoop' from Revolution Analytics
      – 'rhipe' from Purdue University statistical computing
      – 'RHive', which interacts with Hadoop via Hive queries
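For the first option (a hand-written mapper.R and reducer.R), the core logic might look like the sketch below. It assumes the usual Streaming convention of one record per line with key and value separated by a tab; the function names `map_lines` and `reduce_lines` are hypothetical and were not part of the original demo.

```r
# Sketch of the logic behind a Hadoop Streaming mapper.R / reducer.R pair.
# Assumption: records are text lines, key and value separated by a tab.

# Mapper logic: every word in the input becomes one "word<TAB>1" line.
map_lines <- function(lines) {
  words <- unlist(strsplit(lines, "[[:space:]]+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  paste(words, 1, sep = "\t")
}

# Reducer logic: sum the counts per word from "word<TAB>count" lines.
reduce_lines <- function(lines) {
  if (length(lines) == 0) return(character(0))
  parts <- strsplit(lines, "\t", fixed = TRUE)
  words <- vapply(parts, `[`, character(1), 1)
  counts <- as.numeric(vapply(parts, `[`, character(1), 2))
  totals <- tapply(counts, words, sum)
  paste(names(totals), totals, sep = "\t")
}

# In an actual mapper.R script the wiring to stdin/stdout could be:
#   writeLines(map_lines(readLines(file("stdin"))))
# and in reducer.R:
#   writeLines(reduce_lines(readLines(file("stdin"))))

# Local dry run: sort() stands in for Hadoop's sort between map and reduce.
mapped <- map_lines(c("hello world", "hello hadoop"))
reduced <- reduce_lines(sort(mapped))
print(reduced)
```

Reading all of stdin at once keeps the sketch short; a script for large inputs would more likely read line by line. The two scripts are then passed to the Streaming jar through its `-mapper` and `-reducer` options.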
Word Count Demo with R (rmr)
  # Map: split the text into words and emit a (word, 1) pair per word
  mapper.wordcount = function(key, val) {
    lapply(
      strsplit(x = val, split = " ")[[1]],
      function(w) keyval(w, 1)
    )
  }

  # Reduce: sum the 1s collected for each word
  reducer.wordcount = function(key, val.list) {
    output.key = key
    output.val = sum(unlist(val.list))
    return( keyval(output.key, output.val) )
  }
More advanced example –
    Sentiment Analysis in R (rmr)
•   One area where Hadoop could help traders is
    sentiment analysis
•   O'Reilly Strata blog, 'Trading on sentiment':
    http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html

•   Demo2 is modified from Jeffrey Breen's example
    of airline sentiment analysis:
    http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

•   Jeffrey has been very active in R groups in the Chicago
    area. He also gave another tutorial on R and Hadoop
    last month:
    http://www.slideshare.net/jeffreybreen/getting-started-with-r-hadoop
Demo2 Sentiment Analysis with rmr
 mapper.score = function(key, val) {

 # clean up tweets with R's regex-driven global substitute, gsub():
   val = gsub('[[:punct:]]', '', val)
   val = gsub('[[:cntrl:]]', '', val)
   val = gsub('\\d+', '', val)   # strip digits (the backslashes are required)

 # Key is the airline code we added as a tag at the start of each tweet
   airline = substr(val,1,2)

 # Run the sentiment analysis; c() coerces the numeric score to character
   output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words))

 # our interest is in computing the counts by airline and score, so 'this' count is 1
   output.val = 1
   return( keyval(output.key, output.val) )
 }
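The mapper relies on score.sentiment(), pos.words and neg.words, none of which is defined on the slide. As a rough idea of what such a scorer does, here is a simplified stand-in that counts positive-word matches minus negative-word matches. The function name `score_sentiment_sketch` and the tiny word lists are assumptions made for illustration, not the code used in the demo.

```r
# Simplified stand-in for score.sentiment(); hypothetical, for illustration.
# Score = number of positive-word matches minus number of negative-word matches.
score_sentiment_sketch <- function(text, pos.words, neg.words) {
  words <- tolower(unlist(strsplit(text, "[[:space:]]+")))
  words <- words[nzchar(words)]
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

# Tiny made-up word lists; a real run would load full opinion lexicons.
pos.words <- c("great", "good", "love")
neg.words <- c("awful", "bad", "delay")

# A tweet tagged with a two-letter airline code, as the mapper expects.
tweet <- "UA great crew but awful delay"
airline <- substr(tweet, 1, 2)
score <- score_sentiment_sketch(tweet, pos.words, neg.words)

# Same composite key shape as in the mapper: airline code plus score.
output.key <- c(airline, score)
print(output.key)
```

Because the key combines airline and score, the reducer ends up counting how many tweets each airline received at each sentiment score.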
Demo3 - Hive

•   Hive is a data warehousing infrastructure for Hadoop
•   Provides a familiar SQL-like interface to create tables,
    insert and query data
•   Behind the scenes, it runs MapReduce jobs
•   Hive is an alternative to the Hadoop Streaming approach
    covered earlier
•   Demo3 – stock query with Hive
Use cases for Traders
•    Stock sentiment analysis
•    Stock trading pattern analysis
•    Default prediction
•    Fraud/anomaly detection
•    NextGen data warehousing
Hadoop support - Cloudera
•   Cloudera's distribution of Hadoop is one of the most
    popular distributions (I used their CDH3v5 in my 2
    demos above)
•   Doug Cutting, the creator of Hadoop, is the architect
    at Cloudera
•   Adam Muise, a Cloudera engineer in Toronto, is the
    organizer of the Toronto Hadoop User Group (TOHUG)
•   Upcoming meeting organized by TOHUG on 30
    October: “PIG-fest”
Hadoop Tutorials and Books
•   http://hadoop.apache.org/docs/r0.20.2/quickstart.html
•   Cloudera: http://university.cloudera.com/
•   Book: “Hadoop in Action” – Manning
•   Book: “Hadoop - The Definitive Guide” – O'Reilly
•   Hadoop Streaming:
    http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html
•   Google Code University:
    http://code.google.com/edu/parallel/mapreduce-tutorial.html
•   Yahoo's Tutorial :
    http://developer.yahoo.com/hadoop/tutorial/module1.html
Thank You

For any clarification, send e-mail to
ram@dawnanalytics.com
