SlideShare ist ein Scribd-Unternehmen logo
1 von 15
Downloaden Sie, um offline zu lesen
R and Hadoop

    Ram Venkat
   Dawn Analytics
What is Hadoop?
•   Hadoop is an open source Apache software for
    running distributed applications on 'big data'
•   It contains a distributed file system (HDFS) and a
    parallel processing 'batch' framework
•   Hadoop is written in java, runs on unix/linux for
    development and production
•   Windows and Mac can be used as development
    platform
•   Yahoo has > 43000 nodes hadoop cluster and
    Facebook has over 100 PB(PB= 1 M GB) of data in
    hadoop clusters
Hadoop overview (1/2)
     Central Idea: Moving computation to data and
    compute across nodes in parallel
•   Data Loading
Hadoop overview (2/2)
Parallel Computation: MapReduce
Map Reduce : Example 'Hello word'
•   Mathematically, this is what MapReduce is about:
    —map (k1, v1) ➔ list(k2, v2)
    —reduce(k2, list(v2)) ➔ list(<k3, v3>)
•   Implementation of the 'hello word' (word count):
    Mapper: K1 -> file name, v1 -> text of the file
             K2 -> word, V2 -> “1”
    Reducer: Sums up the '1' s and produces a list of
    words and their counts
Word Count (slide from Yahoo)
R libraries to work with Hadoop
•    'Hadoop Streaming' - An alternative to the Java
     MapReduce API
•    Hadoop Streaming allows you to write jobs in any
     language supporting stdin/stdout.
•    R has several libraries/ways that help you to work with
     Hadoop:
      – Write your mapper.R and reducer.R and run a shell
        script
      – 'rmr' and 'rhadoop' from revolution analytics
     – 'rhipe' from Purdue University statistical computing
     – 'RHive' interacts with Hadoop via Hive query
Word Count Demo with R(rmr)
  mapper.wordcount = function(key, val) {

                lapply(
                  strsplit( x = val, split = " ")[[1]],
                  function(w) keyval(w,1)
                )
  }

   reducer.wordcount = function(key, val.list) {
             output.key = key
             output.val = sum(unlist(val.list))
                return( keyval(output.key, output.val) )
   }
More advanced example –
    Sentiment Analysis in R(rmr)
•   One area where Hadoop could help out traders is in
    sentiment analysis
•   Oreilly Strata blog 'Trading on sentiment' :
    http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html

•   Demo2 is modified code from this example from Jeffrey
    Breen on airlines sentiment analysis :
    http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/

•   Jeffrey has been very active in R groups in Chicago
    area, This is another tutorial last month on R and
    Hadoop by Jeffrey : http://www.slideshare.net/jeffreybreen/getting-started-
    with-r-hadoop
Demo2 Sentiment Analysis with rmr
 mapper.score = function(key, val) {

 # clean up tweets with R's regex-driven global substitute, gsub():
   val = gsub('[[:punct:]]', '', val)
   val = gsub('[[:cntrl:]]', '', val)
   val = gsub('d+', '', val)

 # Key is the Airline we added as tag to the tweets
   airline = substr(val,1,2)

 # Run the sentiment analysis
   output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words))

 # our interest is in computing the counts by airlines and scores, so 'this' count is 1
   output.val = 1
   return( keyval(output.key, output.val) )
 }
Demo3 - Hive

•   Hive is a data warehousing infrastructure for Hadoop
•   Provides a familiar SQL like interface to create tables,
    insert and query data
•   Behind the scene , it implements map-reduce
•    Hive is an alternative to our hadoop streaming we
    covered before
•   Demo3 – stock query with Hive
Use cases for Traders
•    Stock sentiment analysis
•    Stock trading pattern analysis
•    Default prediction
•    Fraud/anomoly detection
•    NextGen data warehousing
Hadoop support - Cloudera
•   Cloudera distribution of hadoop is one of the most
    popular distribution (I used their CDH3v5 in my 2
    demos above)
•   Doug Cutting, the creator of Hadoop is the architect
    with Cloudera
•   Adam Muise , a Cloudera engineer at Toronto is the
    organizer of Toronto Hadoop user Group (TOHUG)
•   Upcoming meeting organized by TOHUG on the 30th
    October - “PIG-fest”
Hadoop Tutorials and Books
•   http://hadoop.apache.org/docs/r0.20.2/quickstart.html
•   Cloudera: http://university.cloudera.com/
•   Book: “Hadoop in Action” – Manning
•   Book: “Hadoop - The Definitive Guide” – Oreilly
•   Hadoop Streaming:
    http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html
•   Google Code University:
    http://code.google.com/edu/parallel/mapreduce-tutorial.html
•   Yahoo's Tutorial :
    http://developer.yahoo.com/hadoop/tutorial/module1.html
Thank You

For any clarification, send e-mail to
ram@dawnanalytics.com

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first classalogarg
 
Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014
Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014
Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014soujavajug
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveMike Frampton
 
Hadoop course curriculm
Hadoop course curriculm Hadoop course curriculm
Hadoop course curriculm alogarg
 
Big data and tools
Big data and tools Big data and tools
Big data and tools Shivam Shukla
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed ComputingFederico Cargnelutti
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemRajkumar Singh
 
R for data analytics
R for data analyticsR for data analytics
R for data analyticsVijayMohan Vasu
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache HadoopKMS Technology
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to SchoolAdam Doyle
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Sparkdatamantra
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopRevolution Analytics
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Senthil Kumar
 

Was ist angesagt? (20)

Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Hadoop eco system-first class
Hadoop eco system-first classHadoop eco system-first class
Hadoop eco system-first class
 
Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014
Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014
Hadoop - Introduction to map reduce programming - ReuniĂŁo 12/04/2014
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
An introduction to Apache Hadoop Hive
An introduction to Apache Hadoop HiveAn introduction to Apache Hadoop Hive
An introduction to Apache Hadoop Hive
 
Hadoop course curriculm
Hadoop course curriculm Hadoop course curriculm
Hadoop course curriculm
 
Big data and tools
Big data and tools Big data and tools
Big data and tools
 
Hadoop
Hadoop Hadoop
Hadoop
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Big Data and Hadoop Ecosystem
Big Data and Hadoop EcosystemBig Data and Hadoop Ecosystem
Big Data and Hadoop Ecosystem
 
R for data analytics
R for data analyticsR for data analytics
R for data analytics
 
An Introduction of Apache Hadoop
An Introduction of Apache HadoopAn Introduction of Apache Hadoop
An Introduction of Apache Hadoop
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
High Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and HadoopHigh Performance Predictive Analytics in R and Hadoop
High Performance Predictive Analytics in R and Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview Hadoop Ecosystem Architecture Overview
Hadoop Ecosystem Architecture Overview
 

Ă„hnlich wie R and Hadoop: An Introduction to Using R with Hadoop for Big Data Analysis

Hive paris
Hive parisHive paris
Hive parisSzehon Ho
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User GroupCsaba Toth
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Getting started big data
Getting started big dataGetting started big data
Getting started big dataKibrom Gebrehiwot
 
Data Science
Data ScienceData Science
Data ScienceSubhajit75
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & HadoopJeffrey Breen
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop PrimerSteve Staso
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshopPurna Chander
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionMapR Technologies
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014rpbrehm
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architectureHarikrishnan K
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?rhatr
 

Ă„hnlich wie R and Hadoop: An Introduction to Using R with Hadoop for Big Data Analysis (20)

Hive paris
Hive parisHive paris
Hive paris
 
Hive and Pig for .NET User Group
Hive and Pig for .NET User GroupHive and Pig for .NET User Group
Hive and Pig for .NET User Group
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Getting started big data
Getting started big dataGetting started big data
Getting started big data
 
Data Science
Data ScienceData Science
Data Science
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Getting started with R & Hadoop
Getting started with R & HadoopGetting started with R & Hadoop
Getting started with R & Hadoop
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 
Hadoop workshop
Hadoop workshopHadoop workshop
Hadoop workshop
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Hadoop
HadoopHadoop
Hadoop
 
Webinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop SolutionWebinar: Selecting the Right SQL-on-Hadoop Solution
Webinar: Selecting the Right SQL-on-Hadoop Solution
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Apache hadoop introduction and architecture
Apache hadoop  introduction and architectureApache hadoop  introduction and architecture
Apache hadoop introduction and architecture
 
Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?Apache Spark: killer or savior of Apache Hadoop?
Apache Spark: killer or savior of Apache Hadoop?
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 

KĂĽrzlich hochgeladen

Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptxPoojaSen20
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 

KĂĽrzlich hochgeladen (20)

Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
PSYCHIATRIC History collection FORMAT.pptx
PSYCHIATRIC   History collection FORMAT.pptxPSYCHIATRIC   History collection FORMAT.pptx
PSYCHIATRIC History collection FORMAT.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 

R and Hadoop: An Introduction to Using R with Hadoop for Big Data Analysis

  • 1. R and Hadoop Ram Venkat Dawn Analytics
  • 2. What is Hadoop? • Hadoop is an open source Apache software for running distributed applications on 'big data' • It contains a distributed file system (HDFS) and a parallel processing 'batch' framework • Hadoop is written in java, runs on unix/linux for development and production • Windows and Mac can be used as development platform • Yahoo has > 43000 nodes hadoop cluster and Facebook has over 100 PB(PB= 1 M GB) of data in hadoop clusters
  • 3. Hadoop overview (1/2) Central Idea: Moving computation to data and compute across nodes in parallel • Data Loading
  • 4. Hadoop overview (2/2) Parallel Computation: MapReduce
  • 5. Map Reduce : Example 'Hello word' • Mathematically, this is what MapReduce is about: —map (k1, v1) âž” list(k2, v2) —reduce(k2, list(v2)) âž” list(<k3, v3>) • Implementation of the 'hello word' (word count): Mapper: K1 -> file name, v1 -> text of the file K2 -> word, V2 -> “1” Reducer: Sums up the '1' s and produces a list of words and their counts
  • 6. Word Count (slide from Yahoo)
  • 7. R libraries to work with Hadoop • 'Hadoop Streaming' - An alternative to the Java MapReduce API • Hadoop Streaming allows you to write jobs in any language supporting stdin/stdout. • R has several libraries/ways that help you to work with Hadoop: – Write your mapper.R and reducer.R and run a shell script – 'rmr' and 'rhadoop' from revolution analytics – 'rhipe' from Purdue University statistical computing – 'RHive' interacts with Hadoop via Hive query
  • 8. Word Count Demo with R(rmr) mapper.wordcount = function(key, val) { lapply( strsplit( x = val, split = " ")[[1]], function(w) keyval(w,1) ) } reducer.wordcount = function(key, val.list) { output.key = key output.val = sum(unlist(val.list)) return( keyval(output.key, output.val) ) }
  • 9. More advanced example – Sentiment Analysis in R(rmr) • One area where Hadoop could help out traders is in sentiment analysis • Oreilly Strata blog 'Trading on sentiment' : http://strata.oreilly.com/2011/05/sentiment-analysis-finance.html • Demo2 is modified code from this example from Jeffrey Breen on airlines sentiment analysis : http://jeffreybreen.wordpress.com/2011/07/04/twitter-text-mining-r-slides/ • Jeffrey has been very active in R groups in Chicago area, This is another tutorial last month on R and Hadoop by Jeffrey : http://www.slideshare.net/jeffreybreen/getting-started- with-r-hadoop
  • 10. Demo2 Sentiment Analysis with rmr mapper.score = function(key, val) { # clean up tweets with R's regex-driven global substitute, gsub(): val = gsub('[[:punct:]]', '', val) val = gsub('[[:cntrl:]]', '', val) val = gsub('d+', '', val) # Key is the Airline we added as tag to the tweets airline = substr(val,1,2) # Run the sentiment analysis output.key = c(as.character(airline), score.sentiment(val,pos.words,neg.words)) # our interest is in computing the counts by airlines and scores, so 'this' count is 1 output.val = 1 return( keyval(output.key, output.val) ) }
  • 11. Demo3 - Hive • Hive is a data warehousing infrastructure for Hadoop • Provides a familiar SQL like interface to create tables, insert and query data • Behind the scene , it implements map-reduce • Hive is an alternative to our hadoop streaming we covered before • Demo3 – stock query with Hive
  • 12. Use cases for Traders • Stock sentiment analysis • Stock trading pattern analysis • Default prediction • Fraud/anomoly detection • NextGen data warehousing
  • 13. Hadoop support - Cloudera • Cloudera distribution of hadoop is one of the most popular distribution (I used their CDH3v5 in my 2 demos above) • Doug Cutting, the creator of Hadoop is the architect with Cloudera • Adam Muise , a Cloudera engineer at Toronto is the organizer of Toronto Hadoop user Group (TOHUG) • Upcoming meeting organized by TOHUG on the 30th October - “PIG-fest”
  • 14. Hadoop Tutorials and Books • http://hadoop.apache.org/docs/r0.20.2/quickstart.html • Cloudera: http://university.cloudera.com/ • Book: “Hadoop in Action” – Manning • Book: “Hadoop - The Definitive Guide” – Oreilly • Hadoop Streaming: http://hadoop.apache.org/docs/mapreduce/r0.21.0/streaming.html • Google Code University: http://code.google.com/edu/parallel/mapreduce-tutorial.html • Yahoo's Tutorial : http://developer.yahoo.com/hadoop/tutorial/module1.html
  • 15. Thank You For any clarification, send e-mail to ram@dawnanalytics.com