Getting Started on Hadoop

Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
http://www.meetup.com/cloudcomputing/calendar/13911740/


Paco Nathan
@pacoid
http://ceteri.blogspot.com/


Examples of Hadoop Streaming, based on Python scripts
running on the AWS Elastic MapReduce service.

• first, a brief history…
• AWS Elastic MapReduce
• “WordCount” example as “Hello World” for MapReduce
• text mining Enron Email Dataset from Infochimps.com
• inverted index, semantic lexicon, social graph
• data visualization using R and Gephi

All source code for this talk is available at:

 http://github.com/ceteri/ceteri-mapred
How Does MapReduce Work?


  map(k1, v1) → list(k2, v2)
  reduce(k2, list(v2)) → list(v3)

Several phases, which partition a problem into many tasks:
 • load data into DFS…
 • map phase: input split → (key, value) pairs, with optional combiner
 • shuffle phase: sort on keys to group pairs… load-test your network!
 • reduce phase: each task receives the values for one key
 • pull data from DFS…
NB: “map” phase is required, the rest are optional.
Think of set operations on tuples (and check out Cascading.org).
Meanwhile, given all those (key, value) pairs listed above, it’s no
wonder that key/value stores have become such a popular topic of
conversation…
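
The phases above can be sketched in a few lines of single-process Python. This is a toy illustration only, not how Hadoop implements them: everything runs in memory, there is no combiner or DFS, and the names `run_mapreduce`, `mapper`, and `reducer` are hypothetical.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy, in-memory walk through the map, shuffle, and reduce phases."""
    # map phase: each (k1, v1) record yields zero or more (k2, v2) pairs
    pairs = []
    for k1, v1 in records:
        pairs.extend(mapper(k1, v1))

    # shuffle phase: group the values by key
    groups = defaultdict(list)
    for k2, v2 in pairs:
        groups[k2].append(v2)

    # reduce phase: each call receives all of the values for one key,
    # with keys visited in sorted order
    results = []
    for k2 in sorted(groups):
        for v3 in reducer(k2, groups[k2]):
            results.append((k2, v3))
    return results


def mapper(doc_id, text):
    # emit (word, 1) for every word in the document
    return [(word, 1) for word in text.lower().split()]


def reducer(word, counts):
    # sum the partial counts for one word
    return [sum(counts)]


if __name__ == "__main__":
    docs = [("doc1", "a rose is a rose"), ("doc2", "a daisy")]
    print(run_mapreduce(docs, mapper, reducer))
```

In a real cluster the grouping step is the expensive part, since every pair may have to cross the network to reach the task that owns its key.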
How Does MapReduce Work?


  map(k1, v1) → list(k2, v2)
  reduce(k2, list(v2)) → list(v3)

The property of data independence among tasks allows for highly
parallel processing… maybe, if the stars are all aligned :)

A MapReduce framework is largely about fault tolerance, and
how to leverage “commodity hardware” to replace “big iron” solutions…

That phrase “big iron” might apply to Oracle + NetApp. Or perhaps an
IBM zSeries mainframe… Or something – expensive, undoubtedly.

Bonus question for self-admitted math geeks: Foresee any concerns
about O(n) complexity, given the functional definitions listed above?
Keep in mind that each phase cannot conclude and progress to the
next phase until after each of its tasks has successfully completed.
A Brief History…

circa 1979 – Stanford, MIT, CMU, etc.
     set/list operations in LISP, Prolog, etc., for parallel processing
    http://www-formal.stanford.edu/jmc/history/lisp/lisp.htm

circa 2004 – Google
     MapReduce: Simplified Data Processing on Large Clusters
     Jeffrey Dean and Sanjay Ghemawat
    http://labs.google.com/papers/mapreduce.html

circa 2006 – Apache
     Hadoop, originating from the Nutch Project
     Doug Cutting
    http://research.yahoo.com/files/cutting.pdf

circa 2008 – Yahoo
     web scale search indexing
     Hadoop Summit, HUG, etc.
    http://developer.yahoo.com/hadoop/

circa 2009 – Amazon AWS
     Elastic MapReduce
     Hadoop modified for EC2/S3, plus support for Hive, Pig, etc.
    http://aws.amazon.com/elasticmapreduce/
Why run Hadoop in AWS?

• elastic: batch jobs on clusters can consume many nodes,
  scalable demand, not 24/7 – great case for using EC2
• commodity hardware: MR is built for fault tolerance, great
  case for leveraging AMIs
• right-sizing: difficult to know a priori how large of a cluster
  is needed – without running significant jobs (test k/v skew,
  data quality, etc.)
• when your input data is already in S3, SDB, EBS, RDS…
• when your output needs to be consumed in AWS …
You really don't want to buy rack space in a datacenter before
assessing these issues – besides, a private datacenter probably
won’t even be cost-effective afterward.
But why run Hadoop on Elastic MapReduce?

• virtualization: Hadoop needs some mods to run well in
 that kind of environment
• pay-per-drink: absorbs cost of launching nodes
• secret sauce: Cluster Compute Instances (CCI) and
 Spot Instances (SI)
• DevOps: EMR job flow mgmt optimizes where your staff
 spends their (limited) time+capital
• logging to S3, works wonders for troubleshooting
A Tale of Two Ventures…

Adknowledge: in 2008, our team became one of the larger
use cases running Hadoop on AWS
• prior to the launch of EMR
• launching clusters of up to 100 m1.xlarge
• initially 12 hrs/day, optimized down to 4 hrs/day
• displaced $3MM capex for Netezza

ShareThis: in 2009, our team used even more Hadoop
on AWS than that previous team
• this time with EMR
• larger/more frequent jobs
• lower batch failure rate
• faster turnaround on results
• excellent support
• smaller team required
• much less budget
“WordCount”, a “Hello World” for MapReduce

Definition: count how often each word appears within a collection
of text documents.

A simple program which illustrates a pretty good test case for what
MapReduce can perform, since it incorporates:

• minimal amount of code
• document feature extraction (where words are “terms”)
• symbolic and numeric values
• potential use of a combiner
• bipartite graph of (doc, term) tuples
• not so many steps away from useful indexing…
When a framework can run “WordCount” in parallel at scale, then it
can handle much larger, more interesting compute problems as well.
Bipartite Graph

Wikipedia: “…a bipartite graph is a graph whose vertices can be divided
into two disjoint sets U and V such that every edge connects a vertex in U
to one in V… ”

     http://en.wikipedia.org/wiki/Bipartite_graph




Consider the case where:


   U ≡ { documents }

   V ≡ { terms }


Many kinds of text analytics products
can be constructed based on this
data structure as a foundation.
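
One possible way to hold this structure in memory is a pair of adjacency maps, sketched below. The function name `build_bipartite` and the whitespace tokenizer are assumptions made for illustration; they are not taken from the talk's source code.

```python
from collections import defaultdict


def build_bipartite(docs):
    """Return (doc -> terms, term -> docs) adjacency maps for a small corpus."""
    doc_terms = defaultdict(set)
    term_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            # every edge joins one vertex in U (documents) to one in V (terms)
            doc_terms[doc_id].add(term)
            term_docs[term].add(doc_id)
    return doc_terms, term_docs
```

Read from the term side, the second map is already a simple inverted index: it lists the documents in which each term appears.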
“WordCount”, in other words…


  map(doc_id, text)
  → list(word, count)

  reduce(word, list(count))
   → list(sum_count)
“WordCount”, in other words…

  void map (String doc_id, String text):
   for each word w in segment(text):
    emitPartial(w, "1");



  void reduce (String word, Iterator partial_counts):
   int count = 0;

   for each pc in partial_counts:
    count += Int(pc);

   emitResult(String(count));
Hadoop Streaming

One way to approach MapReduce jobs in Hadoop is to use streaming.
In other words, use any kind of script which can be run from a command
line and read/write data via stdin and stdout:
    http://hadoop.apache.org/common/docs/current/streaming.html#Hadoop+Streaming


The following examples use Python scripts for Hadoop Streaming. One
really great benefit is that then you can dev/test/debug your MapReduce
code on small data sets from a command line simply by using pipes:


     cat input.txt | mapper.py | sort | reducer.py

BTW, there are much better ways to handle Hadoop Streaming in Python
on Elastic MapReduce – for example, using the “boto” library. However,
these examples are kept simple so they’ll fit into a tech talk!
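
As a rough sketch of what such a pair of streaming scripts could look like, here is a WordCount mapper and reducer combined in one file. It is not the code from the talk's repository: the function names, the `map`/`reduce` command-line argument, and the file name `wc.py` in the comment are hypothetical, and the tokenizer is a plain whitespace split.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line for every word read."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts over each run of identical keys.

    Relies on the input being sorted by key, which `sort` (or the
    shuffle phase) guarantees.
    """
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()
    )
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    # e.g. cat input.txt | python wc.py map | sort | python wc.py reduce
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Because both steps only read lines and write lines, the same functions can be checked locally on a handful of strings before anything is submitted to a cluster.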
“WordCount”, in other words…

  # this Linux command line...
  cat foo.txt | map_wc.py | sort | red_wc.py


  # produces output like this...
  tuple	9
  term	6
  tfidf	6
  sort	5
  analysis	2
  wordcount	1
  user	1


  # depending on input -
  # which could be HTML content, tweets, email, etc.
Speaking of Email…

Enron pioneered innovative corporate accounting methods and energy market
manipulations, involving a baffling array of fraud techniques. The firm soared to
a valuation of over $60B (growing 56% in 1999, 87% in 2000) while inducing a
state of emergency in California – which cost the state over $40B. Subsequent
prosecution of top execs plus the meteoric decline in the firm’s 2001 share value
made for a spectacular #EPIC #FAIL
     http://en.wikipedia.org/wiki/Enron_scandal
     http://en.wikipedia.org/wiki/California_electricity_crisis


Thanks to CALO and Infochimps, we have a half million email messages
collected from Enron managers during their, um, “heyday” period:
     http://infochimps.org/datasets/enron-email-dataset--2
     http://www.cs.cmu.edu/~enron/


Let’s use Hadoop to help find out: what were some of
the things those managers were talking about?
Simple Text Analytics

Extending from how “WordCount” works, we’ll add multiple kinds of output
tuples, plus two stages of mappers and reducers, to generate different kinds
of text analytics products:

• inverted index
• co-occurrence analysis
• TF-IDF filter
• social graph

While doing that, we'll also perform other statistical
analysis and data visualization using R and Gephi
Mapper 1: RFC822 Parser

map_parse.py takes a list of URIs from which to read email messages, parses
each message, then emits multiple kinds of output tuples:



   (doc_id, msg_uri, date)

   (sender, receiver, doc_id)

   (term, term_freq, doc_id)

   (term, co_term, doc_id)

Note that our dataset includes approximately 500,000 email messages, with an
average of about 100 words in each message.
Also, there are 10E+5 unique terms. That will tend to be a constant in English
texts, which is great to know when configuring capacity.
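
A minimal sketch of a parser that emits these four tuple shapes, using Python's standard `email` module, might look like the following. It is not the actual map_parse.py: the function name `parse_message`, the leading tag on each tuple ("doc", "edge", "tf", "cooc"), the raw-count term frequency, and the regex tokenizer are all assumptions, and only single-part plain-text messages are handled.

```python
import re
from collections import Counter
from email import message_from_string
from itertools import combinations


def parse_message(doc_id, msg_uri, raw_text):
    """Yield tagged tuples for one RFC 822 message."""
    msg = message_from_string(raw_text)

    # (doc_id, msg_uri, date)
    yield ("doc", doc_id, msg_uri, msg.get("Date", ""))

    # (sender, receiver, doc_id): one tuple per address in the To: header
    sender = msg.get("From", "")
    for receiver in msg.get("To", "").split(","):
        if receiver.strip():
            yield ("edge", sender, receiver.strip(), doc_id)

    # assumes a single-part message whose payload is a plain string
    body = msg.get_payload()
    term_freq = Counter(re.findall(r"[a-z]+", body.lower()))

    # (term, term_freq, doc_id), with term_freq as a raw count
    for term, freq in sorted(term_freq.items()):
        yield ("tf", term, freq, doc_id)

    # (term, co_term, doc_id): each unordered pair of distinct terms
    for term, co_term in combinations(sorted(term_freq), 2):
        yield ("cooc", term, co_term, doc_id)
```

Tagging each tuple with its kind is one simple way to let a later stage tell the four shapes apart once they are mixed in the same stream. Note that the pair output grows quadratically with the number of distinct terms per message.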
Reducer 1: TF-IDF and Co-Occurrence

red_idf.py takes the shuffled output from map_parse.py, collects metadata
for each term, calculates TF-IDF to use in a later stage for filtering, calculates
co-occurrence probability, then emits all these results:



   (doc_id, msg_uri, date)

   (sender, receiver, doc_id)

   (term, idf, count)

   (term, co_term, prob_cooc)

   (term, tfidf, doc_id)

   (term, max_tfidf)
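
The slides do not spell out the formulas used in red_idf.py, so the sketch below assumes the common textbook definition, tfidf = tf × log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tfidf_for_terms`, the tuple tags, and the reading of `count` as document frequency are hypothetical; co-occurrence probability is left out.

```python
import math
from collections import defaultdict


def tfidf_for_terms(tf_tuples, n_docs):
    """tf_tuples: iterable of (term, term_freq, doc_id); n_docs: corpus size."""
    # collect the postings for each term, as the shuffle would
    by_term = defaultdict(list)
    for term, term_freq, doc_id in tf_tuples:
        by_term[term].append((term_freq, doc_id))

    results = []
    for term, postings in sorted(by_term.items()):
        # document frequency = number of documents containing the term
        doc_freq = len(postings)
        idf = math.log(n_docs / doc_freq)
        results.append(("idf", term, idf, doc_freq))

        scores = [(term_freq * idf, doc_id) for term_freq, doc_id in postings]
        for tfidf, doc_id in scores:
            results.append(("tfidf", term, tfidf, doc_id))

        # the per-term maximum is what a later threshold can be applied to
        results.append(("max_tfidf", term, max(score for score, _ in scores)))
    return results
```

Under this definition a term that appears in every document gets idf = 0, so very common words drop to the bottom of any ranking by max_tfidf.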
Mapper 1 + Reducer 1
Mapper 1 Output
Reducer 1 Output
Mapper 2 + Reducer 2: Threshold Filter

map_filter.py and red_filter.py apply a threshold (based on statistical
analysis of TF-IDF) to filter results of co-occurrence analysis so that we begin to
produce a semantic lexicon for exploring the data set.


How do we determine a reasonable value for the TF-IDF threshold, for filtering
terms? Sampling from the (term, max_tfidf) tuple, we run summary stats and
visualization in R:


   cat dat.idf | util_extract.py m > thresh.tsv


We also convert the sender/receiver social graph into CSV format for Gephi
visualization:


   cat dat.parsed | util_extract.py s | util_gephi.py | sort -u > graph.csv
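
The threshold filter itself could be sketched as below. This is an illustration under stated assumptions, not the logic of map_filter.py or red_filter.py: the function names are hypothetical, the cut-off is a simple rank-based quantile (R's `quantile` interpolates by default, so values may differ slightly), and a co-occurrence tuple is kept only when both of its terms pass the threshold.

```python
def quantile_threshold(values, q=0.8):
    """Rank-based cut-off: the value at position q in the sorted list."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def filter_cooc(cooc_tuples, max_tfidf, threshold):
    """Keep (term, co_term, prob_cooc) tuples whose terms both pass.

    max_tfidf maps each term to its maximum TF-IDF score; terms missing
    from the map are treated as failing the threshold.
    """
    keep = {term for term, score in max_tfidf.items() if score >= threshold}
    return [t for t in cooc_tuples if t[0] in keep and t[1] in keep]
```

Whether to require one term or both to pass is a calibration choice; requiring both gives a smaller, stricter lexicon.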
Mapper 2 + Reducer 2
Elastic MapReduce Job Flows…
Elastic MapReduce Output Part Files…
Exporting Data into Other Tools…
Using R to Determine a Threshold…

  data <- read.csv("thresh.tsv", sep='\t', header=F)
  t_data <- data[,3]
  print(summary(t_data))

  # pass through values for 80+ percentile
  qntile <- .8
  t_thresh <- quantile(t_data, qntile)

  # CDF plot
  title <- "CDF threshold max(tfidf)"
  xtitle <- paste("thresh:", t_thresh)
  par(mfrow=c(2, 1))
  plot(ecdf(t_data), xlab=xtitle, main=title)
  abline(v=t_thresh, col="red")
  abline(h=qntile, col="yellow")

  # box-and-whisker plot
  boxplot(t_data, horizontal=TRUE)
  rug(t_data, side=1)
Using R to Determine a Threshold…
  [Figure: empirical CDF plot titled “CDF threshold max(tfidf)”, y-axis Fn(x),
  x-axis label “thresh: 0.063252”; below it, a horizontal box-and-whisker plot
  of the same data]
Using Gephi to Explore the Social Graph…
Best Practices

• Again, there are much more efficient ways to handle Hadoop Streaming
  and Text Analytics…
• Unit Tests, Continuous Integration, etc., – all great stuff, but “Big Data”
  software engineering requires additional steps
• Sample data, measure data ratios and cluster behaviors, analyze in R,
  visualize everything you can, calibrate any necessary “magic numbers”
• Develop and test code on a personal computer in an IDE, cmd line, etc., using
  minimal data sets
• Deploy to staging cluster with larger data sets for integration tests and QA
• Run in production with A/B testing where feasible to evaluate changes
  quantitatively
• Learn from others at meetups, unconfs, forums, etc.
Great Resources for Diving into Hadoop

Google: Cluster Computing and MapReduce Lectures
    http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html


Amazon AWS Elastic MapReduce
    http://aws.amazon.com/elasticmapreduce/


Hadoop: The Definitive Guide, by Tom White
    http://oreilly.com/catalog/9780596521981


Apache Hadoop
    http://hadoop.apache.org/


Python “boto” interface to EMR
    http://boto.cloudhackers.com/emr_tut.html
Excellent Products for Hadoop in Production

Datameer
     http://www.datameer.com/
     “Democratizing Big Data”

Designed for business users, Datameer Analytics Solution (DAS) builds
on the power and scalability of Apache Hadoop to deliver an easy-to-use
and cost-effective solution for big data analytics. The solution integrates
rapidly with existing and new data sources to deliver sophisticated
analytics.


Cascading
     http://www.cascading.org/

Cascading is a feature-rich API for defining and executing complex,
scale-free, and fault tolerant data processing workflows on a Hadoop
cluster, which provides a thin Java library that sits on top of Hadoop's
MapReduce layer. Open source in Java.
Scale Unlimited – Hadoop Boot Camp

Santa Clara, 22-23 July 2010
http://www.scaleunlimited.com/courses/hadoop-bootcamp-santaclara

• An extensive overview of the Hadoop architecture
• Theory and practice of solving large scale data processing problems
• Hands-on labs covering Hadoop installation, development, debugging
• Common and advanced “Big Data” tasks and solutions
Special $500 discount for SVCC Meetup members:
http://hadoopbootcamp.eventbrite.com/?discount=DBDatameer


Sample material – list of questions about intro Hadoop from
the recent BigDataCamp:
http://www.scaleunlimited.com/blog/intro-to-hadoop-at-bigdatacamp

Weitere ähnliche Inhalte

Was ist angesagt?

Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using PigDavid Wellman
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverDataWorks Summit
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsHadoop User Group
 
Python in big data world
Python in big data worldPython in big data world
Python in big data worldRohit
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009yhadoop
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterKevin Weil
 
Geek camp
Geek campGeek camp
Geek campjdhok
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 

Was ist angesagt? (20)

Practical Hadoop using Pig
Practical Hadoop using PigPractical Hadoop using Pig
Practical Hadoop using Pig
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 
Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009Hadoop at Yahoo! -- Hadoop World NY 2009
Hadoop at Yahoo! -- Hadoop World NY 2009
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at Twitter
 
Geek camp
Geek campGeek camp
Geek camp
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 

Andere mochten auch

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshootingmapr-academy
 
Policy governance reviewed
Policy governance reviewedPolicy governance reviewed
Policy governance reviewedSuzanne Benoit
 
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...AIIM International
 
Bfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare IndustryBfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare IndustryGlobalsion Software Sdn Bhd
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Technology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance CompaniesTechnology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance CompaniesChris Reynolds
 
A Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your DocumentsA Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your DocumentsScott Abel
 
MapReduce for Idiots
MapReduce for IdiotsMapReduce for Idiots
MapReduce for Idiotspetewarden
 
Introduction to MapReduce Data Transformations
Introduction to MapReduce Data TransformationsIntroduction to MapReduce Data Transformations
Introduction to MapReduce Data Transformationsswooledge
 
The Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementThe Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementDATAVERSITY
 
Alfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture OverviewAlfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture OverviewAlfresco Software
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions Alfresco Software
 
Intro To Alfresco Part 1
Intro To Alfresco Part 1Intro To Alfresco Part 1
Intro To Alfresco Part 1Jeff Potts
 
EDRMS Pre implementation project plan
EDRMS Pre implementation project planEDRMS Pre implementation project plan
EDRMS Pre implementation project planDonna_Maree_Findlay
 
Alfresco 5.2 REST API
Alfresco 5.2 REST APIAlfresco 5.2 REST API
Alfresco 5.2 REST APIJ V
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 

Andere mochten auch (20)

Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
70a monitoring & troubleshooting
70a monitoring & troubleshooting70a monitoring & troubleshooting
70a monitoring & troubleshooting
 
Policy governance reviewed
Policy governance reviewedPolicy governance reviewed
Policy governance reviewed
 
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
Why Are Change Management And Metrics Such Crucial Aspects To Your Overall De...
 
Bfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare IndustryBfit for healthcare - A Document Management System for Healthcare Industry
Bfit for healthcare - A Document Management System for Healthcare Industry
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
DMAvatar
DMAvatarDMAvatar
DMAvatar
 
Technology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance CompaniesTechnology Investment for Mutual Insurance Companies
Technology Investment for Mutual Insurance Companies
 
A Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your DocumentsA Practical Guide to Capturing, Organizing, and Securing Your Documents
A Practical Guide to Capturing, Organizing, and Securing Your Documents
 
MapReduce for Idiots
MapReduce for IdiotsMapReduce for Idiots
MapReduce for Idiots
 
Introduction to MapReduce Data Transformations
Introduction to MapReduce Data TransformationsIntroduction to MapReduce Data Transformations
Introduction to MapReduce Data Transformations
 
The Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data ManagementThe Chief Data Officer Agenda: Metrics for Information and Data Management
The Chief Data Officer Agenda: Metrics for Information and Data Management
 
Alfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture OverviewAlfresco As SharePoint Alternative - Architecture Overview
Alfresco As SharePoint Alternative - Architecture Overview
 
Scale your Alfresco Solutions
Scale your Alfresco Solutions Scale your Alfresco Solutions
Scale your Alfresco Solutions
 
Intro To Alfresco Part 1
Intro To Alfresco Part 1Intro To Alfresco Part 1
Intro To Alfresco Part 1
 
EDRMS Pre implementation project plan
EDRMS Pre implementation project planEDRMS Pre implementation project plan
EDRMS Pre implementation project plan
 
Alfresco 5.2 REST API
Alfresco 5.2 REST APIAlfresco 5.2 REST API
Alfresco 5.2 REST API
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Big Data simplified
Big Data simplifiedBig Data simplified
Big Data simplified
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 

Ähnlich wie Getting Started on Hadoop

Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingCollin Bennett
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM Joy Rahman
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersKumari Surabhi
 
Distributed computing poli
Distributed computing poliDistributed computing poli
Distributed computing poliivascucristian
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"Nicola Ferraro
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Big Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWAREBig Data and Machine Learning with FIWARE
Big Data and Machine Learning with FIWAREFernando Lopez Aguilar
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster InnardsMartin Dvorak
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Distributed Computing & MapReduce
Distributed Computing & MapReduceDistributed Computing & MapReduce
Distributed Computing & MapReducecoolmirza143
 
How Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapeHow Apache Spark fits in the Big Data landscape
How Apache Spark fits in the Big Data landscapePaco Nathan
 
Hadoop and Mapreduce Introduction
Hadoop and Mapreduce IntroductionHadoop and Mapreduce Introduction
Hadoop and Mapreduce Introductionrajsandhu1989
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 

Getting Started on Hadoop

  • 1. Getting Started on Hadoop Silicon Valley Cloud Computing Meetup Mountain View, 2010-07-19 http://www.meetup.com/cloudcomputing/calendar/13911740/ Paco Nathan @pacoid http://ceteri.blogspot.com/ Examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service. • first, a brief history… • AWS Elastic MapReduce • “WordCount” example as “Hello World” for MapReduce • text mining Enron Email Dataset from Infochimps.com • inverted index, semantic lexicon, social graph • data visualization using R and Gephi All source code for this talk is available at: http://github.com/ceteri/ceteri-mapred
  • 2. How Does MapReduce Work? map(k1, v1) → list(k2, v2) reduce(k2, list(v2)) → list(v3) Several phases, which partition a problem into many tasks: • load data into DFS… • map phase: input split → (key, value) pairs, with optional combiner • shuffle phase: sort on keys to group pairs… load-test your network! • reduce phase: each task receives the values for one key • pull data from DFS… NB: “map” phase is required, the rest are optional. Think of set operations on tuples (and check out Cascading.org). Meanwhile, given all those (key, value) pairs listed above, it’s no wonder that key/value stores have become such a popular topic of conversation…
  • 3. How Does MapReduce Work? map(k1, v1) → list(k2, v2) reduce(k2, list(v2)) → list(v3) The property of data independence among tasks allows for highly parallel processing… maybe, if the stars are all aligned :) A MapReduce framework is primarily about fault tolerance, and how to leverage “commodity hardware” to replace “big iron” solutions… That phrase “big iron” might apply to Oracle + NetApp. Or perhaps an IBM zSeries mainframe… Or something – expensive, undoubtedly. Bonus questions for self-admitted math geeks: Foresee any concerns about O(n) complexity, given the functional definitions listed above? Keep in mind that each phase cannot conclude and progress to the next phase until after each of its tasks has successfully completed.
  • 4. A Brief History… circa 1979 – Stanford, MIT, CMU, etc. set/list operations in LISP, Prolog, etc., for parallel processing http://www-formal.stanford.edu/jmc/history/lisp/lisp.htm circa 2004 – Google MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat http://labs.google.com/papers/mapreduce.html circa 2006 – Apache Hadoop, originating from the Nutch Project Doug Cutting http://research.yahoo.com/files/cutting.pdf circa 2008 – Yahoo web scale search indexing Hadoop Summit, HUG, etc. http://developer.yahoo.com/hadoop/ circa 2009 – Amazon AWS Elastic MapReduce Hadoop modified for EC2/S3, plus support for Hive, Pig, etc. http://aws.amazon.com/elasticmapreduce/
  • 5. Why run Hadoop in AWS? • elastic: batch jobs on clusters can consume many nodes, scalable demand, not 24/7 – great case for using EC2 • commodity hardware: MR is built for fault tolerance, great case for leveraging AMIs • right-sizing: difficult to know a priori how large of a cluster is needed – without running significant jobs (test k/v skew, data quality, etc.) • when your input data is already in S3, SDB, EBS, RDS… • when your output needs to be consumed in AWS … You really don't want to buy rack space in a datacenter before assessing these issues – besides, a private datacenter probably won’t even be cost-effective afterward.
  • 6. But why run Hadoop on Elastic MapReduce? • virtualization: Hadoop needs some mods to run well in that kind of environment • pay-per-drink: absorbs cost of launching nodes • secret sauce: Cluster Compute Instances (CCI) and Spot Instances (SI) • DevOps: EMR job flow mgmt optimizes where your staff spends their (limited) time+capital • logging to S3, works wonders for troubleshooting
  • 7. A Tale of Two Ventures… Adknowledge: in 2008, our team became one of the larger use cases running Hadoop on AWS • prior to the launch of EMR • launching clusters of up to 100 m1.xlarge • initially 12 hrs/day, optimized down to 4 hrs/day • displaced $3MM capex for Netezza ShareThis: in 2009, our team used even more Hadoop on AWS than that previous team • this time with EMR • larger/more frequent jobs • lower batch failure rate • faster turnaround on results • excellent support • smaller team required • much less budget
  • 8. “WordCount”, a “Hello World” for MapReduce Definition: count how often each word appears within a collection of text documents. A simple program which illustrates a pretty good test case for what MapReduce can perform, since it incorporates: • minimal amount of code • document feature extraction (where words are “terms”) • symbolic and numeric values • potential use of a combiner • bipartite graph of (doc, term) tuples • not so many steps away from useful indexing… When a framework can run “WordCount” in parallel at scale, then it can handle much larger, more interesting compute problems as well.
  • 9. Bipartite Graph Wikipedia: “…a bipartite graph is a graph whose vertices can be divided into two disjoint sets U and V such that every edge connects a vertex in U to one in V… ” http://en.wikipedia.org/wiki/Bipartite_graph Consider the case where: U ≡ { documents } V ≡ { terms } Many kinds of text analytics products can be constructed based on this data structure as a foundation.
  • 10. “WordCount”, in other words… map(doc_id, text) → list(word, count) reduce(word, list(count)) → list(sum_count)
  • 11. “WordCount”, in other words…
      void map (String doc_id, String text):
        for each word w in segment(text):
          emitPartial(w, "1");
      void reduce (String word, Iterator partial_counts):
        int count = 0;
        for each pc in partial_counts:
          count += Int(pc);
        emitResult(String(count));
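The pseudocode on slide 11 can be sketched as plain Python that runs in a single process. This is only an illustration of the map, shuffle, and reduce phases, not code from the talk's repository: the tokenizer in segment(), the helper word_count(), and the sample documents are assumptions made for the example.

```python
import re
from itertools import groupby
from operator import itemgetter


def segment(text):
    """Split text into lowercase word tokens (a simplistic, assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def map_wc(doc_id, text):
    """map(doc_id, text) -> list of (word, "1") partial counts."""
    return [(w, "1") for w in segment(text)]


def reduce_wc(word, partial_counts):
    """reduce(word, list(count)) -> total count for that word."""
    return sum(int(pc) for pc in partial_counts)


def word_count(docs):
    """Hypothetical driver that plays the role of the framework."""
    # map phase: every document yields (word, "1") pairs
    pairs = [kv for doc_id, text in docs.items() for kv in map_wc(doc_id, text)]
    # shuffle phase: sort on keys so that equal words become adjacent
    pairs.sort(key=itemgetter(0))
    # reduce phase: one call per distinct key
    return {
        word: reduce_wc(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    }


if __name__ == "__main__":
    docs = {"d1": "the quick brown fox", "d2": "the lazy dog and the fox"}
    print(word_count(docs))
```

In a real cluster the sort happens across machines during the shuffle phase; here a local sort stands in for it, which is enough to see why each reduce call receives all the values for one key.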
  • 12. Hadoop Streaming One way to approach MapReduce jobs in Hadoop is to use streaming. In other words, use any kind of script which can be run from a command line and read/write data via stdin and stdout: http://hadoop.apache.org/common/docs/current/streaming.html#Hadoop+Streaming The following examples use Python scripts for Hadoop Streaming. One really great benefit is that then you can dev/test/debug your MapReduce code on small data sets from a command line simply by using pipes: cat input.txt | mapper.py | sort | reducer.py BTW, there are much better ways to handle Hadoop Streaming in Python on Elastic MapReduce – for example, using the “boto” library. However, these examples are kept simple so they’ll fit into a tech talk!
  • 15. “WordCount”, in other words…
      # this Linux command line...
      cat foo.txt | map_wc.py | sort | red_wc.py
      # produces output like this...
      tuple 9
      term 6
      tfidf 6
      sort 5
      analysis 2
      wordcount 1
      user 1
      # depending on input -
      # which could be HTML content, tweets, email, etc.
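The pipe on slide 15 relies on a line-oriented protocol: the mapper writes one record per line, sort groups identical keys together, and the reducer sums each run of equal keys. The sketch below imitates that protocol on in-memory lists so it can be run without Hadoop. It is not the content of map_wc.py or red_wc.py from the talk; the tab-separated record layout and the function names mapper() and reducer() are assumptions.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per token, as a streaming mapper might."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for runs of identical keys; the input must be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda rec: rec[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    text = ["tuple term tuple", "term tuple sort"]
    # sorted() stands in for the `sort` step of the pipe (and for the shuffle)
    for record in reducer(sorted(mapper(text))):
        print(record)
```

In an actual streaming job the two functions would live in separate scripts, reading sys.stdin and printing to stdout. The reducer only works because its input arrives sorted, which is the guarantee the shuffle phase provides.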
  • 16. Speaking of Email… Enron pioneered innovative corporate accounting methods and energy market manipulations, involving a baffling array of fraud techniques. The firm soared to a valuation of over $60B (growing 56% in 1999, 87% in 2000) while inducing a state of emergency in California – which cost the state over $40B. Subsequent prosecution of top execs plus the meteoric decline in the firm’s 2001 share value made for a spectacular #EPIC #FAIL http://en.wikipedia.org/wiki/Enron_scandal http://en.wikipedia.org/wiki/California_electricity_crisis Thanks to CALO and Infochimps, we have a half million email messages collected from Enron managers during their, um, “heyday” period: http://infochimps.org/datasets/enron-email-dataset--2 http://www.cs.cmu.edu/~enron/ Let’s use Hadoop to help find out: what were some of the things those managers were talking about?
  • 17. Simple Text Analytics Extending from how “WordCount” works, we’ll add multiple kinds of output tuples, plus two stages of mappers and reducers, to generate different kinds of text analytics products: • inverted index • co-occurrence analysis • TF-IDF filter • social graph While doing that, we'll also perform other statistical analysis and data visualization using R and Gephi
  • 18. Mapper 1: RFC822 Parser map_parse.py takes a list of URIs for where to read email messages, parses each message, then emits multiple kinds of output tuples: (doc_id, msg_uri, date) (sender, receiver, doc_id) (term, term_freq, doc_id) (term, co_term, doc_id) Note that our dataset includes approximately 500,000 email messages, with an average of about 100 words in each message. Also, there are 10E+5 unique terms. That will tend to be a constant in English texts, which is great to know when configuring capacity.
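A rough sketch of how one message could be turned into the four tuple kinds from slide 18, using Python's standard email package. This is not map_parse.py itself. Several details are assumptions for illustration: the leading tag on each tuple ("doc", "edge", "tf", "cooc") that distinguishes the kinds, term_freq defined as count divided by message length, co-occurrence taken over every unordered pair of distinct terms in a message, and the sample URI and addresses.

```python
import email
import re
from collections import Counter
from email.utils import getaddresses
from itertools import combinations


def parse_message(doc_id, msg_uri, raw_text):
    """Yield tagged tuples for one RFC 822 message (tags are an assumption)."""
    msg = email.message_from_string(raw_text)

    # (doc_id, msg_uri, date)
    yield ("doc", doc_id, msg_uri, msg.get("Date", ""))

    # (sender, receiver, doc_id): one tuple per sender/recipient pair
    senders = [addr for _, addr in getaddresses(msg.get_all("From", []))]
    receivers = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    for sender in senders:
        for receiver in receivers:
            yield ("edge", sender, receiver, doc_id)

    # (term, term_freq, doc_id): relative frequency within this message
    body = msg.get_payload()
    terms = re.findall(r"[a-z]+", body.lower()) if isinstance(body, str) else []
    counts = Counter(terms)
    for term, n in sorted(counts.items()):
        yield ("tf", term, n / len(terms), doc_id)

    # (term, co_term, doc_id): each unordered pair of distinct terms
    for term, co_term in combinations(sorted(counts), 2):
        yield ("cooc", term, co_term, doc_id)


if __name__ == "__main__":
    raw = (
        "From: alice@example.com\n"
        "To: bob@example.com, carol@example.com\n"
        "Date: Mon, 19 Jul 2010 10:00:00 -0700\n"
        "\n"
        "energy market energy\n"
    )
    for t in parse_message("d1", "file:///tmp/msg1.txt", raw):
        print(t)
```

Note that the number of co-occurrence pairs grows quadratically with the number of distinct terms per message, which is one reason the later stages filter terms by a TF-IDF threshold.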
  • 19. Reducer 1: TF-IDF and Co-Occurrence red_idf.py takes the shuffled output from map_parse.py, collects metadata for each term, calculates TF-IDF to use in a later stage for filtering, calculates co-occurrence probability, then emits all these results: (doc_id, msg_uri, date) (sender, receiver, doc_id) (term, idf, count) (term, co_term, prob_cooc) (term, tfidf, doc_id) (term, max_tfidf)
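The slide does not say which TF-IDF variant red_idf.py uses, so the following sketch assumes the common textbook form: idf = log(N / df) and tfidf = tf × idf, where N is the number of documents and df is the number of documents containing the term. The function name tfidf_stats() and the sample tuples are hypothetical, and the co-occurrence probability step is left out.

```python
import math
from collections import defaultdict


def tfidf_stats(tf_tuples, n_docs):
    """Return (idf, tfidf, max_tfidf) dicts from (term, tf, doc_id) tuples."""
    by_term = defaultdict(list)
    for term, tf, doc_id in tf_tuples:
        by_term[term].append((doc_id, tf))

    idf, tfidf, max_tfidf = {}, {}, {}
    for term, postings in by_term.items():
        # document frequency = number of distinct docs containing the term
        df = len({doc_id for doc_id, _ in postings})
        idf[term] = math.log(n_docs / df)
        for doc_id, tf in postings:
            tfidf[(term, doc_id)] = tf * idf[term]
        # highest score the term reaches in any document
        max_tfidf[term] = max(tfidf[(term, d)] for d, _ in postings)
    return idf, tfidf, max_tfidf


if __name__ == "__main__":
    tuples = [("energy", 0.5, "d1"), ("energy", 0.25, "d2"), ("fraud", 0.5, "d2")]
    idf, tfidf, max_tfidf = tfidf_stats(tuples, n_docs=2)
    print(idf)
    print(max_tfidf)
```

Under this definition a term that appears in every document gets an idf of zero, so its max_tfidf is zero too and it falls below any positive threshold in the next stage.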
  • 20. Mapper 1 + Reducer 1
  • 23. Mapper 2 + Reducer 2: Threshold Filter map_filter.py and red_filter.py apply a threshold (based on statistical analysis of TF-IDF) to filter results of co-occurrence analysis so that we begin to produce a semantic lexicon for exploring the data set. How do we determine a reasonable value for the TF-IDF threshold, for filtering terms? Sampling from the (term, max_tfidf) tuple, we run summary stats and visualization in R: cat dat.idf | util_extract.py m > thresh.tsv We also convert the sender/receiver social graph into CSV format for Gephi visualization: cat dat.parsed | util_extract.py s | util_gephi.py | sort -u > graph.csv
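As a rough Python counterpart to the threshold step on slide 23, the sketch below picks a quantile of the max_tfidf values and keeps only co-occurrence tuples whose terms clear it. It is not map_filter.py or red_filter.py. The quantile() helper uses linear interpolation between order statistics, which is the rule R applies by default (type 7). Requiring both terms of a pair to pass, the function names, and the sample values are assumptions.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty collection, 0 <= q <= 1."""
    xs = sorted(values)
    h = (len(xs) - 1) * q
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def filter_cooc(cooc_tuples, max_tfidf, q=0.8):
    """Keep (term, co_term, prob) tuples whose two terms both reach the threshold."""
    thresh = quantile(max_tfidf.values(), q)
    keep = {term for term, score in max_tfidf.items() if score >= thresh}
    return thresh, [t for t in cooc_tuples if t[0] in keep and t[1] in keep]


if __name__ == "__main__":
    max_tfidf = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}
    cooc = [("c", "d", 0.4), ("a", "e", 0.9), ("d", "e", 0.2)]
    print(filter_cooc(cooc, max_tfidf, q=0.5))
```

The quantile level plays the role of the "magic number" that the talk calibrates by inspecting the CDF in R. A higher q yields a smaller, more selective lexicon.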
  • 24. Mapper 2 + Reducer 2
  • 33. Elastic MapReduce Output Part Files…
  • 34. Exporting Data into Other Tools…
  • 35. Using R to Determine a Threshold…
      data <- read.csv("thresh.tsv", sep='\t', header=F)
      t_data <- data[,3]
      print(summary(t_data))
      # pass through values for 80+ percentile
      qntile <- .8
      t_thresh <- quantile(t_data, qntile)
      # CDF plot
      title <- "CDF threshold max(tfidf)"
      xtitle <- paste("thresh:", t_thresh)
      par(mfrow=c(2, 1))
      plot(ecdf(t_data), xlab=xtitle, main=title)
      abline(v=t_thresh, col="red")
      abline(h=qntile, col="yellow")
      # box-and-whisker plot
      boxplot(t_data, horizontal=TRUE)
      rug(t_data, side=1)
  • 36. Using R to Determine a Threshold… [Figure: empirical CDF plot titled “CDF threshold max(tfidf)”, y-axis Fn(x), x-axis labeled “thresh: 0.063252”, shown above a horizontal box-and-whisker plot of the same data.]
  • 37. Using Gephi to Explore the Social Graph…
  • 38. Best Practices • Again, there are much more efficient ways to handle Hadoop Streaming and Text Analytics… • Unit Tests, Continuous Integration, etc. – all great stuff, but “Big Data” software engineering requires additional steps • Sample data, measure data ratios and cluster behaviors, analyze in R, visualize everything you can, calibrate any necessary “magic numbers” • Develop and test code on a personal computer in IDE, cmd line, etc., using minimal data sets • Deploy to staging cluster with larger data sets for integration tests and QA • Run in production with A/B testing where feasible to evaluate changes quantitatively • Learn from others at meetups, unconfs, forums, etc.
  • 39. Great Resources for Diving into Hadoop Google: Cluster Computing and MapReduce Lectures http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html Amazon AWS Elastic MapReduce http://aws.amazon.com/elasticmapreduce/ Hadoop: The Definitive Guide, by Tom White http://oreilly.com/catalog/9780596521981 Apache Hadoop http://hadoop.apache.org/ Python “boto” interface to EMR http://boto.cloudhackers.com/emr_tut.html
  • 40. Excellent Products for Hadoop in Production Datameer http://www.datameer.com/ “Democratizing Big Data” Designed for business users, Datameer Analytics Solution (DAS) builds on the power and scalability of Apache Hadoop to deliver an easy-to-use and cost-effective solution for big data analytics. The solution integrates rapidly with existing and new data sources to deliver sophisticated analytics. Cascading http://www.cascading.org/ Cascading is a feature-rich API for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster, which provides a thin Java library that sits on top of Hadoop's MapReduce layer. Open source in Java.
  • 41. Scale Unlimited – Hadoop Boot Camp Santa Clara, 22-23 July 2010 http://www.scaleunlimited.com/courses/hadoop-bootcamp-santaclara • An extensive overview of the Hadoop architecture • Theory and practice of solving large scale data processing problems • Hands-on labs covering Hadoop installation, development, debugging • Common and advanced “Big Data” tasks and solutions Special $500 discount for SVCC Meetup members: http://hadoopbootcamp.eventbrite.com/?discount=DBDatameer Sample material – list of questions about intro Hadoop from the recent BigDataCamp: http://www.scaleunlimited.com/blog/intro-to-hadoop-at-bigdatacamp
  • 42. Getting Started on Hadoop Silicon Valley Cloud Computing Meetup Mountain View, 2010-07-19 http://www.meetup.com/cloudcomputing/calendar/13911740/ Paco Nathan @pacoid http://ceteri.blogspot.com/ Examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service. All source code for this talk is available at: http://github.com/ceteri/ceteri-mapred

Editor's Notes