Data-Intensive Computing for Text Analysis
CS395T / INF385T / LIN386M
University of Texas at Austin, Fall 2011

Lecture 4
September 15, 2011

Jason Baldridge                        Matt Lease
Department of Linguistics              School of Information
University of Texas at Austin          University of Texas at Austin
Jasonbaldridge at gmail dot com        ml at ischool dot utexas dot edu
Acknowledgments
Course design and slides based on Jimmy Lin’s cloud computing courses at the University of Maryland, College Park

Some figures courtesy of the following excellent Hadoop books (order yours today!)
• Chuck Lam’s Hadoop In Action (2010)
• Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010)
Today’s Agenda
• Practical Hadoop
  – Input/Output
  – Splits: small file and whole file operations
  – Compression
  – Mounting HDFS
  – Hadoop Workflow and EC2/S3
Practical Hadoop
“Hello World”: Word Count
map ( K1=String, V1=String ) → list ( K2=String, V2=Integer )

reduce ( K2=String, list(V2=Integer) ) → list ( K3=String, V3=Integer)



    Map(String docid, String text):
      for each word w in text:
          Emit(w, 1);

    Reduce(String term, Iterator<Int> values):
      int sum = 0;
      for each v in values:
          sum += v;
      Emit(term, sum);
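
Below is a hedged Java sketch of the same job against the old org.apache.hadoop.mapred API used elsewhere in these slides; the class names and wiring are ours, not from the lecture.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class WordCount {
      public static class WCMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();   // re-used across calls

        public void map(LongWritable docOffset, Text text,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          StringTokenizer itr = new StringTokenizer(text.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            out.collect(word, ONE);             // Emit(w, 1)
          }
        }
      }

      public static class WCReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text term, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) sum += values.next().get();
          out.collect(term, new IntWritable(sum));  // Emit(term, sum)
        }
      }
    }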
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 17
Courtesy of Chuck Lam’s Hadoop In Action (2010), pp. 48-49
Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 51
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 191
Command-Line Parsing




      Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), p. 135
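
White’s figure is not reproduced here. As a hedged sketch of what it illustrates: a driver that extends Configured and implements Tool lets ToolRunner (via GenericOptionsParser) consume the standard Hadoop options (-D, -conf, -fs, -jt, -files, …) before your code runs. The class name and the property my.custom.prop are ours, purely for illustration.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ConfPrinter extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        // Generic options (e.g. -D my.custom.prop=value) have already been
        // folded into the Configuration returned by getConf().
        System.out.println("my.custom.prop = " + getConf().get("my.custom.prop"));
        System.out.println("remaining args = " + Arrays.toString(args));
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ConfPrinter(), args));
      }
    }

Run as, e.g., hadoop jar myjob.jar ConfPrinter -D my.custom.prop=hello extra-arg.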
Data Types in Hadoop

Writable              Defines a de/serialization protocol.
                      Every data type in Hadoop is a Writable.

WritableComparable    Defines a sort order. All keys must be
                      of this type (but not values).

IntWritable           Concrete classes for different data types.
LongWritable
Text
…

SequenceFiles         Binary encoding of a sequence of
                      key/value pairs
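
To make “defines a de/serialization protocol” concrete, here is a tiny round-trip sketch (ours, not from the slides): any Writable can be written to and read back from a raw byte stream.

    import java.io.*;
    import org.apache.hadoop.io.IntWritable;

    public class WritableDemo {
      public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new IntWritable(42).write(new DataOutputStream(bytes));   // serialize

        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(
            new ByteArrayInputStream(bytes.toByteArray())));      // deserialize
        System.out.println(copy.get());                           // prints 42
      }
    }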
Hadoop basic types




             Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 46
Complex Data Types in Hadoop
   How do you implement complex data types?
   The easiest way:
       Encode it as Text, e.g., (a, b) = “a:b”
       Use regular expressions to parse and extract data
       Works, but pretty hack-ish
   The hard way (sketched below):
       Define a custom implementation of WritableComparable
       Must implement: readFields, write, compareTo
       Computationally efficient, but slow for rapid prototyping
   Alternatives:
       Cloud9 offers two other choices: Tuple and JSON
       (Actually, not that useful in practice)
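
A minimal sketch of “the hard way”: a custom string-pair key implementing write, readFields, and compareTo. The class name PairWritable is ours (Cloud9’s actual Tuple types differ).

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class PairWritable implements WritableComparable<PairWritable> {
      private final Text a = new Text();
      private final Text b = new Text();

      public void set(String left, String right) { a.set(left); b.set(right); }

      @Override public void write(DataOutput out) throws IOException {
        a.write(out);                     // serialize fields in a fixed order
        b.write(out);
      }

      @Override public void readFields(DataInput in) throws IOException {
        a.readFields(in);                 // deserialize in the same order
        b.readFields(in);
      }

      @Override public int compareTo(PairWritable other) {
        int cmp = a.compareTo(other.a);   // sort on first element,
        return cmp != 0 ? cmp : b.compareTo(other.b);  // break ties on second
      }

      // Keys also flow through the default HashPartitioner, so define these too.
      @Override public int hashCode() { return a.hashCode() * 163 + b.hashCode(); }
      @Override public boolean equals(Object o) {
        return o instanceof PairWritable
            && a.equals(((PairWritable) o).a) && b.equals(((PairWritable) o).b);
      }
    }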
InputFormat & RecordReader
                                     Courtesy of Tom White’s
                                     Hadoop: The Definitive Guide,
                                     2nd Edition (2010), pp. 198-199




                         Split is logical; atomic
                         records are never split




Note re-use of key & value objects!
Courtesy of Tom White’s
Hadoop: The Definitive Guide,
2nd Edition (2010), p. 201
Input




        Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 53
Output




         Courtesy of Chuck Lam’s Hadoop In Action (2010), p. 58
OutputFormat
[Diagram: three Reducers, each writing through its own RecordWriter to its own Output File.]

Source: redrawn from a slide by Cloudera, cc-licensed
Creating Input Splits (White pp. 202-203)

   FileInputFormat: large files split into blocks
       isSplitable() – default TRUE
       computeSplitSize() = max( minSize, min(maxSize, blockSize) )
       getSplits()…
   How to prevent splitting? (see the sketch below)
       Option 1: set mapred.min.split.size=Long.MAX_VALUE
       Option 2: subclass FileInputFormat, set isSplitable()=FALSE
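
Hedged sketches of both options against the old mapred API; the subclass name is ours.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Option 2: subclass the input format and refuse to split anything.
    public class NonSplittableTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }

    // Option 1 is a one-liner on the job configuration (assumes a JobConf
    // named conf):
    //   conf.setLong("mapred.min.split.size", Long.MAX_VALUE);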
How to process a whole file as a single record?

   e.g. file conversion

   Preventing splitting is necessary, but not sufficient
       Need a RecordReader that delivers the entire file as one record

   Implement the WholeFile input format & record reader recipe (condensed sketch below)
       See White pp. 206-209
       Overrides getRecordReader() in FileInputFormat
       Defines a new WholeFileRecordReader
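
A condensed, hedged sketch of that recipe in the old mapred API (White pp. 206-209 has the full version; error handling is trimmed here).

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapred.*;

    public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;                                 // step 1: never split
      }

      @Override
      public RecordReader<NullWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
      }
    }

    class WholeFileRecordReader
        implements RecordReader<NullWritable, BytesWritable> {
      private final FileSplit split;
      private final JobConf conf;
      private boolean processed = false;

      WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
      }

      public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) return false;                  // one record per file
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FSDataInputStream in = file.getFileSystem(conf).open(file);
        try {
          in.readFully(0, contents);                  // slurp the entire file
        } finally {
          in.close();
        }
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
      }

      public NullWritable createKey() { return NullWritable.get(); }
      public BytesWritable createValue() { return new BytesWritable(); }
      public long getPos() { return processed ? split.getLength() : 0; }
      public float getProgress() { return processed ? 1.0f : 0.0f; }
      public void close() { }
    }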
Small Files
   Files < Hadoop block size are never split (by default)
       Note this is with the default mapred.min.split.size = 1 byte
       Could extend FileInputFormat to override this behavior
   Using many small files is inefficient in Hadoop
       Overhead for TaskTracker, JobTracker, Map object, …
       Requires more disk seeks
       Wasteful of NameNode memory
   How to deal with small files?
Dealing with small files
    Pre-processing: merge into one or more bigger files
        Doubles disk space, unless clever (can delete originals after merge)
        Create a Hadoop Archive (White pp. 72-73)
          • Doesn’t solve the splitting problem, just reduces NameNode memory
        Simple text: just concatenate (e.g. each record on a single line)
        XML: concatenate, specify start/end tags to StreamXmlRecordReader
         (just as newline is the record delimiter for Text)
        Create a SequenceFile (see White pp. 117-118, and the sketch below)
          • Sequence of records, all with the same (key, value) type
          • E.g. Key=filename, Value=text or bytes of original file
          • Can also use for larger files, e.g. if block processing is really fast
    Use CombineFileInputFormat
        Reduces map overhead, but not seeks or NameNode memory…
        Only an abstract class is provided; you get to implement it… :-<
        Could use it to speed up the pre-processing above…
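
A hedged sketch of the SequenceFile option: pack a directory of small files into one SequenceFile with key = filename, value = raw bytes. The paths are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.io.*;

    public class PackSmallFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/path/packed.seq");          // illustrative path

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          for (FileStatus status : fs.listStatus(new Path("/path/small-files"))) {
            byte[] contents = new byte[(int) status.getLen()];
            FSDataInputStream in = fs.open(status.getPath());
            try { in.readFully(0, contents); } finally { in.close(); }
            writer.append(new Text(status.getPath().getName()),  // key = filename
                          new BytesWritable(contents));          // value = bytes
          }
        } finally {
          writer.close();
        }
      }
    }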
Multiple File Formats?
    What if you have multiple formats for the same content type?
    MultipleInputs (White pp. 214-215; see the fragment below)
        Specify the InputFormat & Mapper to use on a per-path basis
          • Path could be a directory or a single file
              • Even a single file could have many records (e.g. Hadoop archive or
                SequenceFile)
        All mappers must have the same output signature!
          • Same reducer used for all (only the input format differs, not the
            logical records being processed by the different mappers)
    What about multiple file formats stored in the same
     Archive or SequenceFile?
    Multiple formats stored in the same directory?
    How are multiple file types typically handled in general?
        e.g. factory pattern, White p. 80
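
A hedged fragment of the MultipleInputs setup. The paths and the two toy mappers are hypothetical; what matters is that both mappers emit the same (key, value) types.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class MultiFormatJobSetup {
      // Two hypothetical mappers with identical output signatures (Text, Text).
      public static class PlainTextMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable k, Text v,
            OutputCollector<Text, Text> out, Reporter r) throws IOException {
          out.collect(new Text("line"), v);      // toy: tag each text line
        }
      }

      public static class SeqFileMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text k, Text v,
            OutputCollector<Text, Text> out, Reporter r) throws IOException {
          out.collect(k, v);                     // toy: pass records through
        }
      }

      // Bind a different InputFormat & Mapper to each input path; one reducer
      // then sees the merged (key, value) stream from both mappers.
      static void configureInputs(JobConf conf) {
        MultipleInputs.addInputPath(conf, new Path("/data/plain-text"),
            TextInputFormat.class, PlainTextMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/data/packed"),
            SequenceFileInputFormat.class, SeqFileMapper.class);
      }
    }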
White 77-86, Lam 153-155

Data Compression
   Big data = big disk space & I/O-bound transfer times
       Affects both intermediate (mapper output) and persistent data
   Compression makes big data less big (but still cool)
       Often ~1/4 the size of the original data
   Main issues
       Does the compression format support splitting?
         • What happens to parallelization if an entire 8GB compressed file has
           to be decompressed before we can access the splits?
       Compression/decompression ratio vs. speed
         • More compression reduces disk space and transfer times, but…
         • Slow compression can take longer than reduced transfer time savings
         • Use native libraries!
Courtesy of Tom White’s Hadoop: The Definitive Guide, 2nd Edition (2010), Ch. 4

Slow; decompression can’t keep pace with disk reads
Compression Speed
      LZO 2x faster than gzip
      LZO ~15-20x faster than bzip2
http://www.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/




 http://arunxjacob.blogspot.com/2011/04/rolling-out-splittable-lzo-on-cdh3.html
Splittable LZO to the rescue
   LZO format not internally splittable, but we can create
    a separate, accompanying index of split points
Recipe
   Get LZO from Cloudera or elsewhere, and setup
       See URL on last slide for instructions
   LZO compress files, copy to HDFS at /path
   Index them: $ hadoop jar /path/to/hadoop-lzo.jar
    com.hadoop.compression.lzo.LzoIndexer /path
   Use hadoop-lzo’s LzoTextInputFormat instead of TextInputFormat
   Voila!
Compression API for persistent data
   JobConf helper functions –or– set properties
   Input
       conf.setInputFormatClass(LzoTextInputFormat.class);
   Persistent (reducer) output
       FileOutputFormat.setCompressOutput(conf, true)
       FileOutputFormat.setOutputCompressorClass(conf, LzopCodec.class)




                                                              Courtesy of Tom White’s
                                                              Hadoop: The Definitive Guide,
                                                              2nd Edition (2010), p. 85
Compression API for intermediate data
    Similar JobConf helper functions –or– set properties
        conf.setCompressMapOutput(true)
        conf.setMapOutputCompressorClass(LzopCodec.class)




                                                     Courtesy of Chuck Lam’s
                                                     Hadoop In Action(2010),
                                                     pp. 153-155
SequenceFile & compression
   Use SequenceFile for passing data between Hadoop jobs
       Optimized for this usage case
       conf.setOutputFormat(SequenceFileOutputFormat.class)
   With compression, one more parameter to set
       Default compression per-record; almost always preferable to
        compress on a per-block basis
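
The “one more parameter” is the compression type. A hedged fragment, assuming a JobConf named conf as in the slides above:

    // Fragment: SequenceFile output with per-block (not per-record) compression.
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setOutputCompressionType(
        conf, SequenceFile.CompressionType.BLOCK);
    FileOutputFormat.setCompressOutput(conf, true);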
From “hadoop fs X” commands to mounted HDFS




                       See White p. 50;
                       hadoop: src/contrib/fuse-dfs
Hadoop Workflow
[Diagram: you, developing locally, interact with a Hadoop cluster]

1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job
3a. Go back to Step 2
4. Retrieve data from HDFS
On Amazon: With EC2
[Diagram: you, developing locally, interact with your Hadoop cluster on EC2]

0. Allocate Hadoop cluster
1. Load data into HDFS
2. Develop code locally
3. Submit MapReduce job
3a. Go back to Step 2
4. Retrieve data from HDFS
5. Clean up!

Uh oh. Where did the data go?
On Amazon: EC2 and S3
[Diagram: your Hadoop cluster runs on EC2 (The Cloud), alongside S3 (Persistent Store).
Copy from S3 to HDFS before computing; copy from HDFS to S3 afterward.]