Introduction to
                        Hadoop/HBase
                             Shawn Hermans
                          Omaha Java Users Group
                            March 19th, 2013




Tuesday, March 26, 13
About Me

                    • Mathematician/Physicist turned Consultant
                    • Graduate Student in CS at UNO
                    • Current Software Engineer at Sojern


Introduction to Hadoop




What is Hadoop?
             An Ecosystem of Open Source Tools
              for Handling Big Data Problems on
                    Commodity Hardware

     Ecosystem projects include: HDFS, MapReduce, YARN, HBase, Hive, Pig,
     Mahout, Zookeeper, Oozie, Sqoop, Flume, Avro, HCatalog, Whirr, and Hue


First Impression




High Level
      • HDFS - Distributed File System
      • MapReduce - Scalable, Distributed Batch Processing
      • HBase - Low Latency, Random Access to Large Amounts of Data
                               Retrieved from: http://en.wikipedia.org/wiki/Apache_Hadoop



Hadoop Use Cases

                    • Extract, Transform, Load
                    • Data Mining
                    • Data Warehousing
                    • Analytics

                    Hadoop is well suited for “Big Data” use cases.

What is Big Data?
                  Data Source               Size
    Wikipedia Database Dump                 9GB
    Open Street Map                         19GB
    Common Crawl                            81TB
    1000 Genomes                            200TB
    Large Hadron Collider                   15PB annually

    Gigabytes - normal size for relational databases
    Terabytes - relational databases may start to experience scaling issues
    Petabytes - relational databases struggle to scale without a lot of fine tuning




RDBMS vs Hadoop
           Large Postgres Installs
             Organization      Storage
             eBay              1.4PB
             IRS               150TB
             Yahoo             2PB

           Large Hadoop Installs
             Organization      Nodes          Storage
             Quantcast         3,000 cores    3PB
             Yahoo             4,000 nodes    70PB
             Facebook          2,300+ nodes   100PB+


HDFS




HDFS
             • Optimized for high throughput, not low latency
             • Not POSIX compliant
             • Not optimized for small files and random reads


HDFS Examples
 hdfs dfs -mkdir hdfs://nn.example.com/tmp/foo


 hdfs dfs -rmr hdfs://nn.example.com/tmp/foo


 hdfs dfs -get hdfs://nn.example.com/tmp/foo /tmp/bar


  hdfs dfs -ls hdfs://nn.example.com/tmp/foo
  -rw-r--r--   3 dev supergroup          0 2013-03-12 15:28 hdfs://../_SUCCESS
  drwxr-xr-x   - dev supergroup          0 2013-03-12 15:27 hdfs://../_logs
  -rw-r--r--   3 dev supergroup   46812001 2013-03-12 15:28 hdfs://../part-r-00000




Common File Formats
                    • Avro
                    • SequenceFile
                    • RCFile
                    • Protocol Buffers
                    • Thrift
                    • CSV/TSV
Common Compression
                  • Snappy
                  • LZO
                  • Gzip
                  • Bzip
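
A practical note on this list: a plain gzip file is not splittable, so a single mapper must read it end to end, while Snappy and LZO trade compression ratio for speed. As a JDK-only sketch (illustrative; this is not Hadoop's codec API), the gzip round trip looks like:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

// Round-trips a string through gzip using only the JDK, to show the kind
// of codec Hadoop applies per file or per block.
public class GzipRoundTrip {
    static byte[] compress(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    static String decompress(byte[] data) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "hadoop ".repeat(1000); // highly repetitive, compresses well
        byte[] packed = compress(text);
        System.out.println(text.length() + " chars -> " + packed.length + " bytes");
    }
}
```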


MapReduce




MapReduce
                    • Map - Emit key/value pairs from data
                    • Reduce - Collect data with common keys
                    • Tries to minimize moving data between nodes

                        Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview
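
The three steps above can be sketched in plain Java without Hadoop; the class and method names here are illustrative only, not Hadoop's API:

```java
import java.util.*;
import java.util.stream.*;

// In-memory sketch of the MapReduce flow: map emits (word, 1) pairs,
// the framework shuffles pairs so common keys land together, and
// reduce sums the values for each key.
public class WordCountSketch {
    // "map": emit one (word, 1) pair per token in the line
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "shuffle + reduce": group pairs by key and sum the values
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(
                Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : List.of("the quick brown fox", "the lazy dog"))
            pairs.addAll(map(doc));
        System.out.println(reduce(pairs)); // e.g. counts with the=2
    }
}
```

In real Hadoop the shuffle happens across machines, which is why the framework tries to run map tasks on the nodes that already hold the data.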




Word Count

                    • Given a corpus of documents, count the number of
                        times each word occurs
                    • The “Hello World” of MapReduce


Mapper
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }




                                Taken from http://wiki.apache.org/hadoop/WordCount




Reducer
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }




                                Taken from http://wiki.apache.org/hadoop/WordCount



Program
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "wordcount");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }


                                      Taken from http://wiki.apache.org/hadoop/WordCount



Problem With MR

                    • Cumbersome API
                    • Too Low Level for Most Programs
                    • Requires Writing Java (unfamiliar to many data analysts)




MapReduce Alternatives
                    • Hive - SQL-like query language for Hadoop
                    • Pig - Procedural, high-level language for
                        creating Hadoop workflows
                    • Cascading - Java API for creating complex
                        MapReduce workflows
                    • Cascalog - Clojure-based query language

Hive

        SELECT word, COUNT(*)
        FROM input
        LATERAL VIEW explode(split(text, ' ')) lTable AS word
        GROUP BY word;




                           Taken from: http://stackoverflow.com/questions/10039949/word-count-program-in-hive



Pig
       input_lines = LOAD '/tmp/the-internet' AS (line:chararray);
       words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
       word_groups = GROUP words BY word;
       word_count = FOREACH word_groups GENERATE group, COUNT(words);
       STORE word_count INTO '/tmp/words-counted';




                                       Based on: http://en.wikipedia.org/wiki/Pig_(programming_tool)



Cascading
                        Scheme sourceScheme = new TextLine( new Fields( "line" ) );
                        Tap source = new Hfs( sourceScheme, inputPath );

                        Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
                        Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

                        Pipe assembly = new Pipe( "wordcount" );
                        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
                        Function function = new RegexGenerator( new Fields( "word" ), regex );
                        assembly = new Each( assembly, new Fields( "line" ), function );
                        assembly = new GroupBy( assembly, new Fields( "word" ) );
                        Aggregator count = new Count( new Fields( "count" ) );
                        assembly = new Every( assembly, count );

                        Properties properties = new Properties();
                        FlowConnector.setApplicationJarClass( properties, Main.class );
                        FlowConnector flowConnector = new FlowConnector( properties );
                        Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
                        flow.complete();




                                                    Taken from: http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html


PyCascading
      from pycascading.helpers import *

      @udf_map(produces=['word'])
      def split_words(tuple):
          for word in tuple.get(1).split():
              yield [word]

      def main():
          flow = Flow()

          input = flow.source(Hfs(TextLine(), 'pycascading_data/town.txt'))
          output = flow.tsv_sink('pycascading_data/out')

          input | split_words | group_by('word', native.count()) | output

          flow.run(num_reducers=2)




                                            From: https://github.com/twitter/pycascading/blob/master/examples/word_count.py




Cascalog
    (defn lowercase [w] (.toLowerCase w))

    (?<- (stdout) [?word ?count]
        (sentence ?s) (split ?s :> ?word1)
        (lowercase ?word1 :> ?word) (c/count ?count))




                        Taken from: http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-haado.html




HBase




HBase
                    • NoSQL
                    • Distributed
                    • Versioned
                    • Timestamped
                    • Columnar

HBase Use Cases

                    • Web Crawl Table
                    • Large Scale Web Messaging
                    • Sensor Data Collection
                    • Analytics

HBase Architecture
                    • Rows are split across regions
                    • Each region is hosted by a single RegionServer
                    • Reads and writes to a single row are atomic
                    • A row is a key/value pair container

HBase Data Model
      (rowkey, column family, column, timestamp) -> value

       • Column Families Control Physical Storage
        • TTL
        • Versions
        • Compression
        • Bloom Filters
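
The tuple above is often described as a nested sorted map. A JDK-only sketch of that idea (the String and long types here are a simplification; real HBase keys and values are byte arrays, and the helper names are illustrative):

```java
import java.util.*;

// Conceptual model of an HBase table:
// rowkey -> column family -> column -> timestamp (newest first) -> value
public class HBaseModelSketch {
    static NavigableMap<String, NavigableMap<String, NavigableMap<String,
            NavigableMap<Long, String>>>> table = new TreeMap<>();

    static void put(String row, String family, String qual, long ts, String val) {
        table.computeIfAbsent(row, k -> new TreeMap<>())
             .computeIfAbsent(family, k -> new TreeMap<>())
             // timestamps sort descending so the newest version comes first
             .computeIfAbsent(qual, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(ts, val);
    }

    // a plain get returns the newest version of the cell
    static String get(String row, String family, String qual) {
        return table.get(row).get(family).get(qual).firstEntry().getValue();
    }

    public static void main(String[] args) {
        put("row1", "f1", "qual1", 100L, "old");
        put("row1", "f1", "qual1", 200L, "new");
        System.out.println(get("row1", "f1", "qual1")); // newest version wins
    }
}
```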

HBase vs RDBMS
                    • Horizontal Scaling
                    • Atomic Across Single Row Only
                    • No JOINs
                    • No Indexes
                    • No Data Types
                    • Normalization = Bad in HBase
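
Because HBase has no data types, every cell is raw bytes and the client does all the encoding. A JDK-only sketch of the conversions that HBase's `org.apache.hadoop.hbase.util.Bytes` utility normally handles (class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// HBase stores every rowkey, qualifier, and value as a raw byte array;
// the client serializes and deserializes typed values itself.
public class BytesSketch {
    // big-endian encoding, so byte-wise comparison preserves order
    // for non-negative longs
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static long toLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] cell = toBytes(42L);      // what actually lands in a cell
        System.out.println(cell.length); // prints 8
        System.out.println(toLong(cell)); // prints 42
    }
}
```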
HBase vs HDFS

            Batch Processing - HBase is 4 to 5 times slower than HDFS

            Random Access    - HBase uses techniques to reduce read/write latency

            Small Records    - Normal HDFS block size is 128MB or larger

            # of Records     - # of HDFS files limited by NameNode RAM




Create Table
   hbase(main):001:0> create 'table1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000}, {NAME => 'f2'}, {NAME => 'f3'}
   0 row(s) in 1.5700 seconds

   hbase(main):002:0> describe 'table1'
   DESCRIPTION                                                                  ENABLED
   {NAME => 'table1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE',        true
   REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE',
   MIN_VERSIONS => '0', TTL => '2592000', BLOCKSIZE => '65536',
   IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'f2',
   BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3',
   COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647',
   BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
   {NAME => 'f3', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0',
   VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0',
   TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
   BLOCKCACHE => 'true'}]}
   1 row(s) in 0.0170 seconds




Get
           Configuration conf = HBaseConfiguration.create();
           HTable table = new HTable(conf, "testtable");
           Get get = new Get(Bytes.toBytes("row1"));
           get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
           Result result = table.get(get);
           byte[] val = result.getValue(Bytes.toBytes("colfam1"),
                   Bytes.toBytes("qual1"));
           System.out.println("Value: " + Bytes.toString(val));




                                         Example from HBase: The Definitive Guide



Put
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "testtable");
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
      put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual2"), Bytes.toBytes("val2"));
      table.put(put);




                                         Example from HBase: The Definitive Guide



Scan
       Scan scan = new Scan();
       scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-5")).
           addColumn(Bytes.toBytes("colfam2"), Bytes.toBytes("col-33")).
           setStartRow(Bytes.toBytes("row-10")).
           setStopRow(Bytes.toBytes("row-20"));
       ResultScanner scanner = table.getScanner(scan);
       for (Result res : scanner) {
           System.out.println(res);
       }
       scanner.close();


                                  Example from HBase: The Definitive Guide



HBase Schemas
                    • Column Families control physical storage
                    • Prefer short names for columns
                    • Use “smart” row keys
                    • Avoid incremental keys
                    • Use “smart” column names for nested relationships
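
Two of the row-key tips above, sketched with hypothetical helpers (the bucket count and key layouts are assumptions for illustration, not from the talk): salting spreads sequential keys across regions instead of hammering the last one, and a reversed timestamp makes the newest rows sort first under HBase's ascending byte order.

```java
// Illustrative row-key construction; BUCKETS and the "|" separator
// are arbitrary choices, not HBase conventions.
public class RowKeySketch {
    static final int BUCKETS = 8; // hypothetical number of salt buckets

    // prefix the key with a deterministic salt so sequential ids
    // land in different regions
    static String salted(String id) {
        int salt = Math.floorMod(id.hashCode(), BUCKETS);
        return salt + "|" + id;
    }

    // Long.MAX_VALUE - ts sorts newest-first under ascending key order
    static String reverseTimestampKey(String userId, long ts) {
        return userId + "|" + (Long.MAX_VALUE - ts);
    }

    public static void main(String[] args) {
        System.out.println(salted("event-000123"));
        System.out.println(reverseTimestampKey("user42", 1363600000000L));
    }
}
```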


Hadoop in Practice




Issues

                    • Still in infancy
                    • Complex configuration options
                    • Confusing versioning
                    • Distributed systems are inherently complex

Distributions
                    • Cloudera’s Distribution Including Hadoop
                        (CDH)
                    • MapR
                    • Hortonworks Data Platform (HDP)
                    • Amazon Elastic MapReduce (EMR)

General Tips

                    • Use a distribution
                    • Monitor your cluster
                    • Understand how HDFS, MapReduce,
                        HBase, etc. work at a fundamental level




Conclusion
                    • Hadoop is a great tool for working with big
                        data on “cheap” hardware
                    • Power comes at the cost of complexity
       Twitter: @shawnhermans
       Email: shawnhermans@gmail.com


Backups



Key Papers
                  Paper                                                        Technology
                  The Google File System                                       HDFS
                  MapReduce: Simplified Data Processing on Large Clusters      MapReduce
                  Bigtable: A Distributed Storage System for Structured Data   HBase
                  Pig Latin: A Not-So-Foreign Language for Data Processing     Pig
                  Dremel: Interactive Analysis of Web-Scale Datasets           Impala




Hadoop Versioning





TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 

Kürzlich hochgeladen (20)

Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 

Omaha Java Users Group - Introduction to HBase and Hadoop

  • 1. Introduction to Hadoop/HBase Shawn Hermans Omaha Java Users Group March 19th, 2013 Tuesday, March 26, 13
  • 2. About Me • Mathematician/Physicist turned Consultant • Graduate Student in CS at UNO • Current Software Engineer at Sojern
  • 4. What is Hadoop? An Ecosystem of Open Source Tools for Handling Big Data Problems on Commodity Hardware: Hue, Mahout, Zookeeper, Oozie, HBase, Whirr, HDFS, Hive, MapReduce, Avro, Pig, HCatalog, Sqoop, Flume, YARN
  • 6. High Level • HDFS - Distributed File System • MapReduce - Scalable, Distributed Batch Processing • HBase - Low Latency, Random Access to Large Amounts of Data (Retrieved from: http://en.wikipedia.org/wiki/Apache_Hadoop)
  • 7. Hadoop Use Cases (Hadoop is well suited for “Big Data” use cases): • Extract, Transform, Load • Data Mining • Data Warehousing • Analytics
  • 8. What is Big Data? • Gigabytes (normal size for relational databases): Wikipedia Database Dump, 9GB; Open Street Map, 19GB • Terabytes (relational databases may start to experience scaling issues): Common Crawl, 81TB; 1000 Genomes, 200TB • Petabytes (relational databases struggle to scale without a lot of fine tuning): Large Hadron Collider, 15PB annually
  • 9. RDBMS vs Hadoop • Large Postgres installs: EBay, 1.4PB; IRS, 150TB; Yahoo, 2PB • Large Hadoop installs: Quantcast, 3,000 cores / 3PB; Yahoo, 4,000 nodes / 70PB; Facebook, 2,300+ nodes / 100PB+
  • 11. HDFS • Optimized for high throughput, not low latency • Not POSIX compliant • Not optimized for small files and random reads
  • 12. HDFS Examples
      hdfs dfs -mkdir hdfs://nn.example.com/tmp/foo
      hdfs dfs -rmr hdfs://nn.example.com/tmp/foo
      hdfs dfs -get hdfs://nn.example.com/tmp/foo /tmp/bar
      hdfs dfs -ls hdfs://nn.example.com/tmp/foo
      -rw-r--r--   3 dev supergroup        0 2013-03-12 15:28 hdfs://../_SUCCESS
      drwxr-xr-x   - dev supergroup        0 2013-03-12 15:27 hdfs://../_logs
      -rw-r--r--   3 dev supergroup 46812001 2013-03-12 15:28 hdfs://../part-r-00000
  • 13. Common File Formats • Avro • SequenceFile • RCFile • Protocol Buffers • Thrift • CSV/TSV
  • 14. Common Compression • Snappy • LZO • Gzip • Bzip
  • 16. MapReduce • Map - Emit key/value pairs from data • Reduce - Collect data with common keys • Tries to minimize moving data between nodes (Image taken from: https://developers.google.com/appengine/docs/python/dataprocessing/overview)
  • 17. Word Count • Given a corpus of documents, count the number of times each word occurs • The “Hello World” of MapReduce
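Before the Hadoop version on the next slides, the map and reduce phases can be sketched in plain Java with no Hadoop dependency at all. This is a toy, in-memory analogue for intuition only (class and method names are ours, not part of any Hadoop API): map emits (word, 1) pairs, reduce sums the values collected under each key.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for every token in the line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "Reduce" phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("hello world", "hello hadoop");
        Map<String, Integer> counts =
                reduce(corpus.stream().flatMap(WordCountSketch::map));
        System.out.println(counts); // hello appears twice
    }
}
```

The real framework does the same shuffle-and-sum, but spreads map tasks across HDFS blocks and routes each key's pairs to a reducer over the network.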
  • 18. Mapper
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  context.write(word, one);
              }
          }
      }
    Taken from http://wiki.apache.org/hadoop/WordCount
  • 19. Reducer
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }
    Taken from http://wiki.apache.org/hadoop/WordCount
  • 20. Program
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
      }
    Taken from http://wiki.apache.org/hadoop/WordCount
  • 21. Problems With MR • Cumbersome API • Too Low Level for Most Programs • Requires Writing Java (Unfamiliar to many data analysts)
  • 22. MapReduce Alternatives • Hive - SQL-like query language for Hadoop • Pig - Procedural, high-level language for creating Hadoop workflows • Cascading - Java API for creating complex MapReduce workflows • Cascalog - Clojure-based query language
  • 23. Hive
      SELECT word, COUNT(*) FROM input
      LATERAL VIEW explode(split(text, ' ')) lTable AS word
      GROUP BY word;
    Taken from: http://stackoverflow.com/questions/10039949/word-count-program-in-hive
  • 24. Pig
      input_lines = LOAD '/tmp/the-internet' AS (line:chararray);
      words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      word_groups = GROUP words BY word;
      word_count = FOREACH word_groups GENERATE group, COUNT(words);
      STORE word_count INTO '/tmp/words-counted';
    Based on: http://en.wikipedia.org/wiki/Pig_(programming_tool)
  • 25. Cascading
      Scheme sourceScheme = new TextLine(new Fields("line"));
      Tap source = new Hfs(sourceScheme, inputPath);
      Scheme sinkScheme = new TextLine(new Fields("word", "count"));
      Tap sink = new Hfs(sinkScheme, outputPath, SinkMode.REPLACE);
      Pipe assembly = new Pipe("wordcount");
      String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
      Function function = new RegexGenerator(new Fields("word"), regex);
      assembly = new Each(assembly, new Fields("line"), function);
      assembly = new GroupBy(assembly, new Fields("word"));
      Aggregator count = new Count(new Fields("count"));
      assembly = new Every(assembly, count);
      Properties properties = new Properties();
      FlowConnector.setApplicationJarClass(properties, Main.class);
      FlowConnector flowConnector = new FlowConnector(properties);
      Flow flow = flowConnector.connect("word-count", source, sink, assembly);
      flow.complete();
    Taken from: http://docs.cascading.org/cascading/1.2/userguide/html/ch02.html
  • 26. PyCascading
      from pycascading.helpers import *

      @udf_map(produces=['word'])
      def split_words(tuple):
          for word in tuple.get(1).split():
              yield [word]

      def main():
          flow = Flow()
          input = flow.source(Hfs(TextLine(), 'pycascading_data/town.txt'))
          output = flow.tsv_sink('pycascading_data/out')
          input | split_words | group_by('word', native.count()) | output
          flow.run(num_reducers=2)
    From: https://github.com/twitter/pycascading/blob/master/examples/word_count.py
  • 27. Cascalog
      (defn lowercase [w] (.toLowerCase w))
      (?<- (stdout) [?word ?count]
           (sentence ?s) (split ?s :> ?word1)
           (lowercase ?word1 :> ?word) (c/count ?count))
    Taken from: http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-haado.html
  • 29. HBase • NoSQL • Distributed • Versioned • Timestamped • Columnar
  • 30. HBase Use Cases • Web Crawl Table • Large Scale Web Messaging • Sensor Data Collection • Analytics
  • 31. HBase Architecture • Rows are split across regions • Each region is hosted by a single RegionServer • Reads and writes to a single row are atomic • Row is a key/value pair container
  • 32. HBase Data Model • (rowkey, column family, column, timestamp) -> value • Column Families Control Physical Storage • TTL • Versions • Compression • Bloom Filters
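One way to internalize the 4-tuple addressing above is a toy in-memory model in plain Java (not the HBase API; all names here are ours): rows are kept in sorted order, each cell holds multiple timestamped versions, and a read returns the newest version by default.

```java
import java.util.*;

// Toy analogue of HBase's keyspace:
// (row key, family:qualifier, timestamp) -> value, rows sorted by key.
public class ToyTable {
    // TreeMap keeps rows sorted, loosely like rows within a region.
    private final NavigableMap<String, Map<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    public void put(String row, String family, String qualifier, long ts, byte[] value) {
        rows.computeIfAbsent(row, r -> new HashMap<>())
            // Versions sorted newest-first via a reversed comparator.
            .computeIfAbsent(family + ":" + qualifier,
                    c -> new TreeMap<>(Comparator.reverseOrder()))
            .put(ts, value);
    }

    // Returns the newest version, mirroring HBase's default read behavior.
    public byte[] get(String row, String family, String qualifier) {
        Map<String, NavigableMap<Long, byte[]>> cols = rows.get(row);
        if (cols == null) return null;
        NavigableMap<Long, byte[]> versions = cols.get(family + ":" + qualifier);
        return (versions == null || versions.isEmpty())
                ? null : versions.firstEntry().getValue();
    }
}
```

Writing the same cell twice with increasing timestamps leaves both versions in place; a plain get sees only the latest, which is why VERSIONS and TTL on a column family matter for storage.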
  • 33. HBase vs RDBMS • Horizontal Scaling • Atomic Across Single Row Only • No JOINs • No Indexes • No Data Types • Normalization = Bad in HBase
  • 34. HBase vs HDFS • Batch Processing: HBase is 4 to 5 times slower than raw HDFS • Random Access: HBase uses techniques to reduce read/write latency • Small Records: normal HDFS block size is 128MB or larger • # of Records: # of HDFS files limited by NameNode RAM
  • 35. Create Table
      hbase(main):001:0> create 'table1', {NAME => 'f1', VERSIONS => 1, TTL => 2592000}, {NAME => 'f2'}, {NAME => 'f3'}
      0 row(s) in 1.5700 seconds
      hbase(main):002:0> describe 'table1'
      DESCRIPTION                                                          ENABLED
      {NAME => 'table1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '1', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2592000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'f2', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'f3', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} true
      1 row(s) in 0.0170 seconds
  • 36. Get
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "testtable");
      Get get = new Get(Bytes.toBytes("row1"));
      get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
      Result result = table.get(get);
      byte[] val = result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
      System.out.println("Value: " + Bytes.toString(val));
    Example from HBase: The Definitive Guide
  • 37. Put
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "testtable");
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"), Bytes.toBytes("val1"));
      put.add(Bytes.toBytes("colfam1"), Bytes.toBytes("qual2"), Bytes.toBytes("val2"));
      table.put(put);
    Example from HBase: The Definitive Guide
  • 38. Scan
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("col-5")).
           addColumn(Bytes.toBytes("colfam2"), Bytes.toBytes("col-33")).
           setStartRow(Bytes.toBytes("row-10")).
           setStopRow(Bytes.toBytes("row-20"));
      ResultScanner scanner = table.getScanner(scan);
      for (Result res : scanner) {
          System.out.println(res);
      }
      scanner.close();
    Example from HBase: The Definitive Guide
  • 39. HBase Schemas • Column Families control physical storage • Prefer short names for columns • Use “smart” row keys • Avoid incremental keys • Use “smart” column names for nested relationships
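"Avoid incremental keys" because rows live in key order: monotonically increasing keys (timestamps, sequence numbers) pile every write onto the region holding the tail of the keyspace. A common mitigation is salting, prefixing the natural key with a hash-derived bucket. A minimal sketch in plain Java (the bucket count of 8 and all names here are illustrative, not from the slides or the HBase API):

```java
public class SaltedKey {
    private static final int BUCKETS = 8; // arbitrary; pick to match your region count

    // Prefix the natural key with a stable bucket number so sequential
    // keys spread across BUCKETS regions instead of hot-spotting one.
    public static String salt(String naturalKey) {
        int bucket = Math.abs(naturalKey.hashCode() % BUCKETS);
        return bucket + "-" + naturalKey;
    }
}
```

The trade-off: a range scan over the natural key now has to fan out into one scan per bucket, so salting suits write-heavy tables more than scan-heavy ones.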
  • 41. Issues • Still in infancy • Complex configuration options • Confusing versioning • Distributed systems are inherently complex
  • 42. Distributions • Cloudera’s Distribution Including Hadoop (CDH) • MapR • Hortonworks Data Platform (HDP) • Amazon Elastic MapReduce (EMR)
  • 43. General Tips • Use a distribution • Monitor your cluster • Understand how HDFS, MapReduce, HBase, etc. work at a fundamental level
  • 44. Conclusion • Hadoop is a great tool for working with big data on “cheap” hardware • Power comes at the cost of complexity • Twitter: @shawnhermans • Email: shawnhermans@gmail.com
  • 47. Key Papers (Paper → Technology) • The Google File System → HDFS • MapReduce: Simplified Data Processing on Large Clusters → MapReduce • BigTable: A Distributed Storage System for Structured Data → HBase • Pig Latin: A Not-So-Foreign Language for Data Processing → Pig • Dremel: Interactive Analysis of Web-Scale Datasets → Impala