Efficient Object Model in Java


Slides by Zheng Shao, Facebook
Part of Apache Hadoop Hive Project
Object Inspector
On-disk Data Format
▪   Single on-disk format systems
    ▪   Simplicity
▪   Multiple on-disk format systems
    ▪   Ease-of-use
    ▪   Ease-of-integration
    ▪   Flexibility: better trade-off between space, performance, etc.
▪   Hive allows multiple on-disk formats
Example Multiple on-disk Formats
▪   File Format:
    ▪   Row-based
    ▪   Column-based
    ▪   Block-based
▪   Row format:
    ▪   Text-based
    ▪   Binary-based
    ▪   Customized
▪   Index format
In-memory Data Format
▪   Single in-memory format systems
    ▪   Simplicity: Simpler code
▪   Multiple in-memory format systems
    ▪   Ease-of-integration: other systems may use their own format
    ▪   Performance:
        ▪   Multiple on-disk formats/external formats + efficient loading
            => Multiple in-memory formats
▪   Hive allows multiple in-memory formats
Example Multiple in-memory Formats
▪   Integer:
    ▪   Integer
    ▪   IntWritable
    ▪   LazyInteger
▪   String:
    ▪   String
    ▪   Text
Multiple In-memory Format Design Patterns
▪   Object-oriented:
    ▪   A single interface/base class for Integer
    ▪   Multiple derived classes
▪   Delegation:
    ▪   data stored in object
    ▪   format/operations stored in objectInspector
    ▪   a pair of object and objectInspector represents a data unit
▪   It's possible to wrap either one up to conform to the other's pattern.
Multiple In-memory Format Design Patterns
▪   In OO, we need an interface HiveInteger to represent Integers
    ▪   Make Integer, IntWritable classes all implement it.
    ▪   However, Integer class is final (not extendable) and does not
        implement HiveInteger
    ▪   We need to do a conversion every time we exchange data with UDF,
        SerDe (Thrift), or other libraries (unless they know HiveInteger; this
        is a bad assumption to make in open systems).
▪   Delegation will be a better idea because
    ▪   For Integer, we have a JavaIntegerObjectInspector
    ▪   For IntWritable, we have a WritableIntegerObjectInspector
    ▪   We convert params and return values only if necessary
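The delegation pattern above can be sketched in a few lines of Java. This is only an illustration under assumptions: IntInspector, IntBox (a stand-in for Hadoop's IntWritable), and the two inspector classes are hypothetical names, not the actual Hive interfaces.

```java
// Hypothetical, simplified stand-ins; not the actual Hive interfaces.
interface IntInspector {
    /** Reads the int value out of an object whose layout this inspector understands. */
    int get(Object o);
}

/** Stand-in for Hadoop's IntWritable: a mutable, reusable int holder. */
final class IntBox {
    private int value;
    void set(int v) { value = v; }
    int get() { return value; }
}

/** Inspector for plain java.lang.Integer objects. */
final class JavaIntInspector implements IntInspector {
    public int get(Object o) { return (Integer) o; }
}

/** Inspector for IntBox objects. */
final class BoxIntInspector implements IntInspector {
    public int get(Object o) { return ((IntBox) o).get(); }
}

final class DelegationDemo {
    /** Operator code sees only (object, inspector) pairs, never the concrete class. */
    static int add(Object a, IntInspector aOI, Object b, IntInspector bOI) {
        return aOI.get(a) + bOI.get(b);
    }

    public static void main(String[] args) {
        IntBox box = new IntBox();
        box.set(40);
        // An Integer and an IntBox are combined without converting either one.
        System.out.println(add(Integer.valueOf(2), new JavaIntInspector(), box, new BoxIntInspector()));
    }
}
```

Neither Integer nor the holder class has to implement a shared interface; supporting a new in-memory format means adding one more inspector.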
Delegation Method List
▪   General methods:
    ▪   isNull(object o)
    ▪   hashCode(object o)
    ▪   compare(object o)
    ▪   clone(object o)
▪   Primitive Objects:
    ▪   primitive getValue(object o)
▪   String Objects:
    ▪   String getString(object o)
    ▪   Text getText(object o)
▪   List Objects:
    ▪   getListSize(object o)
    ▪   getListElement(object o)
    ▪   getList(object o)
▪   Map Objects:
    ▪   getMapSize(object o)
    ▪   getValueForKey(object o)
    ▪   getMap(object o)
▪   Struct Objects:
    ▪   getStructField(object o)
    ▪   getStructAsAList(object o)
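A rough sketch of how code can walk a nested object using only methods like those listed above. The method names follow the slide, but the signatures, the interface hierarchy, and the class names (Inspector, StandardListInspector, InspectorWalk, and so on) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified inspector hierarchy; signatures are assumptions.
interface Inspector {}

interface PrimitiveInspector extends Inspector {
    Object getValue(Object o);
}

interface ListInspector extends Inspector {
    Inspector elementInspector();
    int getListSize(Object o);
    Object getListElement(Object o, int index);
}

interface MapInspector extends Inspector {
    Inspector valueInspector();
    Map<?, ?> getMap(Object o);
}

/** Primitive stored as a plain Java object (Integer, String, ...). */
final class JavaPrimitiveInspector implements PrimitiveInspector {
    public Object getValue(Object o) { return o; }
}

/** List stored as a java.util.List. */
final class StandardListInspector implements ListInspector {
    private final Inspector element;
    StandardListInspector(Inspector element) { this.element = element; }
    public Inspector elementInspector() { return element; }
    public int getListSize(Object o) { return ((List<?>) o).size(); }
    public Object getListElement(Object o, int index) { return ((List<?>) o).get(index); }
}

/** Map stored as a java.util.Map. */
final class StandardMapInspector implements MapInspector {
    private final Inspector value;
    StandardMapInspector(Inspector value) { this.value = value; }
    public Inspector valueInspector() { return value; }
    public Map<?, ?> getMap(Object o) { return (Map<?, ?>) o; }
}

final class InspectorWalk {
    /** Renders any object using only its inspector, without knowing its class. */
    static String render(Object o, Inspector oi) {
        if (o == null) return "null";
        if (oi instanceof PrimitiveInspector) {
            return String.valueOf(((PrimitiveInspector) oi).getValue(o));
        }
        List<String> parts = new ArrayList<>();
        if (oi instanceof ListInspector) {
            ListInspector loi = (ListInspector) oi;
            for (int i = 0; i < loi.getListSize(o); i++) {
                parts.add(render(loi.getListElement(o, i), loi.elementInspector()));
            }
            return "[" + String.join(",", parts) + "]";
        }
        // Anything else is assumed to be a map in this sketch.
        MapInspector moi = (MapInspector) oi;
        for (Map.Entry<?, ?> e : moi.getMap(o).entrySet()) {
            parts.add(e.getKey() + "=" + render(e.getValue(), moi.valueInspector()));
        }
        return "{" + String.join(",", parts) + "}";
    }
}
```

For example, `InspectorWalk.render(Arrays.asList(1, null, 3), new StandardListInspector(new JavaPrimitiveInspector()))` yields `[1,null,3]`. A lazy or Writable-backed format would plug in by supplying different inspector implementations while `render` stays unchanged.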
SerDe
Where is SerDe?
[Diagram: layers of a Hive job, top to bottom]
▪   Hive Operators (in the Mapper and in the Reducer)
▪   ObjectInspector layer: Hierarchical Objects
    ▪   Java Object: object of a Java class
    ▪   Standard Object: uses ArrayList for struct and array, HashMap for map
    ▪   LazyObject: lazily deserialized
▪   SerDe layer: Writables
    ▪   Text('imp 1.0 3 54') // UTF8 encoded
    ▪   BytesWritable(\x3F\x64\x72\x00)
▪   FileFormat / Hadoop Serialization layer
    ▪   File on HDFS (input), Map Output File, File on HDFS (output)
    ▪   Streams of thrift_record<…> or of text rows (imp 1.0 3 54; Imp 0.2 1 33; clk 2.2 8 212; Imp 0.7 2 22)
    ▪   User Script
SerDe, ObjectInspector and TypeInfo
[Diagram, reconstructed as a list]
▪   Java classes of the example row:
    ▪   class HO { HashMap<String, String> a, Integer b, List<ClassC> c, String d; }
    ▪   class ClassC { Integer a, Integer b; }
▪   Type Info: struct of (map of (string, string), int, list of struct of (int, int), string)
▪   Hierarchical Object: List(HashMap("a" → "av", "b" → "bv"), 23, List(List(1,null),List(2,4),List(5,null)), "abcd")
▪   ObjectInspector1 (for the struct): getType, getFieldOI, getStructField
▪   ObjectInspector2 (for the map field): getType, getMapValueOI, getMapValue
▪   ObjectInspector3 (for the String Object "av"): getType
▪   SerDe: deserialize, serialize, getOI
▪   Writable: Text('a=av:b=bv 23 1:2=4:5 abcd') or BytesWritable(\x3F\x64\x72\x00)
LazySimpleSerDe components
[Diagram, reconstructed as a list]
▪   byte[] data: byte[]('a=av:b=bv 23 1:2=4:5 abcd')
▪   Hierarchical Object / LazyObject (one per SerDe instance):
    ▪   LazyStruct containing LazyMap, LazyInteger, LazyArray, LazyString
    ▪   LazyMap containing LazyString keys and values
    ▪   LazyArray containing LazyStructs of LazyIntegers
▪   LazyObjectInspector (singleton):
    ▪   LazyStructOI(" "), LazyMapOI(":", "="), LazyArrayOI(":"), LazyStructOI("=")
    ▪   LazyStringOI, LazyIntegerOI, StandardIntegerOI
LazyPrimitive
▪   LazyString/LazyInteger
    ▪   setAll(byte[] data, int start, int length)
        ▪   LazyString: parse the data and create a String object
        ▪   LazyInteger: parse the data and create an Integer object
    ▪   getObject(): returns the corresponding String/Integer object
▪   Future
    ▪   Replace String/Integer with Text/IntWritable
    ▪   The Text/IntWritable object is owned by the LazyString/LazyInteger
        object.
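A minimal sketch of what a LazyInteger-style setAll/getObject pair might look like. The class name LazyIntSketch and its null-on-error behavior are assumptions; the real Hive class differs in detail. The digits are read straight from the bytes, so no String is created and no UTF-8 decoding is needed.

```java
// Hypothetical sketch; the real LazyInteger in Hive differs in detail.
final class LazyIntSketch {
    private Integer value;  // null if the bytes were not a valid int

    /** Parses ASCII digits directly from the byte range, with no String or UTF-8 decoding. */
    void setAll(byte[] data, int start, int length) {
        value = null;
        if (length <= 0) return;
        int i = start;
        int end = start + length;
        boolean negative = data[i] == '-';
        if (negative || data[i] == '+') i++;
        if (i == end) return;                    // a sign alone is not a number
        long acc = 0;
        for (; i < end; i++) {
            int digit = data[i] - '0';
            if (digit < 0 || digit > 9) return;  // non-digit byte: leave value as null
            acc = acc * 10 + digit;
            if (acc > (long) Integer.MAX_VALUE + 1) return;  // magnitude too large for int
        }
        if (negative) acc = -acc;
        if (acc > Integer.MAX_VALUE) return;     // +2147483648 does not fit
        value = (int) acc;
    }

    /** Returns the parsed Integer, or null if parsing failed. */
    Integer getObject() { return value; }
}
```

The same instance can be reused for every row by calling setAll again, which is the reuse the slides aim for.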
LazyNonPrimitive
▪   LazyStruct/LazyArray/LazyMap
    ▪   setAll(byte[] data, int start, int length)
        ▪   Remember data, start and length, and set parsed to false.
    ▪   getStructField/getArrayElement/getMapValue
        ▪   If not parsed yet, parse the bytes and remember starting positions of
            each field/element/key/value
        ▪   For Struct/Array, do setAll on the corresponding LazyObject and
            return it
        ▪   For Map, search for the serialized key and return the corresponding
            value (after doing a setAll on the value).
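The lazy parsing steps above can be sketched as follows. This is a simplified illustration, not the Hive implementation: LazyRowSketch and getField are hypothetical names, and the field is returned as a String instead of a child LazyObject to keep the sketch short.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of lazy field access; names are assumptions.
final class LazyRowSketch {
    private final byte separator;
    private byte[] data;
    private int start;
    private int length;
    private boolean parsed;
    // fieldStart[i] is where field i begins; fieldStart[fieldCount] is a sentinel.
    private int[] fieldStart = new int[8];
    private int fieldCount;

    LazyRowSketch(byte separator) { this.separator = separator; }

    /** Only remembers the byte range; no scanning happens here. */
    void setAll(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
        this.length = length;
        this.parsed = false;
    }

    /** Scans for separators once and records where each field begins. */
    private void parse() {
        fieldCount = 0;
        int end = start + length;
        int begin = start;
        for (int i = start; i <= end; i++) {
            if (i == end || data[i] == separator) {
                if (fieldCount + 1 >= fieldStart.length) {
                    fieldStart = Arrays.copyOf(fieldStart, fieldStart.length * 2);
                }
                fieldStart[fieldCount++] = begin;
                begin = i + 1;
            }
        }
        fieldStart[fieldCount] = end + 1;  // sentinel, as if one more separator followed
        parsed = true;
    }

    /** Returns field i as a String, decoding only that field's bytes; null if out of range. */
    String getField(int i) {
        if (!parsed) parse();
        if (i < 0 || i >= fieldCount) return null;
        int len = fieldStart[i + 1] - fieldStart[i] - 1;
        return new String(data, fieldStart[i], len, StandardCharsets.UTF_8);
    }
}
```

A row that is filtered out before any field is read costs only the setAll call; the separator scan and the decoding of a field happen on first access.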
Why another SerDe?
▪   Functionality:
    ▪   MetadataTypedColumnSetSerDe can only deal with String columns
    ▪   DynamicSerDe can deal with all primitive columns and primitive lists/
        maps, but it does not fully support nested types yet.
▪   Efficiency:
    ▪   Both MetadataTypedColumnSetSerDe and DynamicSerDe use
        String.split() and are not efficient for long rows
Features of LazySimpleSerDe
▪   Functionality:
    ▪   Fully compatible with MetaDataSerDe and Dynamic/TCTLSeparated
    ▪   Fully supports all nested types (Map key must be primitive)
▪   Efficiency:
    ▪   Fully supports lazy deserialization: only deserializes a field (and
        creates Objects) when asked.
    ▪   Reuses multiple levels of LazyObjects.
    ▪   Reads numbers without UTF-8 decoding
    ▪   (TODO) Fully reuse objects: IntWritable for Integer, Text for String
    ▪   (TODO) Write numbers without UTF-8 encoding
Profiling result of a mapper
▪   17%: TrackedRecordReader (should include InputFileFormat and decompression)
▪   22%: Operator.close
▪   |-12%: DynamicSerDe.serialize (NOTE: This includes UTF-8 encoding)
▪   |- 4%: mapOutputBuffer.collect (should include compression and OutputFileFormat)
▪   50%: Operator.forward
▪   |-18%: Text.decode (from LazySerDe)
▪   | |- 7%: CharacterSet.decode() (UTF-8 decoding)
▪   | |- 5%: toString() (where we create the string object)
▪   |- 3%: LazyStruct.parse (the code that searches for separators in the row)
▪   |- 3%: Arrays.asList() (from UnionStructOI.getStructFieldData)
▪   |- 8%: GroupByOperator.processHashAggr
▪   |- 3%: HashMap.get() in GroupByOperator




▪   * Performance Data from Rodrigo Schmidt
TypeInfo String specification
▪   Why not Thrift?
    ▪   Hard to parse
▪   Simple Syntax
    ▪   Type: PrimitiveType | MapType | ArrayType | StructType
    ▪   PrimitiveType: int | bigint | tinyint | smallint | double | string
    ▪   MapType: map<Type, Type>
    ▪   ArrayType: array<Type>
    ▪   StructType: struct< [Name : Type]+ >
▪   Example: array<map<string,struct<a:int,b:array<string>,c:double>>>
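The grammar above is small enough for a recursive-descent parser. The sketch below is not Hive's own parser: TypeNode and TypeStringParser are hypothetical names, and it assumes struct fields are separated by commas, as in the example string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for the TypeInfo string grammar; not Hive's implementation.
final class TypeNode {
    final String name;                                  // "int", "map", "array", "struct", ...
    final List<TypeNode> children = new ArrayList<>();  // key/value, element, or field types
    final List<String> fieldNames = new ArrayList<>();  // only filled for struct
    TypeNode(String name) { this.name = name; }
}

final class TypeStringParser {
    private static final List<String> PRIMITIVES =
            Arrays.asList("int", "bigint", "tinyint", "smallint", "double", "string");
    private final String s;
    private int pos;

    private TypeStringParser(String s) { this.s = s; }

    /** Parses a complete type string; throws IllegalArgumentException on bad input. */
    static TypeNode parse(String typeString) {
        TypeStringParser p = new TypeStringParser(typeString.replace(" ", ""));
        TypeNode t = p.type();
        if (p.pos != p.s.length()) throw new IllegalArgumentException("trailing input at " + p.pos);
        return t;
    }

    /** Reads a run of letters, digits, or underscores. */
    private String word() {
        int begin = pos;
        while (pos < s.length() && (Character.isLetterOrDigit(s.charAt(pos)) || s.charAt(pos) == '_')) pos++;
        if (begin == pos) throw new IllegalArgumentException("expected a name at " + pos);
        return s.substring(begin, pos);
    }

    private void expect(char c) {
        if (pos >= s.length() || s.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
        pos++;
    }

    /** Type: PrimitiveType | MapType | ArrayType | StructType */
    private TypeNode type() {
        TypeNode node = new TypeNode(word());
        switch (node.name) {
            case "map":
                expect('<'); node.children.add(type());
                expect(','); node.children.add(type());
                expect('>');
                break;
            case "array":
                expect('<'); node.children.add(type()); expect('>');
                break;
            case "struct":
                expect('<');
                while (true) {
                    node.fieldNames.add(word());
                    expect(':');
                    node.children.add(type());
                    if (pos < s.length() && s.charAt(pos) == ',') pos++; else break;
                }
                expect('>');
                break;
            default:
                if (!PRIMITIVES.contains(node.name)) {
                    throw new IllegalArgumentException("unknown type: " + node.name);
                }
        }
        return node;
    }
}
```

Parsing the example string gives an array node whose single child is a map from string to a struct with fields a, b, and c. A misspelled primitive name is rejected with an IllegalArgumentException.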
Future Works
Future Works of ObjectInspector
▪   Delegate all methods described earlier
    ▪   isNull(), hashCode(), compare() etc are not delegated yet
▪   Support UNION data type: HIVE-537
Future Works of SerDe
▪   LazyBinarySerDe: HIVE-553
    ▪   A binary-format sortable SerDe: serialized sorting order is the same
        as deserialized sorting order
    ▪   A binary-format compact SerDe: saving space
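To illustrate what "serialized sorting order is the same as deserialized sorting order" can mean for a single int, here is one common order-preserving encoding: big-endian bytes with the sign bit flipped. It is shown as an assumption for illustration and is not taken from the HIVE-553 design; SortableIntCodec is a hypothetical name.

```java
// Hypothetical illustration of an order-preserving encoding; not the HIVE-553 design.
final class SortableIntCodec {
    /** Big-endian bytes with the sign bit flipped, so unsigned byte order matches int order. */
    static byte[] encode(int v) {
        int u = v ^ 0x80000000;
        return new byte[] { (byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u };
    }

    /** Inverse of encode. */
    static int decode(byte[] b) {
        int u = ((b[0] & 0xFF) << 24) | ((b[1] & 0xFF) << 16) | ((b[2] & 0xFF) << 8) | (b[3] & 0xFF);
        return u ^ 0x80000000;
    }

    /** Compares two encodings byte by byte, treating each byte as unsigned. */
    static int compareBytes(byte[] x, byte[] y) {
        for (int i = 0; i < x.length && i < y.length; i++) {
            int c = Integer.compare(x[i] & 0xFF, y[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(x.length, y.length);
    }
}
```

With such an encoding, keys can be sorted by comparing raw bytes, without deserializing them first.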
