1
Parquet data format &
Impala overview
2
Agenda
• Objective
• Various data formats
• Use case
• Parquet
• Impala
3
Objective
• Twofold:
• Quest for a more performant data format than
Avro for nested data
• Understand and test new data formats in general
4
Hadoop data formats
• Sequence file. Stores key-value pairs of data in a flat binary file. Rows are stored as values.
• ORC. Stores column-oriented data. Adds RLE and dictionary encoding, statistics, and single-file output. Bloom filters are to be added.
• Avro. Data serialization framework: a serialization format and exchange service, usable from any language. Data is accompanied by its schema (in JSON). Supports schema evolution.
5
Parquet
• Columnar storage
• Automatic dictionary encoding and run-length
encoding. Separation of encoding vs compression.
• Run-length encoding: replaces sequences ("runs")
of consecutive repeated characters (or other units
of data) with a single character and the length of
the run.
• Dictionary encoding: takes the different values present in a column and represents each one in a compact 2-byte form, rather than the original value, which could be several bytes.
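The two encodings described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Parquet's actual on-disk layout: the function names and the sample column are made up for the example, and real Parquet packs codes and run lengths into a binary format.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of consecutive equal values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Replace each value by a small integer code; return (dictionary, codes)."""
    dictionary = []  # code -> original value, in order of first appearance
    index = {}       # original value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


# Hypothetical column of country names, not data from the slides.
column = ["Germany", "Germany", "Germany", "France", "France", "Germany"]

dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)  # ['Germany', 'France']
print(codes)       # [0, 0, 0, 1, 1, 0]
print(runs)        # [(0, 3), (1, 2), (0, 1)]
```

Applying run-length encoding to the dictionary codes, as done here, shows why the two combine well on columns with few distinct values and long repeats.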
6
Parquet
• Parquet can handle multiple schemas and supports schema evolution.
• LogType A : organizationId, userId, timestamp,
recordId, cpuTime
• LogType V : userId, organizationId, timestamp,
foo, bar
• Can be used by any project in the Hadoop
ecosystem. Integrations provided for M/R, Pig,
Hive, Cascading and Impala.
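One way to picture reading the two log types under a single schema is sketched below. It is plain Python, not the Parquet API: `merge_schemas` and `project` are hypothetical helpers, the record values are invented, and the assumption is that a reader using the merged schema sees a missing column as null.

```python
def merge_schemas(*schemas):
    """Union of column names, keeping first-seen order."""
    merged = []
    for schema in schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged


def project(record, schema):
    """Read a record under a schema; columns the record lacks come back as None."""
    return {column: record.get(column) for column in schema}


# Column lists taken from the slide.
log_type_a = ["organizationId", "userId", "timestamp", "recordId", "cpuTime"]
log_type_v = ["userId", "organizationId", "timestamp", "foo", "bar"]

merged = merge_schemas(log_type_a, log_type_v)

# Hypothetical LogType A record, for illustration only.
record_a = {
    "organizationId": "org1",
    "userId": "u1",
    "timestamp": 1,
    "recordId": "r1",
    "cpuTime": 12,
}
row = project(record_a, merged)

print(merged)
print(row["cpuTime"], row["foo"])  # 12 None
```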
7
Parquet
• SELECT vs INSERT.
• Parquet tables require relatively little memory to
query, because a query reads and decompresses data
in 8MB chunks.
• Inserting into a Parquet table is a more memory-
intensive operation because the data for each data file
(with a maximum size of 1GB) is stored in memory
until encoded, compressed, and written to disk.
8
Parquet
• Memory issues (heap space errors) resolved by:
• Reducing parquet.block.size. The block size is the size of a row group being buffered in memory; its default value is 256 MB.
• The total memory allocated was around 1 GB.
• With multiple Hive partitions, multiple buffers were created (one for writing into each partition).
• So writing data with Parquet will always have a high memory requirement.
• Hive's DISTRIBUTE BY was the workaround for the memory issues!
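The arithmetic behind these bullets can be sketched as follows. It rests on a simplifying assumption: a writer holds exactly one full row-group buffer per partition it is writing to, and nothing else. Real writers carry extra overhead, so treat the numbers as a rough lower bound. The 64 MB value is a hypothetical reduced setting, not one reported on the slides.

```python
MB = 1024 * 1024


def write_buffer_bytes(block_size_bytes, open_partitions):
    """Rough lower bound: one row-group buffer per partition being written."""
    return block_size_bytes * open_partitions


default_block = 256 * MB  # default quoted on the slide
reduced_block = 64 * MB   # hypothetical reduced parquet.block.size

for partitions in (1, 4, 16):
    before = write_buffer_bytes(default_block, partitions) // MB
    after = write_buffer_bytes(reduced_block, partitions) // MB
    print(f"{partitions:>2} partitions: {before} MB at 256 MB blocks, {after} MB at 64 MB blocks")
```

Under this model there are two levers: shrink the block size, or reduce how many partitions a single writer has open at once. DISTRIBUTE BY on the partition column helps with the second, since Hive then sends all rows with the same key to the same reducer.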
9
Parquet vs other formats
Performance test with 100 GB of data over multiple queries
Parquet wins
10
Impala overview
• MPP implementation of a query engine
• Impala vs Hive: Impala runs SQL queries for interactive, exploratory analytics on large data sets; Hive runs as batch.
• Does not use M/R, but does use HDFS
• Not CEP; closer to an RDBMS.
• Impala uses the same metadata store as Hive to
record information about table structure and
properties
11
Impala overview
• Can create a table in Hive, and use it in Impala
• E.g. Impala can't create an Avro table itself, but can use one created in Hive
• Language is a mix of SQL and HiveQL
• Requires a lot of memory (128 GB min. per node)
• Initial load of data via REFRESH can take a lot of time
• REFRESH loads the block location data for newly added data files
12
Impala overview
• Shortcomings
• Impala doesn't support nested types at this point (version 1.2.3). A table is usable only as long as it contains only Impala-compatible data types; it cannot contain nested types such as array, map, or struct.
• Impala currently does not "spill to disk" if intermediate results being processed on a node exceed the memory reserved for Impala on that node.
• No custom Serializer/Deserializer classes (SerDes)
• Impala cancels a running query if any host on which that query is executing fails
13
Impala overview
• Example. There are 3 ways to create a Parquet table in Impala:
• -> Use a Parquet table created in Hive (with no nested data types).
• -> Create a normal text table in Impala and load it with data, then create a Parquet table with the same structure:
• IMPALA> create table parquet_table_name LIKE text_table_name STORED AS PARQUET LOCATION '/user/hdfs/..';
• -> Create a Parquet-format table and then insert into it from the normal text table:
• IMPALA> insert overwrite table parquet_table_name select * from text_table_name;
14
Use Case
• Can't query our Avro table in Impala because it has nested columns.
• An Avro table created through Hive can be used in Impala as long as it contains only Impala-compatible data types.
• (It cannot contain nested types such as array, map, or struct.)
15
Use Case
• How to deal with nested XML data in Hadoop?
• There is no direct mapping from XML to Avro. The process goes:
• Parse XML and convert to Avro: parse the XML using XMLStreamReader, perform JAXB unmarshalling, and create Avro records from the JAXB objects. A Java class has to be written for this.
• Tried using Parquet/Avro:
• Tested: process the XML by first converting it into Avro and then storing it in Parquet format using the parquet-avro APIs.
• The problem is that the schema provided has some arrays that are a union of the types string and null.
• Currently AvroSchemaConverter is not able to handle such an Avro schema and throws an exception.
• Tested: Impala 1.2.3 on CDH 4.5
• Impala doesn't support nested types at this point
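The schema shape that caused the problem is not shown on the slide. One reading of "arrays which is union of type string and null" is an array whose items are the union ["null", "string"]. The sketch below builds such a schema as a Python dict and scans for that shape. The record name, field names, and the helper `has_nullable_array_items` are all hypothetical; it does not call parquet-avro or reproduce the actual exception.

```python
import json

# Hypothetical record and field names; only the "array whose items are a
# union of null and string" shape is taken from the slide.
schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "id", "type": "string"},
        {
            "name": "tags",
            "type": {"type": "array", "items": ["null", "string"]},
        },
    ],
}


def has_nullable_array_items(avro_type):
    """Return True if the type contains an array whose items are a union including null."""
    if isinstance(avro_type, list):  # a union: check every branch
        return any(has_nullable_array_items(branch) for branch in avro_type)
    if isinstance(avro_type, dict):
        kind = avro_type.get("type")
        if kind == "array":
            items = avro_type["items"]
            if isinstance(items, list) and "null" in items:
                return True
            return has_nullable_array_items(items)
        if kind == "record":
            return any(has_nullable_array_items(f["type"]) for f in avro_type["fields"])
        if kind == "map":
            return has_nullable_array_items(avro_type["values"])
    return False


print(json.dumps(schema, indent=2))
print(has_nullable_array_items(schema))  # True
```

A check like this could be run over a generated schema before handing it to a converter, to find the fields that need reshaping.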
16
Thank you

Editor's notes

  1. Also splittable.