ORC Files


Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. The upcoming Hive 0.11 will add a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run-length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that a query reading only one column reads only the bytes it requires. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that cannot match the query’s filter. Finally, ORC works together with the upcoming query vectorization work, providing a high-bandwidth reader/writer interface.
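To make the min/max index idea concrete, here is a minimal Python sketch of how a reader might use per-group statistics to skip rows under a pushed-down filter. It is an illustration only: the names `RowGroupStats`, `build_index`, and `scan_greater_than` are hypothetical and are not part of Hive or ORC. Only the 10,000-row granularity comes from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

ROWS_PER_GROUP = 10_000  # index granularity stated in the abstract


@dataclass
class RowGroupStats:
    """Hypothetical index entry: statistics for one group of rows."""
    start: int    # position of the first row in the group
    count: int    # number of rows in the group
    minimum: int  # smallest value in the group
    maximum: int  # largest value in the group


def build_index(values: List[int],
                rows_per_group: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Record min and max for every group of rows_per_group values."""
    index = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        index.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk)))
    return index


def scan_greater_than(values: List[int], index: List[RowGroupStats],
                      threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of groups skipped."""
    matches: List[int] = []
    skipped = 0
    for group in index:
        if group.maximum <= threshold:
            # No row in this group can satisfy the filter, so skip it unread.
            skipped += 1
            continue
        chunk = values[group.start:group.start + group.count]
        matches.extend(v for v in chunk if v > threshold)
    return matches, skipped


values = list(range(50_000))  # sorted data: five groups of 10,000 rows
index = build_index(values)
matches, skipped = scan_greater_than(values, index, 44_999)
print(f"skipped {skipped} of {len(index)} groups, matched {len(matches)} rows")
```

The benefit depends on how the data is ordered: on sorted or clustered columns most groups can be ruled out, while on randomly distributed values nearly every group's range will overlap the filter.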

© Hortonworks Inc. 2012
ORC Files
June 2013
Page 1
Owen O’Malley
owen@hortonworks.com
@owen_omalley

Who Am I?
Page 2

History
Page 3

Remaining Challenges
Page 4

Requirements
Page 5

File Structure
Page 6

Stripe Structure
Page 7

File Layout
Page 8
[Diagram: an ORC file is a sequence of 256 MB stripes followed by the File Footer and the Postscript. Each stripe holds Index Data, Row Data, and a Stripe Footer. Within a stripe, the Index Data and the Row Data are each divided by column (Column 1 through Column 8), and a column's data is divided into streams (Stream 2.1 through Stream 2.4 for Column 2).]

Compression
Page 9

Integer Column Serialization
Page 10
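The abstract lists run-length encoding and delta encoding among ORC's lightweight techniques. The sketch below shows both ideas in generic form on a list of integers. It is not ORC's actual on-disk integer encoding, which is more elaborate, and all function names are hypothetical.

```python
from typing import List, Tuple


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original list."""
    out: List[int] = []
    for value, length in runs:
        out.extend([value] * length)
    return out


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Rebuild the original values by accumulating the differences."""
    out: List[int] = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out


# Evenly spaced values have constant deltas, which then collapse into one run.
timestamps = [1000, 1010, 1020, 1030, 1040]
deltas = delta_encode(timestamps)
runs = rle_encode(deltas)
print(deltas, runs)
```

Combining the two is what makes regularly increasing columns, such as sequence numbers or evenly spaced timestamps, compress well: the deltas are constant, so the run-length step reduces them to a single pair.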

String Column Serialization
Page 11
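Dictionary encoding, which the abstract names as one of the lightweight techniques, replaces each string with a small integer that indexes a table of distinct values. The following is a generic sketch under that description, not ORC's actual stream layout. The function names are hypothetical, and sorting the dictionary is a choice made here so the output is deterministic.

```python
from typing import List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary, ids) where ids[i] indexes values[i] in dictionary."""
    dictionary = sorted(set(values))
    position = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [position[s] for s in values]


def dictionary_decode(dictionary: List[str], ids: List[int]) -> List[str]:
    """Look each id up in the dictionary to recover the original strings."""
    return [dictionary[i] for i in ids]


column = ["nyc", "sf", "nyc", "la", "sf", "nyc"]
dictionary, ids = dictionary_encode(column)
print(dictionary, ids)
```

The saving grows with repetition: each distinct string is stored once, and the integer ids can themselves be compressed with techniques such as bit packing or run-length encoding.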

Hive Compound Types
Page 12
[Diagram: type tree with column IDs assigned in pre-order]
0: Struct
  1: Int
  2: Map
    3: String
    4: Struct
      5: String
      6: Double
  7: Time
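The numbering on this slide (0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time) is what a pre-order walk of a nested type tree produces. The sketch below reproduces it. The nesting of the example schema (a map from string to a struct of string and double) is an assumption, and `TypeNode` and `assign_column_ids` are hypothetical names, not Hive or ORC classes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TypeNode:
    """Hypothetical node in a type tree: a kind plus its child types."""
    kind: str
    children: List["TypeNode"] = field(default_factory=list)


def assign_column_ids(root: TypeNode) -> List[Tuple[int, str]]:
    """Number every node in pre-order: parent first, then children left to right."""
    result: List[Tuple[int, str]] = []

    def visit(node: TypeNode) -> None:
        result.append((len(result), node.kind))
        for child in node.children:
            visit(child)

    visit(root)
    return result


# Assumed shape: struct<int, map<string, struct<string, double>>, time>
schema = TypeNode("struct", [
    TypeNode("int"),
    TypeNode("map", [
        TypeNode("string"),
        TypeNode("struct", [TypeNode("string"), TypeNode("double")]),
    ]),
    TypeNode("time"),
])

ids = assign_column_ids(schema)
for column_id, kind in ids:
    print(column_id, kind)
```

Because a parent is always numbered before its descendants, each subtree occupies a contiguous range of IDs, so selecting a compound column amounts to selecting a range.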

Compound Type Serialization
Page 13

Generic Compression
Page 14

Column Projection
Page 15
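The abstract describes projection as reading only the bytes of the requested columns. A toy sketch of that idea follows: each column is laid out contiguously and a footer records where it lives, so a reader can fetch one column without touching the rest. The layout, the fixed 8-byte integer encoding, and the names `write_columns` and `read_projection` are all assumptions made for illustration. This is not the ORC format.

```python
import struct
from typing import Dict, List, Tuple

Footer = Dict[str, Tuple[int, int]]  # column name -> (byte offset, byte length)


def write_columns(columns: Dict[str, List[int]]) -> Tuple[bytes, Footer]:
    """Store each column contiguously as little-endian 64-bit integers."""
    data = bytearray()
    footer: Footer = {}
    for name, values in columns.items():
        encoded = struct.pack(f"<{len(values)}q", *values)
        footer[name] = (len(data), len(encoded))
        data += encoded
    return bytes(data), footer


def read_projection(data: bytes, footer: Footer,
                    wanted: List[str]) -> Tuple[Dict[str, List[int]], int]:
    """Decode only the wanted columns; also report how many bytes were read."""
    result: Dict[str, List[int]] = {}
    bytes_read = 0
    for name in wanted:
        offset, length = footer[name]
        raw = data[offset:offset + length]  # the only bytes this column needs
        bytes_read += length
        result[name] = list(struct.unpack(f"<{length // 8}q", raw))
    return result, bytes_read


table = {"id": [1, 2, 3, 4], "price": [10, 20, 30, 40], "qty": [5, 6, 7, 8]}
data, footer = write_columns(table)
projected, bytes_read = read_projection(data, footer, ["price"])
print(projected, f"read {bytes_read} of {len(data)} bytes")
```

In a row-oriented layout the same query would have to scan every row in full to extract one field. Here the cost is proportional to the width of the selected columns only.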

How Do You Use ORC
Page 16

Managing Memory
Page 17

Pavan’s Trick
Page 18

Looking at ORC File Structures
Page 19

Looking at ORC File Structures
Page 20

TPC-DS File Sizes
Page 21

TPC-DS Query Performance
Page 22

Additional Details
Page 23

Current work
Page 24

Vectorization
Page 25

Vectorization Preliminary Results
Page 26

Future Work
Page 27

Thanks!
Page 28

Comparison
Page 29
Feature                      RC File  Trevni  Parquet  ORC File
Hive Type Model              N        N       N        Y
Separate complex columns     N        Y       Y        Y
Splits found quickly         N        Y       Y        Y
Default column group size    4MB      64MB*   64MB*    256MB
Files per bucket             1        >1      1*       1
Store min, max, sum, count   N        N       N        Y
Versioned metadata           N        Y       Y        Y
Run length data encoding     N        N       Y        Y
Store strings in dictionary  N        N       N        Y
Store row count              N        Y       N        Y
Skip compressed blocks       N        N       N        Y
Store internal indexes       N        N       N        Y

More Related Content

What's hot

Internal Hive

Recruit Technologies

An overview of Neo4j Internals

An overview of Neo4j Internals

An overview of Neo4j Internals

Tobias Lindaaker

Eric Hanson and I gave this presentation at Hadoop Summit 2013: Hive’s RCFile has been the standard format for storing Hive data for the last 3 years. However, RCFile has limitations because it treats each column as a binary blob without semantics. Hive 0.11 added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding — resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, LZO, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query. Columnar storage formats like ORC reduce I/O and storage use, but it’s just as important to reduce CPU usage. A technical breakthrough called vectorized query execution works nicely with column store formats to do this. Vectorized query execution has proven to give dramatic performance speedups, on the order of 10X to 100X, for structured data processing. We describe how we’re adding vectorized query execution to Hive, coupling it with ORC with a vectorized iterator.

ORC File and Vectorization - Hadoop Summit 2013

ORC File and Vectorization - Hadoop Summit 2013

ORC File and Vectorization - Hadoop Summit 2013

ORC File Introduction

ORC File Introduction

ORC File Introduction

Integrating Apache Spark and NiFi for Data Lakes

Integrating Apache Spark and NiFi for Data Lakes

Integrating Apache Spark and NiFi for Data Lakes

DataWorks Summit/Hadoop Summit

Local Secondary Indexes in Apache Phoenix

Local Secondary Indexes in Apache Phoenix

Local Secondary Indexes in Apache Phoenix

Rajeshbabu Chintaguntla

The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are: * reading all of the columns * reading a few of the columns * filtering using a filter predicate * writing the data Furthermore, different kinds of data have distinct properties. We've used three real schemas: * the NYC taxi data http://tinyurl.com/nyc-taxi-analysis * the Github access logs http://githubarchive.org * a typical sales fact table with generated data Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache. Speaker Owen O'Malley, Co-founder & Technical Fellow, Hortonworks

Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet

Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet

Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet

DataWorks Summit

ORC 2015

Big Data Business Wins: Real-time Inventory Tracking with Hadoop

Big Data Business Wins: Real-time Inventory Tracking with Hadoop

Big Data Business Wins: Real-time Inventory Tracking with Hadoop

DataWorks Summit

With an ever increasing need to secure and limit access to sensitive data, enterprises today need an open source solution. Apache Atlas - which is the metadata and governance framework for Hadoop joins hands with Apache Ranger - security enforcement framework for Hadoop to address the need for compliance and security. Vimal will discuss the security and compliance requirements and demonstrate how the combination of Atlas and Ranger solves the problem. Vimal will focus on Tag based policy enforcement which is an elegant solution for large Hadoop clusters with wide variety of data

Tag based policies using Apache Atlas and Ranger

Tag based policies using Apache Atlas and Ranger

Tag based policies using Apache Atlas and Ranger

This Hadoop will help you understand the different tools present in the Hadoop ecosystem. This Hadoop video will take you through an overview of the important tools of Hadoop ecosystem which include Hadoop HDFS, Hadoop Pig, Hadoop Yarn, Hadoop Hive, Apache Spark, Mahout, Apache Kafka, Storm, Sqoop, Apache Ranger, Oozie and also discuss the architecture of these tools. It will cover the different tasks of Hadoop such as data storage, data processing, cluster resource management, data ingestion, machine learning, streaming and more. Now, let us get started and understand each of these tools in detail. Below topics are explained in this Hadoop ecosystem presentation: 1. What is Hadoop ecosystem? 1. Pig (Scripting) 2. Hive (SQL queries) 3. Apache Spark (Real-time data analysis) 4. Mahout (Machine learning) 5. Apache Ambari (Management and monitoring) 6. Kafka & Storm 7. Apache Ranger & Apache Knox (Security) 8. Oozie (Workflow system) 9. Hadoop MapReduce (Data processing) 10. Hadoop Yarn (Cluster resource management) 11. Hadoop HDFS (Data storage) 12. Sqoop & Flume (Data collection and ingestion) What is this Big Data Hadoop training course about? The Big Data Hadoop and Spark developer course have been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab. What are the course objectives? This course will enable you to: 1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark 2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management 3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts 4. Get an overview of Sqoop and Flume and describe how to ingest data using them 5. 
Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning 6. Understand different types of file formats, Avro Schema, using Arvo with Hive, and Sqoop and Schema evolution 7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations 8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS 9. Gain a working knowledge of Pig and its components 10. Do functional programming in Spark 11. Understand resilient distribution datasets (RDD) in detail 12. Implement and build Spark applications 13. Learn Spark SQL, creating, transforming, and querying Data frames 14. Understand the common use-cases of Spark and the various interactive algorithms Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.

Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...

Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...

Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Data Lakes have been built with a desire to democratize data - to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve it is hiding the complexity of underlying data structures and physical data storage from users. The de-facto standard has been the Hive table format addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg. Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS. Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg. You will learn: • The issues that arise when using the Hive table format at scale, and why we need a new table format • How a straightforward, elegant change in table format structure has enormous positive effects • The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it • The resulting benefits of this architectural design

Apache Iceberg: An Architectural Look Under the Covers

Apache Iceberg: An Architectural Look Under the Covers

Apache Iceberg: An Architectural Look Under the Covers

Hive tuning

Introduction SQL Analytics on Lakehouse Architecture

Introduction SQL Analytics on Lakehouse Architecture

Introduction SQL Analytics on Lakehouse Architecture

Data in Hadoop is getting bigger every day, consumers of the data are growing, organizations are now looking at making their Hadoop cluster compliant to federal regulations and commercial demands. Apache Ranger simplifies the management of security policies across all components in Hadoop. Ranger provides granular access controls to data. The deck describes what security tools are available in Hadoop and their purpose then it moves on to discuss in detail Apache Ranger.

Apache Ranger

Introduction to Impala

Introduction to Impala

Introduction to Impala

Spark SQL

Apache Iceberg - A Table Format for Hige Analytic Datasets

Apache Iceberg - A Table Format for Hige Analytic Datasets

Apache Iceberg - A Table Format for Hige Analytic Datasets

Apache Spark 2.3, released on February 2018, is the fourth release in 2.x line and has a lot of new improvements. One of the notable improvements is ORC support. Apache Spark 2.3 adds a native ORC file format implementation by using the latest Apache ORC 1.4.1. Users can switch between “native” and “hive” ORC file formats. Hive ORC file format is the existing one until Spark 2.2. In this talk, I'll talk about three key changes. First of all, performance. New native ORC implementation is faster 2x - 11x times on 10TB TPCDS benchmark. Vectorized query execution over ORC files improves Spark ORC query execution greatly. Especially, ORC filter pushdown can be faster than Parquet due to in-file indexes. Second, as a part of native ORC support, Spark 2.3 can convert the Hive ORC tables into Spark ORC data sources automatically. This solves several existing ORC issues and Spark 2.4 will enable it by default. Last, but not least, Spark 2.3 officially supports structural streaming over ORC data sources. You can create a streaming dataset over ORC files. Speaker Dongjoon Hyun, Staff Software Engineer, Hortonworks

ORC improvement in Apache Spark 2.3

ORC improvement in Apache Spark 2.3

ORC improvement in Apache Spark 2.3

DataWorks Summit

What's hot (20)

Internal Hive

An overview of Neo4j Internals

An overview of Neo4j Internals

An overview of Neo4j Internals

ORC File and Vectorization - Hadoop Summit 2013

ORC File and Vectorization - Hadoop Summit 2013

ORC File and Vectorization - Hadoop Summit 2013

ORC File Introduction

ORC File Introduction

ORC File Introduction

Integrating Apache Spark and NiFi for Data Lakes

Integrating Apache Spark and NiFi for Data Lakes

Integrating Apache Spark and NiFi for Data Lakes

Local Secondary Indexes in Apache Phoenix

Local Secondary Indexes in Apache Phoenix

Local Secondary Indexes in Apache Phoenix

Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet

Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet

Big Data Storage - Comparing Speed and Features for Avro, JSON, ORC, and Parquet

ORC 2015

Big Data Business Wins: Real-time Inventory Tracking with Hadoop

Big Data Business Wins: Real-time Inventory Tracking with Hadoop

Big Data Business Wins: Real-time Inventory Tracking with Hadoop

Tag based policies using Apache Atlas and Ranger

Tag based policies using Apache Atlas and Ranger

Tag based policies using Apache Atlas and Ranger

Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...

Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...

Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg Presentation for the St. Louis Big Data IDEA

Apache Iceberg: An Architectural Look Under the Covers

Apache Iceberg: An Architectural Look Under the Covers

Apache Iceberg: An Architectural Look Under the Covers

Hive tuning

Introduction SQL Analytics on Lakehouse Architecture

Introduction SQL Analytics on Lakehouse Architecture

Introduction SQL Analytics on Lakehouse Architecture

Apache Ranger

Introduction to Impala

Introduction to Impala

Introduction to Impala

Spark SQL

Apache Iceberg - A Table Format for Hige Analytic Datasets

Apache Iceberg - A Table Format for Hige Analytic Datasets

Apache Iceberg - A Table Format for Hige Analytic Datasets

ORC improvement in Apache Spark 2.3

ORC improvement in Apache Spark 2.3

ORC improvement in Apache Spark 2.3

Similar to ORC Files

Using Apache Hive with High Performance

Using Apache Hive with High Performance

Using Apache Hive with High Performance

Inderaj (Raj) Bains

Optimizing Hive Queries

Optimizing Hive Queries

Optimizing Hive Queries

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.

Optimizing Hive Queries

Optimizing Hive Queries

Optimizing Hive Queries

DataWorks Summit

ORC: 2015 Faster, Better, Smaller

ORC: 2015 Faster, Better, Smaller

ORC: 2015 Faster, Better, Smaller

DataWorks Summit

Speaker: John Randolph, Sr. Software Developer, Gexa Energy Level: 100 (Beginner) Track: Developer Gexa has implemented several applications using MongoDB as a document repository storing multiple types of files (PDF, XLS, CSV, etc.). This entry level session is intended to share what we’ve learned in developing and deploying our first applications in an on premise, Microsoft environment. We’ll provide architectural and development information about what we’ve done. The focus is to help get your projects up-to-speed more quickly. This will be useful to teams moving from pilot to production and for developers getting started with the .Net MongoDB drivers. Plenty of code samples will be shown. We’ll discuss our successful engagement with MongoDB Consulting to help us design and deploy a high-quality production environment. What You Will Learn: - Ideas how to store and retrieve documents of different sizes, types, and volumes. We’ll describe the storage, partitioning and indexing techniques used that provide sub-second retrieval from collections with over 100 million records. - The issues addressed moving to production, including: backup, disaster recovery, SSL, using replica sets, implementing authorization and authentication, changing default setting, and creating a full path-to-production set of environments. - A successful pattern for building applications with .Net, providing teams some ideas to jump-start their development along with tips and tricks for using the .Net drivers.

Getting Started with MongoDB Using the Microsoft Stack

Getting Started with MongoDB Using the Microsoft Stack

Getting Started with MongoDB Using the Microsoft Stack

This presentation was given at the Strata + Hadoop World, 2015 in San Jose. Apache Hive is the most popular and most widely used SQL solution for Hadoop. To keep pace with Hadoop’s increasingly vital role in the Enterprise, Hive has transformed from a batch-only, high-latency system into a modern SQL engine capable of both batch and interactive queries over large datasets. Hive’s momentum is accelerating: With Spark integration and a shift to in-memory processing on the horizon, Hive continues to expand the boundaries of Big Data. In this talk the speakers examined Hive performance, past, present and future. In particular they looked at Hive’s origins as a petabyte scale SQL engine. Through some numbers and graphs, they showed how Hive became 100x faster by moving beyond MapReduce, by vectorizing execution and by introducing a cost-based optimizer. They detailed and discussed the challenges of scalable SQL on Hadoop. The looked into Hive’s sub-second future, powered by LLAP and Hive on Spark. And showed just how fast Hive on Spark really is.

Hive on spark is blazing fast or is it final

Hive on spark is blazing fast or is it final

Hive on spark is blazing fast or is it final

MOUG17 Keynote: Oracle OpenWorld Major Announcements

MOUG17 Keynote: Oracle OpenWorld Major Announcements

MOUG17 Keynote: Oracle OpenWorld Major Announcements

Data lake – On Premise VS Cloud

Data lake – On Premise VS Cloud

Data lake – On Premise VS Cloud

SQL in the Hybrid World

SQL in the Hybrid World

SQL in the Hybrid World

Hadoop, being a disruptive data processing framework, has made a large impact in the data ecosystems of today. Enabling business users to translate existing skills to Hadoop is necessary to encourage the adoption and allow businesses to get value out of their Hadoop investment quickly. R, being a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. With the advent of technologies such as RHadoop, optimizing R workloads for use on Hadoop has become much easier. This session will help you understand how RHadoop projects such as RMR, and RHDFS work with Hadoop, and will show you examples of using these technologies on the Hortonworks Data Platform.

Enabling R on Hadoop

Enabling R on Hadoop

Enabling R on Hadoop

DataWorks Summit

Cuando busca alternativas a Oracle en la nube, hacer el cambio puede parecer un trabajo duro. Entendemos que la migración involucra más que solo la base de datos. La compatibilidad es un punto clave, especialmente cuando se consideran los recursos que posiblemente ya haya invertido en Oracle, como por ejemplo el código de aplicación específico de Oracle.Este seminario web explorará las opciones y las principales consideraciones al pasar de las bases de datos de Oracle a la nube. - Revisión detallada de las ofertas de bases de datos disponibles en la nube - Factores críticos que se deben considerar considerar para elegir la oferta en la nube más adecuada - Cómo la experiencia de EDB con PostgreSQL puede ayudarlo en su decisión - Demostración de BigAnimal de EDB Présentateur: Sergio Romera, Senior Sales Engineer EMEA, EDB ------------------------------------------------------------ For more #webinars, visit http://bit.ly/EDB-Webinars Download free #PostgreSQL whitepapers: http://bit.ly/EDB-Whitepapers Read our #Postgres Blog http://bit.ly/EDB-Blogs Follow us on Facebook at http://bit.ly/EDB-FB Follow us on Twitter at http://bit.ly/EDB-Twitter Follow us on LinkedIn at http://bit.ly/EDB-LinkedIn Reach us via email at marketing@enterprisedb.com

Migre sus bases de datos Oracle a la nube

Migre sus bases de datos Oracle a la nube

Migre sus bases de datos Oracle a la nube

ORC 2015: Faster, Better, Smaller

ORC 2015: Faster, Better, Smaller

ORC 2015: Faster, Better, Smaller

The Apache Software Foundation

This topic describes the use of Spark and SequoiaDB in the Operational Data Lake of China’s financial industry, including how to use SequoiaDB to provide online high concurrent services and how to use Spark for data processing and machine learning. China has the world’s largest population, and also the world’s second largest economy. Many of the best technologies used in the United States and Europe are difficult to play effectively in China. This topic will show you how Spark and SequoiaDB are able to provide online financial services to billions of population.

Building Operational Data Lake using Spark and SequoiaDB with Yang Peng

Building Operational Data Lake using Spark and SequoiaDB with Yang Peng

Building Operational Data Lake using Spark and SequoiaDB with Yang Peng

Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)

Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)

Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)

The talk will be about the project to find a replacement for all IBM products in the company with the example for the databases. What was the goal of the project, the learning, a short overview about the options we migrated about 500 db2 databases to EnterpriseDB. The database size was from a small size up to 4 TB and we implemented a completely new fully automated deployment of VM and database. Databases are now 11 month in production. The talk will have an overview of the project, the learnings, a few parameters and technical parameters that were found for stability and performance.

Migration DB2 to EDB - Project Experience

Migration DB2 to EDB - Project Experience

Migration DB2 to EDB - Project Experience

LA HUG - Agile Analytics Applications on HDP

LA HUG - Agile Analytics Applications on HDP

LA HUG - Agile Analytics Applications on HDP

Things learned from OpenWorld 2013

Things learned from OpenWorld 2013

Things learned from OpenWorld 2013

Connor McDonald

Whats new in Oracle Database 12c release 12.1.0.2

Whats new in Oracle Database 12c release 12.1.0.2

Whats new in Oracle Database 12c release 12.1.0.2

Connor McDonald

Apache Hive is a rapidly evolving project, many people are loved by the big data ecosystem. Hive continues to expand support for analytics, reporting, and bilateral queries, and the community is striving to improve support along with many other aspects and use cases. In this lecture, we introduce the latest and greatest features and optimization that appeared in this project last year. This includes benchmarks covering LLAP, Apache Druid's materialized views and integration, workload management, ACID improvements, using Hive in the cloud, and performance improvements. I will also tell you a little about what you can expect in the future.

What's New in Apache Hive 3.0?

What's New in Apache Hive 3.0?

What's New in Apache Hive 3.0?

DataWorks Summit

Apache Hive is a rapidly evolving project, many people are loved by the big data ecosystem. Hive continues to expand support for analytics, reporting, and bilateral queries, and the community is striving to improve support along with many other aspects and use cases. In this lecture, we introduce the latest and greatest features and optimization that appeared in this project last year. This includes benchmarks covering LLAP, Apache Druid's materialized views and integration, workload management, ACID improvements, using Hive in the cloud, and performance improvements. I will also tell you a little about what you can expect in the future.

What's New in Apache Hive 3.0 - Tokyo

What's New in Apache Hive 3.0 - Tokyo

What's New in Apache Hive 3.0 - Tokyo

DataWorks Summit

Similar to ORC Files (20)

Using Apache Hive with High Performance

Using Apache Hive with High Performance

Using Apache Hive with High Performance

Optimizing Hive Queries

Optimizing Hive Queries

Optimizing Hive Queries

Optimizing Hive Queries

Optimizing Hive Queries

Optimizing Hive Queries

ORC: 2015 Faster, Better, Smaller

ORC: 2015 Faster, Better, Smaller

ORC: 2015 Faster, Better, Smaller

Getting Started with MongoDB Using the Microsoft Stack

Getting Started with MongoDB Using the Microsoft Stack

Getting Started with MongoDB Using the Microsoft Stack

Hive on spark is blazing fast or is it final

Hive on spark is blazing fast or is it final

Hive on spark is blazing fast or is it final

MOUG17 Keynote: Oracle OpenWorld Major Announcements

MOUG17 Keynote: Oracle OpenWorld Major Announcements

MOUG17 Keynote: Oracle OpenWorld Major Announcements

Data lake – On Premise VS Cloud

Data lake – On Premise VS Cloud

Data lake – On Premise VS Cloud

SQL in the Hybrid World

SQL in the Hybrid World

SQL in the Hybrid World

Enabling R on Hadoop

Enabling R on Hadoop

Enabling R on Hadoop

Migre sus bases de datos Oracle a la nube

Migre sus bases de datos Oracle a la nube

Migre sus bases de datos Oracle a la nube

ORC 2015: Faster, Better, Smaller

ORC 2015: Faster, Better, Smaller

ORC 2015: Faster, Better, Smaller

Building Operational Data Lake using Spark and SequoiaDB with Yang Peng

Building Operational Data Lake using Spark and SequoiaDB with Yang Peng

Building Operational Data Lake using Spark and SequoiaDB with Yang Peng

Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)

Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)

Migrating from RDBMS to MongoDB Atlas - Texas American Resources Company (TARC)

Migration DB2 to EDB - Project Experience

Migration DB2 to EDB - Project Experience

Migration DB2 to EDB - Project Experience

LA HUG - Agile Analytics Applications on HDP

LA HUG - Agile Analytics Applications on HDP

LA HUG - Agile Analytics Applications on HDP

Things learned from OpenWorld 2013

Things learned from OpenWorld 2013

Things learned from OpenWorld 2013

Whats new in Oracle Database 12c release 12.1.0.2

Whats new in Oracle Database 12c release 12.1.0.2

Whats new in Oracle Database 12c release 12.1.0.2

What's New in Apache Hive 3.0?

What's New in Apache Hive 3.0?

What's New in Apache Hive 3.0?

What's New in Apache Hive 3.0 - Tokyo

What's New in Apache Hive 3.0 - Tokyo

What's New in Apache Hive 3.0 - Tokyo

More from Owen O'Malley

Running An Apache Project: 10 Traps and How to Avoid Them

Running An Apache Project: 10 Traps and How to Avoid Them

Running An Apache Project: 10 Traps and How to Avoid Them

Big Data's Journey to ACID

Big Data's Journey to ACID

Big Data's Journey to ACID

Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads but with integrated support for finding required rows quickly. Owen O’Malley dives into the progress the Apache community made for adding fine-grained column-level encryption natively into ORC format, which also provides capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.

Protect your private data with ORC column encryption

Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads that provides optimized streaming reads, but with integrated support for finding required rows quickly. In this talk, we will outline the progress made in the Apache community for adding fine-grained column-level encryption natively into the ORC format, which will also provide capabilities to mask or redact data on write while protecting sensitive column metadata such as statistics to avoid information leakage. The column encryption capabilities will be fully compatible with Hadoop Key Management Server (KMS) and use the KMS to manage master keys, providing the additional flexibility to use and manage keys per column centrally.

Fine Grain Access Control for Big Data: ORC Column Encryption

The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:

* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate

While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data. Finally, the value of having open source benchmarks that are available to all interested parties is hugely important and all of the code is available from Apache.

Fast Access to Your Data - Avro, JSON, ORC, and Parquet

Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait. Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout that addresses the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies the portable table format and standardizes many important features, including:

* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Hive, and Presto.

Strata NYC 2018 Iceberg

The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:

* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate

While previous work has compared the size and speed from Hive, this presentation will present benchmarks from Spark including the new work that radically improves the performance of Spark on ORC. This presentation will also include tips and suggestions to optimize the performance of your application while reading and writing the data.

Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet

ORC Column Encryption

Hadoop Summit, June 2016. The landscape for storing your big data is quite complex, with several competing formats and different implementations of each format. Understanding your use of the data is critical for picking the format. Depending on your use case, the different formats perform very differently. Although you can use a hammer to drive a screw, it isn’t fast or easy to do so. The use cases that we’ve examined are:

* reading all of the columns
* reading a few of the columns
* filtering using a filter predicate
* writing the data

Furthermore, it is important to benchmark on real data rather than synthetic data. We used the GitHub logs data, freely available from http://githubarchive.org. We will make all of the benchmark code open source so that our experiments can be replicated.

File Format Benchmarks - Avro, JSON, ORC, & Parquet

From Hadoop Summit 2015, San Jose, and Apache BigData 2016, Vancouver. Hadoop has long had strong authentication via integration with Kerberos, authorization via User/Group/Other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC columnar format will also be covered.

Protecting Enterprise Data in Apache Hadoop

Hadoop has long had strong authentication via integration with Kerberos, authorization via user/group/other HDFS permissions, and auditing via the audit log. Recent developments in Hadoop have added HDFS file access control lists, pluggable encryption key provider APIs, HDFS snapshots, and HDFS encryption zones. These features combine to give important new data protection features that every company should be using to protect their data. This talk will cover what the new features are and when and how to use them in enterprise production environments. Upcoming features including columnar encryption in the ORC file format will also be covered.

Data protection2015

Structor - Automated Building of Virtual Hadoop Clusters

Hadoop Security Architecture

Adding ACID Updates to Hive

Next Generation Hadoop Operations

The next generation of Hadoop MapReduce. Arun C. Murthy presented the plans for the next generation of Apache Hadoop MapReduce. The MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility, and hardware utilization. More information and video available at: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/

Next Generation MapReduce

Bay Area HUG Feb 2011 Intro

Plugging the Holes: Security and Compatability in Hadoop

ORC Files

1. © Hortonworks Inc. 2012 ORC Files June 2013 Page 1 Owen O’Malley owen@hortonworks.com @owen_omalley

2. Who Am I?

3. History

4. Remaining Challenges

5. Requirements

6. File Structure

7. Stripe Structure

8. File Layout [Figure: ORC file layout. Labels: three 256 MB stripes, each with Index Data, Row Data, and Stripe Footer; File Footer; Postscript; Columns 1 to 8 within a stripe; Streams 2.1 to 2.4.]

9. Compression

10. Integer Column Serialization

11. String Column Serialization

12. Hive Compound Types [Figure: type tree with numbered columns. Labels in column ID order: 0 Struct, 1 Int, 2 Map, 3 String, 4 Struct, 5 String, 6 Double, 7 Time.]

13. Compound Type Serialization

14. Generic Compression

15. Column Projection

16. How Do You Use ORC

17. Managing Memory

18. Pavan’s Trick

19. Looking at ORC File Structures

20. Looking at ORC File Structures

21. TPC-DS File Sizes

22. TPC-DS Query Performance

23. Additional Details

24. Current work

25. Vectorization

26. Vectorization Preliminary Results

27. Future Work

28. Thanks!

29. Comparison

| Feature | RC File | Trevni | Parquet | ORC File |
|---|---|---|---|---|
| Hive Type Model | N | N | N | Y |
| Separate complex columns | N | Y | Y | Y |
| Splits found quickly | N | Y | Y | Y |
| Default column group size | 4MB | 64MB* | 64MB* | 256MB |
| Files per a bucket | 1 | > 1 | 1* | 1 |
| Store min, max, sum, count | N | N | N | Y |
| Versioned metadata | N | Y | Y | Y |
| Run length data encoding | N | N | Y | Y |
| Store strings in dictionary | N | N | N | Y |
| Store row count | N | Y | N | Y |
| Skip compressed blocks | N | N | N | Y |
| Store internal indexes | N | N | N | Y |
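The comparison credits ORC with storing min and max values and internal indexes, and the talk abstract says these are kept for each set of 10,000 rows so that a reader with a pushdown filter can skip row sets that cannot match. The sketch below illustrates that idea only. It is not ORC's reader or writer API: `RowGroupStats`, `build_index`, and `groups_to_read` are hypothetical names, and the column is assumed to be a plain in-memory list of integers.

```python
from dataclasses import dataclass
from typing import List, Sequence

ROW_GROUP_SIZE = 10_000  # rows per index entry, as stated in the talk abstract


@dataclass
class RowGroupStats:
    """Hypothetical index entry: statistics for one run of rows in a column."""
    start: int    # position of the first row in the group
    count: int    # number of rows in the group
    minimum: int
    maximum: int


def build_index(values: Sequence[int],
                group_size: int = ROW_GROUP_SIZE) -> List[RowGroupStats]:
    """Compute min/max statistics for each consecutive group of rows."""
    index = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        index.append(RowGroupStats(start, len(chunk), min(chunk), max(chunk)))
    return index


def groups_to_read(index: List[RowGroupStats],
                   low: int, high: int) -> List[RowGroupStats]:
    """Keep only groups whose [min, max] range overlaps low <= x <= high."""
    return [g for g in index if g.maximum >= low and g.minimum <= high]


# Usage: a sorted column of 50,000 values and the filter 12000 <= x <= 12500.
column = list(range(50_000))
index = build_index(column)
selected = groups_to_read(index, 12_000, 12_500)
print(len(selected), "of", len(index), "row groups must be read")  # 1 of 5
```

The statistics only rule groups out. Rows inside a selected group still have to be checked against the filter, and the benefit depends on how well the data is clustered on the filtered column.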