ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
Dongjoon Hyun
1.
© Hortonworks Inc. 2011–2018. All rights reserved
ORC Improvement & Roadmap in Apache Spark 2.3 and 2.4
Dongjoon Hyun
Principal Software Engineer @ Hortonworks Data Science Team
June 2018
2.
Dongjoon Hyun
• Hortonworks
  − Principal Software Engineer @ Data Science Team
• Apache Project
  − Apache REEF Project Management Committee (PMC) Member & Committer
  − Apache Spark Project Contributor
• GitHub
  − https://github.com/dongjoon-hyun
3.
HDP 2.6.5 (May 2018)
• Apache Spark
  − 2.3.0 (2018 FEB)
• Apache ORC
  − 1.4.3 (2018 FEB)
• Apache Kafka
  − 1.0.0 (2017 NOV)
4.
Apache Spark 2.3.x
Major Features
• Vectorized ORC Reader
• Structured Streaming with ORC
• Schema evolution with ORC
• PySpark Performance Enhancements with Apache Arrow and ORC
• Structured stream-stream joins
• Spark History Server V2
Experimental Features
• Spark on Kubernetes
• Data source API V2
• Streaming API V2
• Continuous Structured Streaming Processing
Spark 2.3.0 has 1409 JIRA issues and Spark 2.3.1 has 134.
6.
Spark’s built-in file-based data sources
• TEXT: the simplest one, with a single string column schema
• CSV: popular for data science workloads
• JSON: the most flexible one for schema changes
• PARQUET: the only one with a vectorized reader
• ORC: storage-efficient and popular for shared Hive tables
7.
Motivation
• TEXT: the simplest one, with a single string column schema
• CSV: popular for data science workloads
• JSON: the most flexible one for schema changes
• PARQUET: the only one with a vectorized reader
• ORC: storage-efficient and popular for shared Hive tables
Fast, Flexible, Hive Table Access
8.
The story of Spark, ORC, and Hive
• Before Apache ORC
  − Hive 1.2.1 (2015 JUN) | SPARK-2883 | Hive 1.2.1, Spark 1.4
9.
The story of Spark, ORC, and Hive – Cont.
• Before Apache ORC
  − Hive 1.2.1 (2015 JUN) | SPARK-2883 | Hive 1.2.1, Spark 1.4
• After Apache ORC
  − v1.0.0 (2016 JAN)
  − v1.3.3 (2017 FEB) | HIVE-15841 | Hive 2.3.0 ~ 2.3.3
10.
The story of Spark, ORC, and Hive – Cont.
• Before Apache ORC
  − Hive 1.2.1 (2015 JUN) | SPARK-2883 | Hive 1.2.1, Spark 1.4
• After Apache ORC
  − v1.0.0 (2016 JAN)
  − v1.3.3 (2017 FEB) | HIVE-15841 | Hive 2.3.0 ~ 2.3.3
  − v1.4.1 (2017 OCT) | SPARK-22300 | Spark 2.3.0 (FEB)
  − v1.4.3 (2018 FEB) | SPARK-23340, HIVE-18674 | Hive 3.0 (MAY)
  − v1.4.4 (2018 MAY) | SPARK-24322 | Spark 2.3.1 (JUN)
11.
The story of Spark, ORC, and Hive – Cont.
• Before Apache ORC
  − Hive 1.2.1 (2015 JUN) | SPARK-2883 | Hive 1.2.1, Spark 1.4
• After Apache ORC
  − v1.0.0 (2016 JAN)
  − v1.3.3 (2017 FEB) | HIVE-15841 | Hive 2.3.0 ~ 2.3.3
  − v1.4.1 (2017 OCT) | SPARK-22300 | Spark 2.3.0 (FEB)
  − v1.4.3 (2018 FEB) | SPARK-23340, HIVE-18674 | Hive 3.0 (MAY)
  − v1.4.4 (2018 MAY) | SPARK-24322 | Spark 2.3.1 (JUN)
  − v1.5.1 (2018 MAY) | SPARK-24576, HIVE-19669 | Hive 3.1, Spark 2.4
13.
Previous ORC Issues in Spark
14.
Six Issue Categories
• ORC Writer Versions
• Performance
• Structured streaming
• Column names
• Hive tables and schema evolution
• Robustness
15.
Category 1 – ORC Writer Versions
• ORIGINAL
• HIVE_8732 (2014): ORC string statistics are not merged correctly
• HIVE_4243 (2015): Use real column names from Hive tables
• HIVE_12055 (2015): Vectorized Writer
• HIVE_13083 (2016): Decimals write present stream correctly
• ORC_101 (2016): Correct the use of the default charset in bloomfilter
• ORC_135 (2018): PPD for timestamp is wrong when reader/writer timezones are different
16.
Category 2 – Performance
• Vectorized ORC Reader (SPARK-16060)
• Fast reading partition-columns (SPARK-22712)
• Pushing down filters for DateType (SPARK-21787)
17.
Category 3 – Structured streaming
• Write (SPARK-15474)
  − `FileNotFoundException` when writing empty partitions as ORC
• Read (SPARK-22781)
  − Create structured stream with ORC files
    spark.readStream.orc(path)
18.
Category 4 – Column names
• Unicode column names (SPARK-23072)
• Column names with dot (SPARK-21791)
• Should not create invalid column names (SPARK-21912)
19.
Category 5 – Hive tables and schema evolution
• Support `ALTER TABLE ADD COLUMNS` (SPARK-21929)
  − Introduced in Spark 2.2, but throws AnalysisException for ORC
• Support column positional mismatch (SPARK-22267)
  − Returns wrong results if the ORC file schema order differs from the Hive Metastore schema order
• Support table properties during `convertMetastoreOrc/Parquet` (SPARK-23355, Spark 2.4)
  − For ORC/Parquet Hive tables, `convertMetastore` ignores table properties
20.
Category 6 – Robustness
• ORC metadata exceeds ProtoBuf message size limit (SPARK-19109)
• NullPointerException on zero-size ORC file (SPARK-19809)
• Support `ignoreCorruptFiles` (SPARK-23049)
• Support `ignoreMissingFiles` (SPARK-23305)
21.
Current Approach
22.
Supports two ORC file formats
• Adding a new OrcFileFormat (SPARK-20682)
  − `hive`: OrcFileFormat from Hive 1.2.1
  − `native`: OrcFileFormat with ORC 1.4+
[Class diagram: FileFormat with TextBasedFileFormat, ParquetFileFormat, OrcFileFormat, HiveFileFormat, JsonFileFormat, LibSVMFileFormat, CSVFileFormat, and TextFileFormat, spread over the packages o.a.s.sql.execution.datasources, o.a.s.ml.source.libsvm, and o.a.s.sql.hive.orc]
23.
In Reality – Four cases for ORC Reader/Writer
• `native` Writer, `native` Reader: New Data, New Apps. Best performance (Vectorized Reader)
• `native` Writer, `hive` Reader: New Data, Old Apps. Improved performance (Non-vectorized Reader)
• `hive` Writer, `native` Reader: Old Data, New Apps. Improved performance (Vectorized Reader)
• `hive` Writer, `hive` Reader: Old Data, Old Apps. As-Is performance (Non-vectorized Reader)
24.
Performance – Single column scan from wide tables
[Chart: time (ms) against number of columns (100, 200, 300) for 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / native reader, native writer / hive reader, hive writer / hive reader. Annotated speedup: 4x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
25.
Switch ORC implementation (SPARK-20728)
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Dataset
    spark.read.orc(path)
    df.write.orc(path)
    spark.read.format("orc").load(path)
    df.write.format("orc").save(path)
Create ORC Table
    CREATE TABLE people (name string, age int)
    USING ORC
    OPTIONS (orc.compress 'ZLIB')
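The `spark.sql.orc.impl` switch on this slide can also be applied when a session is created. The following is a minimal sketch in Python (PySpark), not code from the talk: the helper names `orc_conf` and `build_session` and the application name are hypothetical, and `build_session` assumes PySpark is installed.

```python
from typing import Dict

# The two values the slide names for spark.sql.orc.impl.
VALID_ORC_IMPLS = ("native", "hive")


def orc_conf(impl: str = "native") -> Dict[str, str]:
    """Return the Spark SQL setting that selects the ORC implementation."""
    if impl not in VALID_ORC_IMPLS:
        raise ValueError(
            f"spark.sql.orc.impl must be one of {VALID_ORC_IMPLS}, got {impl!r}"
        )
    return {"spark.sql.orc.impl": impl}


def build_session(conf: Dict[str, str]):
    """Create (or reuse) a SparkSession with the given settings applied."""
    # Imported lazily so orc_conf() stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("orc-native-sketch")  # hypothetical name
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session from `build_session(orc_conf())`, the `spark.read.orc(path)` and `df.write.orc(path)` calls shown on the slide would then go through the `native` implementation.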
26.
Switch ORC implementation (SPARK-20728) – Cont.
• spark.sql.orc.impl=native (default: `hive`)
Read/Write Structured Stream
    spark.readStream.orc(path)
    spark.readStream.format("orc").load(path)
    df.writeStream
      .option("checkpointLocation", path1)
      .format("orc")
      .option("path", path2)
      .start()
27.
Support vectorized read on Hive ORC Tables
• spark.sql.hive.convertMetastoreOrc=true (default: false)
  − `spark.sql.orc.impl=native` is required, too.
    CREATE TABLE people (name string, age int)
    STORED AS ORC
    CREATE TABLE people (name string, age int)
    USING HIVE
    OPTIONS (fileFormat 'ORC', orc.compress 'gzip')
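The two settings on this slide only help when they are set together. A small Python sketch of that rule, using the Spark 2.3 defaults stated on the slides; the function name is hypothetical, and it encodes only these two conditions, not everything Spark checks before choosing the vectorized reader:

```python
from typing import Mapping

# Defaults as stated on the slides for Spark 2.3.
SPARK_23_DEFAULTS = {
    "spark.sql.orc.impl": "hive",
    "spark.sql.hive.convertMetastoreOrc": "false",
}


def hive_orc_table_uses_native_reader(conf: Mapping[str, str]) -> bool:
    """True only when both settings required by the slide are in place."""
    effective = {**SPARK_23_DEFAULTS, **conf}  # user settings override defaults
    convert = effective["spark.sql.hive.convertMetastoreOrc"].lower() == "true"
    native = effective["spark.sql.orc.impl"].lower() == "native"
    return convert and native
```

Setting only one of the two keys leaves the function returning `False`, which mirrors the "is required, too" note on the slide.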
28.
Schema evolution at reading file-based data sources
• Frequently, new files can have wider column types or new columns
  − Before SPARK-21929, users had to drop and recreate the ORC table with an updated schema.
• User-defined schema reduces schema inference cost and handles upcasting
  − boolean -> byte -> short -> int -> long
  − float -> double
Old Data
    spark.read.schema("col1 int").orc(path)
New Data
    spark.read.schema("col1 long, col2 long").orc(path)
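The upcasting chains on this slide can be read as a simple rule: a column may be read with the same type or with any type further along the same chain. A Python sketch of that rule, covering only the two chains listed here; the function name is hypothetical and this is not Spark's own implementation:

```python
# Upcasting chains as listed on the slide, narrowest type first.
UPCAST_CHAINS = (
    ("boolean", "byte", "short", "int", "long"),
    ("float", "double"),
)


def is_safe_upcast(file_type: str, read_type: str) -> bool:
    """True if read_type equals file_type or comes later in the same chain."""
    if file_type == read_type:
        return True
    for chain in UPCAST_CHAINS:
        if file_type in chain and read_type in chain:
            return chain.index(file_type) < chain.index(read_type)
    return False  # types from different chains, or types not on the slide
```

For the slide's example, `is_safe_upcast("int", "long")` holds, so old files written with `col1 int` can be read with the newer `col1 long` schema, while the reverse direction is rejected.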
29.
Schema evolution at reading file-based data sources – Cont.
File Format: TEXT, CSV, JSON, ORC `hive`, ORC `native` (1), PARQUET
• Add Column At The End: ✔️ ✔️ ✔️ ✔️ ✔️
• Hide Trailing Column: ✔️ ✔️ ✔️ ✔️ ✔️
• Hide Column: ✔️ ✔️ ✔️
• Change Column Type (2): ✔️ ✔️ (3) ✔️
• Change Column Position: ✔️ ✔️ ✔️
(1) Native Vectorized ORC Reader
(2) Only safe change via upcasting
(3) JSON is the most flexible for changing types
30.
Performance
31.
Micro Benchmark (Apache Spark 2.3.0)
• Target
  − Apache Spark 2.3.0
  − Apache ORC 1.4.1
• Machine
  − MacBook Pro (2015 Mid)
  − Intel® Core™ i7-4770JQ CPU @ 2.20GHz
  − Mac OS X 10.13.4
  − JDK 1.8.0_161
32.
Performance – Single column scan from wide tables
[Chart: time (ms) against number of columns (100, 200, 300) for 1M rows with all BIGINT columns. Series: native writer / native reader, hive writer / hive reader. Annotated speedup: 4x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
33.
Performance – Vectorized Read
[Chart: time (ms) to read 15M rows in a single-column table, by column type (TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE), native vs. hive. Annotated speedups: 10x, 5x, 11x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
34.
Performance – Partitioned table read
[Chart: time (ms) to read 15M rows in a partitioned table (data column, partition column, both columns), native vs. hive. Annotated speedups: 21x, 7x]
https://github.com/apache/spark/blob/branch-2.3/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
35.
Predicate Pushdown
[Chart: time (ms) for queries selecting 10%, 50%, and 90% of rows (id < value) and all rows (id IS NOT NULL), parquet vs. native, on 15M rows with 5 data columns and 1 sequential id column]
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala
36.
Demo
37.
Support Matrix
Future Roadmap
38.
Support Matrix
• Spark 2.3 and ORC 1.4 become GA in HDP 2.6.5.
HDP 2.6.3~4 | HDP 2.6.5 | HDP 3.0 EA (1)
TP for ORC on Spark | GA for ORC on Spark | Early Access
Spark 2.2 | Spark 2.3.0+ | Spark 2.3.1+
N/A | ORC 1.4.3 | ORC 1.4.3+
spark.sql.orc.enabled=true | spark.sql.orc.impl=native | spark.sql.orc.impl=native
spark.sql.orc.char.enabled=true | N/A | N/A
(1) https://hortonworks.com/info/early-access-hdp-3-0/
39.
Future Roadmap – Targeting Apache Spark 2.4 (2018 Fall)
Umbrella Issue
• Feature Parity for ORC with Parquet (SPARK-20901)
Sub issues
• Upgrade Apache ORC to 1.5.1 (SPARK-24576)
• Use `native` ORC implementation by default (SPARK-23456)
• Use ORC predicate pushdown by default (SPARK-21783)
• Use `convertMetastoreOrc` by default (SPARK-22279)
• Support table properties with `convertMetastoreOrc/Parquet` (SPARK-23355)
• Test ORC as default data source format (SPARK-23553)
• Test and support Bloom Filters (SPARK-12417)
40.
Future Roadmap – On-going work
• ORC Column-level encryption (with ORC 1.6)
• Support VectorUDT/MatrixUDT (SPARK-22320)
• Vectorized Writer with DataSource V2
• Support CHAR/VARCHAR Types
• ALTER TABLE … CHANGE column type (SPARK-18727)
41.
Summary
• Like Hive, Apache Spark 2.3 starts to take advantage of Apache ORC
  − Improved feature parity between Spark and Hive
• Native vectorized ORC reader
  − boosts Spark ORC performance
  − provides better schema evolution ability
• Structured streaming starts to work with ORC (both reader/writer)
• Spark is going to become faster and faster with ORC
42.
Reference
• https://www.slideshare.net/DongjoonHyun/orc-improvement-in-apache-spark-23, Dataworks Summit 2018 Berlin
• https://youtu.be/EL-NHiwqCSY, ORC configuration in Apache Spark 2.3
• https://youtu.be/zJZ1gtzu-rs, Apache Spark 2.3 ORC with Apache Arrow
• https://community.hortonworks.com/articles/148917/orc-improvements-for-apache-spark-22.html
• https://www.slideshare.net/Hadoop_Summit/performance-update-when-apache-orc-met-apache-spark-81023199, Dataworks Summit 2017 Sydney
• https://www.slideshare.net/Hadoop_Summit/orc-file-optimizing-your-big-data, Dataworks Summit 2017 San Jose
43.
Questions?
44.
Thank you