Suche senden
Hochladen
Pivoting Data with SparkSQL by Andrew Ray
•
18 gefällt mir
•
16,393 views
Spark Summit
Folgen
Spark Summit East Talk
Weniger lesen
Mehr lesen
Daten & Analysen
Melden
Teilen
Melden
Teilen
1 von 33
Jetzt herunterladen
Downloaden Sie, um offline zu lesen
Empfohlen
Productizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Databricks
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Databricks
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
Databricks
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
Databricks
Spark tuning
Spark tuning
GMO-Z.com Vietnam Lab Center
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
Empfohlen
Productizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Databricks
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Apache Spark Based Reliable Data Ingestion in Datalake with Gagan Agrawal
Databricks
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
Databricks
Flash for Apache Spark Shuffle with Cosco
Flash for Apache Spark Shuffle with Cosco
Databricks
Spark tuning
Spark tuning
GMO-Z.com Vietnam Lab Center
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
Databricks
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
Databricks
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
Databricks
PySpark Best Practices
PySpark Best Practices
Cloudera, Inc.
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
Delta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
Smart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at Scale
Databricks
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Databricks
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
Rds data lake @ Robinhood
Rds data lake @ Robinhood
BalajiVaradarajan13
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
Weitere ähnliche Inhalte
Was ist angesagt?
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Databricks
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
Databricks
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
Databricks
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Databricks
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
Databricks
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
Databricks
PySpark Best Practices
PySpark Best Practices
Cloudera, Inc.
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
Delta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Databricks
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
Smart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at Scale
Databricks
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Databricks
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Databricks
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks
Rds data lake @ Robinhood
Rds data lake @ Robinhood
BalajiVaradarajan13
Was ist angesagt?
(20)
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
A Deep Dive into Stateful Stream Processing in Structured Streaming with Tath...
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
Apache Spark Core—Deep Dive—Proper Optimization
Apache Spark Core—Deep Dive—Proper Optimization
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
PySpark Best Practices
PySpark Best Practices
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Delta Lake: Optimizing Merge
Delta Lake: Optimizing Merge
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Smart Join Algorithms for Fighting Skew at Scale
Smart Join Algorithms for Fighting Skew at Scale
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Performant Streaming in Production: Preventing Common Pitfalls when Productio...
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Rds data lake @ Robinhood
Rds data lake @ Robinhood
Ähnlich wie Pivoting Data with SparkSQL by Andrew Ray
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
DataWorks Summit
R Get Started II
R Get Started II
Sankhya_Analytics
RichardPughspatial.ppt
RichardPughspatial.ppt
EnnerHereniodeAlcnta
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Natalino Busa
NoSQL
NoSQL
Antonio Castellon
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Huy Nguyen
Introduction to R for data science
Introduction to R for data science
Long Nguyen
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Data Con LA
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
Databricks
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
Torsten Steinbach
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
DB Tsai
Spark DataFrames for Data Munging
Spark DataFrames for Data Munging
(Susan) Xinh Huynh
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Spark Summit
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
Daniel Chan
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
Syracuse University
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
MLconf
India software developers conference 2013 Bangalore
India software developers conference 2013 Bangalore
Satnam Singh
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
Torsten Steinbach
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon Redshift
Amazon Web Services
Ähnlich wie Pivoting Data with SparkSQL by Andrew Ray
(20)
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
Hadoop and Spark for the SAS Developer
Hadoop and Spark for the SAS Developer
R Get Started II
R Get Started II
RichardPughspatial.ppt
RichardPughspatial.ppt
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
NoSQL
NoSQL
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Introduction to R for data science
Introduction to R for data science
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Big Data Day LA 2015 - Scalable and High-Performance Analytics with Distribut...
Koalas: Making an Easy Transition from Pandas to Apache Spark
Koalas: Making an Easy Transition from Pandas to Apache Spark
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
Spark DataFrames for Data Munging
Spark DataFrames for Data Munging
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
Machine Learning: Classification Concepts (Part 1)
Machine Learning: Classification Concepts (Part 1)
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
India software developers conference 2013 Bangalore
India software developers conference 2013 Bangalore
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon Redshift
Mehr von Spark Summit
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
Mehr von Spark Summit
(20)
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Kürzlich hochgeladen
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
Cathrine Wilhelmsen
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
Susanna-Assunta Sansone
Learn How Data Science Changes Our World
Learn How Data Science Changes Our World
Eduminds Learning
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
deepakthakur548787
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Boston Institute of Analytics
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
socarem879
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
SimranPal17
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
Dr Arash Najmaei ( Phd., MBA, BSc)
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
KarteekMane1
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
Boston Institute of Analytics
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Boston Institute of Analytics
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
HimangsuNath
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Thomas Poetter
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
Seán Kennedy
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
Tasha Penwell
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
aleedritatuxx
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Boston Institute of Analytics
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
VICTOR MAESTRE RAMIREZ
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
TecnoIncentive
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
17djon017
Kürzlich hochgeladen
(20)
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
Learn How Data Science Changes Our World
Learn How Data Science Changes Our World
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
Pivoting Data with SparkSQL by Andrew Ray
1.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 1 Pivoting Data with SparkSQL Andrew Ray Senior Data Engineer Silicon Valley Data Science
2.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 2 CODE git.io/vgy34 (github.com/silicon-valley-data-science/spark-pivot-examples)
3.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 3 OUTLINE • What’s a Pivot? • Syntax • Real world examples • Tips and Tricks • Implementation • Future work git.io/vgy34
4.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 4 WHAT’S A PIVOT?
5.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 5 WHAT’S A PIVOT?
6.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 6 WHAT’S A PIVOT? Group by A, pivot on B, and sum C A B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A X Y Z G 4 2 H 4 5
7.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 7 WHAT’S A PIVOT? Group by A and B Pivot on BA B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A B C G X 4 G Y 2 H Y 4 H Z 5 A X Y Z G 4 2 H 4 5
8.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 8 WHAT’S A PIVOT? Pivot on B (w/o agg.) Group by AA B C G X 1 G Y 2 G X 3 H Y 4 H Z 5 A X Y Z G 1 G 2 G 3 H 4 H 5 A X Y Z G 4 2 H 4 5
9.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 9 SYNTAX
10.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 10 SYNTAX • Dataframe/table with columns A, B, C, and D. • How to – group by A and B – pivot on C (with distinct values “small” and “large”) – sum of D
11.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 11 SYNTAX: COMPETITION • pandas (Python) – pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=np.sum) • reshape2 (R) – dcast(df, A + B ~ C, sum) • Oracle 11g – SELECT * FROM df PIVOT (sum(D) FOR C IN ('small', 'large')) p
12.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 12 SYNTAX: SPARKSQL • Simple – df.groupBy("A", "B").pivot("C").sum("D") • Explicit pivot values – df.groupBy("A", "B") .pivot("C", Seq("small", "large")) .sum("D")
13.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 13 PIVOT • Added to DataFrame API in Spark 1.6 – Scala – Java – Python – Not R L
14.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 14 REAL WORLD EXAMPLES
15.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 15 EXAMPLE 1: REPORTING • Retail sales • TPC-DS dataset – scale factor 1 • Docker image: docker run -it svds/spark-pivot-reporting TPC Benchmark™ DS - Standard Specification, Version 2.1.0 Page 18 of 135 2.2.2.3 The implementation chosen by the test sponsor for a particular datatype definition shall be applied consistently to all the instances of that datatype definition in the schema, except for identifier columns, whose datatype may be selected to satisfy database scaling requirements. 2.2.3 NULLs If a column definition includes an ‘N’ in the NULLs column this column is populated in every row of the table for all scale factors. If the field is blank this column may contain NULLs. 2.2.4 Foreign Key If the values in this column join with another column, the foreign columns name is listed in the Foreign Key field of the column definition. 2.3 Fact Table Definitions 2.3.1 Store Sales (SS) 2.3.1.1 Store Sales ER-Diagram 2.3.1.2 Store Sales Column Definitions Each row in this table represents a single lineitem for a sale made through the store channel and recorded in the store_sales fact table. Table 2-1 Store_sales Column Definitions Column Datatype NULLs Primary Key Foreign Key ss_sold_date_sk identifier d_date_sk ss_sold_time_sk identifier t_time_sk ss_item_sk (1) identifier N Y i_item_sk ss_customer_sk identifier c_customer_sk ss_cdemo_sk identifier cd_demo_sk ss_hdemo_sk identifier hd_demo_sk ss_addr_sk identifier ca_address_sk ss_store_sk identifier s_store_sk ss_promo_sk identifier p_promo_sk
16.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 16 SALES BY CATEGORY AND QUARTER sql("""select *, concat('Q', d_qoy) as qoy from store_sales join date_dim on ss_sold_date_sk = d_date_sk join item on ss_item_sk = i_item_sk""") .groupBy("i_category") .pivot("qoy") .agg(round(sum("ss_sales_price")/1000000,2)) .show
17.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 17 SALES BY CATEGORY AND QUARTER +-----------+----+----+----+----+ | i_category| Q1| Q2| Q3| Q4| +-----------+----+----+----+----+ | Books|1.58|1.50|2.84|4.66| | Women|1.41|1.36|2.54|4.16| | Music|1.50|1.44|2.66|4.36| | Children|1.54|1.46|2.74|4.51| | Sports|1.47|1.40|2.62|4.30| | Shoes|1.51|1.48|2.68|4.46| | Jewelry|1.45|1.39|2.59|4.25| | null|0.04|0.04|0.07|0.13| |Electronics|1.56|1.49|2.77|4.57| | Home|1.57|1.51|2.79|4.60| | Men|1.60|1.54|2.86|4.71| +-----------+----+----+----+----+
18.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 18 EXAMPLE 2: FEATURE GENERATION • MovieLens 1M Dataset – ~1M movie ratings – 6040 users – 3952 movies • Predict gender based on ratings – Using 100 most popular movies
19.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 19 LOAD RATINGS val ratings_raw = sc.textFile("Downloads/ml-1m/ratings.dat") case class Rating(user: Int, movie: Int, rating: Int) val ratings = ratings_raw.map(_.split("::").map(_.toInt)).map(r => Rating(r(0),r(1),r(2))).toDF ratings.show +----+-----+------+ |user|movie|rating| +----+-----+------+ | 11| 1753| 4| | 11| 1682| 1| | 11| 216| 4| | 11| 2997| 4| | 11| 1259| 3| ...
20.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 20 LOAD USERS val users_raw = sc.textFile("Downloads/ml-1m/users.dat") case class User(user: Int, gender: String, age: Int) val users = users_raw.map(_.split("::")).map(u => User(u(0).toInt, u(1), u(2).toInt)).toDF val sample_users = users.where(expr("gender = 'F' or ( rand() * 5 < 2 )")) sample_users.groupBy("gender").count().show +------+-----+ |gender|count| +------+-----+ | F| 1709| | M| 1744| +------+-----+
21.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 21 PREP DATA val popular = ratings.groupBy("movie") .count() .orderBy($"count".desc) .limit(100) .map(_.get(0)).collect val ratings_pivot = ratings.groupBy("user") .pivot("movie", popular.toSeq) .agg(expr("coalesce(first(rating),3)").cast("double")) ratings_pivot.where($"user" === 11).show +----+----+---+----+----+---+----+---+----+----+---+... |user|2858|260|1196|1210|480|2028|589|2571|1270|593|... +----+----+---+----+----+---+----+---+----+----+---+... | 11| 5.0|3.0| 3.0| 3.0|4.0| 3.0|3.0| 3.0| 3.0|5.0|... +----+----+---+----+----+---+----+---+----+----+---+...
22.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 22 BUILD MODEL val data = ratings_pivot.join(sample_users, "user") .withColumn("label", expr("if(gender = 'M', 1, 0)").cast("double")) val assembler = new VectorAssembler() .setInputCols(popular.map(_.toString)) .setOutputCol("features") val lr = new LogisticRegression() val pipeline = new Pipeline().setStages(Array(assembler, lr)) val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345) val model = pipeline.fit(training)
23.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 23 TEST val res = model.transform(test).select("label", "prediction") res.groupBy("label").pivot("prediction", Seq(1.0, 0.0)).count().show +-----+---+---+ |label|1.0|0.0| +-----+---+---+ | 1.0|114| 74| | 0.0| 46|146| +-----+---+---+ Accuracy 68%
24.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 24 TIPS AND TRICKS
25.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 25 USAGE NOTES • Specify the distinct values of the pivot column – Otherwise it does this: val values = df.select(pivotColumn) .distinct() .sort(pivotColumn) .map(_.get(0)) .take(maxValues + 1) .toSeq
26.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 26 MULTIPLE AGGREGATIONS df.groupBy("A", "B").pivot("C").agg(sum("D"), avg("D")).show +---+---+------------+------------+------------+------------+ | A| B|small_sum(D)|small_avg(D)|large_sum(D)|large_avg(D)| +---+---+------------+------------+------------+------------+ |foo|two| 6| 3.0| null| null| |bar|two| 6| 6.0| 7| 7.0| |foo|one| 1| 1.0| 4| 2.0| |bar|one| 5| 5.0| 4| 4.0| +---+---+------------+------------+------------+------------+
27.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 27 PIVOT MULTIPLE COLUMNS • Merge columns and pivot as usual df.withColumn(“p”, concat($”p1”, $”p2”)) .groupBy(“a”, “b”) .pivot(“p”) .agg(…)
28.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 28 MAX COLUMNS • spark.sql.pivotMaxValues – Default: 10,000 – When doing a pivot without specifying values for the pivot column this is the maximum number of (distinct) values that will be collected without error.
29.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 29 IMPLEMENTATION
30.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 30 PIVOT IMPLEMENTATION • pivot is a method of GroupedData and returns GroupedData with PivotType. • New logical operator: o.a.s.sql.catalyst.plans.logical.Pivot • Analyzer rule: o.a.s.sql.catalyst.analysis.Analyzer.ResolvePivot – Currently translates logical pivot into an aggregation with lots of if statements.
31.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 31 FUTURE WORK
32.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 32 FUTURE WORK • Add to R API • Add to SQL syntax • Add support for unpivot • Faster implementation
33.
© 2016 SILICON
VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. @SVDataScience | 33 Pivoting Data with SparkSQL Andrew Ray andrew@svds.com We’re hiring! svds.com/careers THANK YOU. git.io/vgy34
Jetzt herunterladen