In the race to turn exclusive insights into multi-million-dollar business opportunities, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality: the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating legacy data infrastructure that is simply not up to the task. These challenges have created demand for Virtual Analytics, in which the complexities of disparate data and technology silos are abstracted away and coupled with a powerful range of analytics and processing horsepower in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
11. VIRTUAL ANALYTICS
The New Paradigm
Decoupled compute and storage
Uniform data management and security model
Unified analytics engine
Enterprise-wide collaboration
DATA: data warehouses, cloud storage, Hadoop storage, and many others
PEOPLE: data science, data engineering, BI analysts, and many others
12. Is Spark the Answer?
DATA: data warehouses, cloud storage, Hadoop storage, and many others
PEOPLE: data science, data engineering, BI analysts, and many others
13. Databricks + Apache Spark
Managed Cloud Platform
Integrated Workspace
Production Workflow Automation
Optimized Data Access Layer
Databricks Enterprise Security
DATA: data warehouses, cloud storage, Hadoop storage, and many others
PEOPLE: data science, data engineering, BI analysts, and many others
14. Case Study | Viacom
Grow the Viacom audience
Video quality
Real-time anomaly detection
Viewer loyalty
In every industry sector I’ve encountered, the interest in big data is stronger than ever.
Why are they so interested? They believe data is the key to transforming their businesses.
You’ve already heard of some of these examples.
Yesterday, Salesforce came on stage and talked about their plan to build their next-generation CRM product with AI – what they call Einstein. And they are using Spark.
Today, we will hear from the likes of HP – a pedigreed company built on manufacturing devices, which is now using Spark to create a service-based business model with IoT data.
Or another familiar name – McGraw Hill – who has been creating education material for decades but is now looking to Spark to revolutionize learning. They want to use behavior data from students to identify gaps in understanding and provide personalized learning approaches to achieve better outcomes.
Many of the companies we talk to aspire to leverage greater intelligence with data throughout their business, but unfortunately this is much more difficult than it seems.
The first observation is about the catalyst: the data.
Everyone knows that data is bigger and more diverse, but what people underestimate is just how inaccessible and siloed it is.
The reason the volume and variety of data are growing so fast is that there are now many more ways to generate it – it has gone beyond just web servers or enterprise resource planning systems.
Today, it’s the electronic medical records at your doctor’s office, connected sensors embedded in transformers in an electrical substation, or even more outrageously – a fusion of medical records and connected sensors in the form of fitness trackers that you wear every minute of the day.
And in every instance, new data stores are being instantiated in all corners of the business faster than you can ever imagine.
So yes, storage is a problem, but that’s not even _the_ problem.
The real problem at the enterprise level is how to catalog, organize, secure, and govern this complex federation of data.
Next, let’s talk about AI
AI is a loose collection of many different algorithms that allow machines to make predictions or decisions.
It’s a game-changer: once developed, it can automate complex tasks or aid human decision making.
There are many varieties of algorithms at our disposal today, and more are being developed constantly.
The challenge in building great AI – in addition to having the right data, of course – is picking the right algorithm for the problem.
How would you know which is the right algorithm? It’s hard to say; you may have to try a few different approaches.
Certainly, when you have many use cases, it is unlikely that a single approach can be used everywhere.
This means the problem is not just getting one algorithm to work, but having a way to apply many different types of algorithms depending on the context.
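The "try a few approaches and keep what works" idea can be sketched in plain Python. Everything here is illustrative: the tiny dataset, the two candidate models (a mean baseline and a closed-form one-feature linear fit), and the validation split are hypothetical stand-ins for a real model-selection pipeline.

```python
# Illustrative sketch: evaluate several candidate algorithms on held-out
# data and keep the best performer, rather than betting on one approach.

def mean_model(train):
    """Baseline: always predict the mean of the training targets."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def linear_model(train):
    """Closed-form least-squares fit of y = a*x + b (one feature)."""
    n = len(train)
    sx = sum(x for x, _ in train)
    sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train)
    sxy = sum(x * y for x, y in train)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def validation_error(model, data):
    """Mean squared error on a held-out validation set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical data where y = 2*x, so the linear candidate should win.
train = [(1, 2), (2, 4), (3, 6), (4, 8)]
valid = [(5, 10), (6, 12)]

candidates = {"mean": mean_model, "linear": linear_model}
scores = {name: validation_error(fit(train), valid)
          for name, fit in candidates.items()}
best = min(scores, key=scores.get)  # pick the lowest validation error
```

In a real pipeline the candidates would be library models and the scoring would use cross-validation, but the selection loop has the same shape.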
Finally, let’s talk about all the functional roles involved in making a use case successful.
This is probably the most often over-looked element in this whole equation.
In every enterprise data use case, many different teams must work together seamlessly to be successful.
What I mean by working together is this:
You first need the business context – someone who has the domain knowledge.
You then need the experts who can bring the data together – handling data integration and cleansing, all in a reliable and timely way.
You need people who can systematically use the data to derive answers, or use algorithms to build models that derive the answers.
These different roles exist because today’s enterprises and their business models are so vast and complex that no single team can do all these jobs.
Typically, people start with the data warehouse.
It was created to solve a very narrow and specific problem: when data is highly structured, give business analysts a way to use it for decision making.
It has many limitations:
First, it does not scale up to big data – only a small percentage of enterprise data is used in decision-making.
Second, the data warehouse does not offer a way to build AI, so there is no way to automate decision-making. Businesses still have to rely on a handful of business analysts to manually sift through the data, build dashboards, or create reports to support the business.
Instead of centralizing data and building a complex zoo of tools on top of a single storage system, there is another approach:
Separate compute and storage
The new approach uses a flexible compute layer to:
Connect to different data stores without migrating data, manage metadata across silos
Run diverse workloads to support a wide range of analytics approaches
Provide simplified interfaces for users with different skillsets and objectives
Effectively, we want to virtualize the analytics layer.
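One way to picture this virtualized analytics layer is a thin compute interface that registers many stores and queries them in place. This is only an illustrative sketch of the idea, not Databricks' implementation; the class names, store names, and records are all hypothetical.

```python
# Minimal sketch of a "virtualized" analytics layer: a single compute
# interface spans several data stores without copying data into one
# central system.

class InMemoryStore:
    """Stand-in for a warehouse, cloud bucket, or Hadoop cluster."""
    def __init__(self, records):
        self.records = records

    def scan(self):
        yield from self.records

class VirtualAnalyticsLayer:
    """Decoupled compute: registers stores and runs queries across them."""
    def __init__(self):
        self.catalog = {}  # uniform metadata across silos

    def register(self, name, store):
        self.catalog[name] = store

    def query(self, predicate):
        # One query spans every registered store -- no data migration.
        for name, store in self.catalog.items():
            for record in store.scan():
                if predicate(record):
                    yield name, record

layer = VirtualAnalyticsLayer()
layer.register("warehouse", InMemoryStore([{"user": "a", "spend": 120}]))
layer.register("cloud", InMemoryStore([{"user": "b", "spend": 30}]))
layer.register("hadoop", InMemoryStore([{"user": "c", "spend": 75}]))

# One predicate, evaluated against every silo in place.
big_spenders = list(layer.query(lambda r: r["spend"] > 50))
```

In Spark the same pattern appears as one engine reading from many sources through a common DataFrame interface; the point of the sketch is that compute, not storage, is the unifying layer.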
Viacom is the parent company of MTV and Nickelodeon. It is one of the largest media companies in the world, and its content is broadcast in more than 160 countries.
Delivering high-quality video and growing viewer engagement are core to Viacom’s mission.