Apache Spark - In industry
Dorian Beganovic
About me
• Experience with Spark
• Q-Park - 20 months
• “Big data” - Spark, Hadoop, Data Lake
• Data warehousing - Microsoft SQL Server
• Personal projects
• Machine learning on EEG data (3 months)
• Spark Structured streaming (1 month)
• Really interested in data systems
• All types of databases (relational, parallel, columnar…)
• Big data, cloud, distributed systems
Hadoop
Apache Hadoop
• Open Source framework for distributed storage and processing
• Origins are in the project “Nutch” back in 2002 (Cutting, Cafarella)
• 2006: Hadoop was split out of Nutch and developed further at Yahoo!, based on Google's GFS and MapReduce papers
• Based on the MapReduce programming model
• Fundamental assumption - hardware failures are common, so all modules are built to handle them automatically
• Clusters built of commodity hardware
• Pig, Hive, Mahout - higher-level tools built on top of MapReduce
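The MapReduce programming model mentioned above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's actual API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration. On a real cluster each phase runs in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Run map, shuffle and reduce one after another on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["spark and hadoop", "spark on yarn"]))
```

Tools such as Pig and Hive exist so that users do not have to write the map and reduce functions by hand.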
Spark
Apache Spark
• Open source fast and expressive cluster computing framework designed for Big data analytics
• Compatible with Apache Hadoop
• Developed at UC Berkeley's AMPLab in 2009 and donated to the Apache Software Foundation in 2013
• Original author - Matei Zaharia
• Databricks Inc. - the company behind Apache Spark (many other sponsors now)
Who uses Spark?
• In total over 3000 companies use Apache Spark
• Microsoft, Uber, Pinterest, Amazon, Oracle, Cisco, Verizon, Visa…
• https://spark.apache.org/powered-by.html
Why use Big Data tools?
• Complex analysis on 10TB+ of data
• Only use Big Data tools like Spark if your data doesn’t fit on a single machine
• Shuffle operation is extremely expensive (network IO is very slow)
AWS EC2 instance types
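Why a shuffle is expensive can be seen in a small sketch. It assumes hash partitioning by key, which is a common default but only one possible strategy. The helper names (`target_partition`, `shuffle`) are hypothetical and this is plain Python, not Spark code. Every record counted in `moved` would have to cross the network on a real cluster.

```python
import zlib


def target_partition(key, num_partitions):
    """Pick the partition that owns a key (stable hash, so runs are repeatable)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions):
    """Redistribute (key, value) records so that equal keys share a partition.

    Returns the new partitions and the number of records that changed
    partition, which on a real cluster would mean network transfer.
    """
    n = len(partitions)
    result = [[] for _ in range(n)]
    moved = 0
    for source, records in enumerate(partitions):
        for key, value in records:
            dest = target_partition(key, n)
            if dest != source:
                moved += 1
            result[dest].append((key, value))
    return result, moved


if __name__ == "__main__":
    parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)], [("b", 5)]]
    shuffled, moved = shuffle(parts)
    print(shuffled)
    print("records moved:", moved)
```

On a single machine the same regrouping is a cheap in-memory operation, which is one reason small data is better served by single-node tools.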
Why Spark - 1/4
1. Speed
[Figures: Speed 1/2 and Speed 2/2]
Why Spark - 2/4
2. Ease of use
[Figure: Ease of use]
Why Spark - 3/4
3. Generality
• You can use one framework (Spark) for:
• Processing batch (big) data - Spark SQL
• Processing streaming (big) data - Spark Streaming
• Machine learning at scale - Spark MLlib
• Graph analysis at scale - GraphX
Why Spark - 4/4
4. Runs everywhere
• Access data from anywhere:
• S3, HDFS, any JDBC database…
• Runs in:
• Standalone cluster mode
• EC2 (AWS Elastic Compute Cloud)
• Hadoop YARN
• Apache Mesos
Runs everywhere
• https://www.datanami.com/2017/09/29/hadoop-hard-find-strata-week/
• Currently Hadoop (HDFS) is slowly getting replaced with object storage (AWS S3…) in
the cloud
Spark Architecture
APIs
Spark SQL
• Successor to “Shark”, which was used to execute Hive queries in-memory
• As of Spark 2.0 - SQL:2003 standard support
• By far the most popular library (you’ll certainly use it for any task)
• ~90% of the codebase
• A lot faster than hand-written RDD code and provides higher-level operations (it is built on top of RDDs)
• API is inspired by Python and R data frames
Spark SQL Architecture
• Academic paper that introduced Spark SQL
API example
• Ability to execute SQL queries is extremely powerful
• The official documentation is a great place to start
Spark Streaming
• Scalable fault tolerant streaming system
• Very high level of abstraction and powerful APIs
• Receivers receive data and chop it into micro-batches (not a single record at a time)
• Spark processes batches and pushes out the result
• Input: files, Kafka, socket, Kinesis, Flume…
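The micro-batch idea can be sketched in plain Python. This is not the Spark Streaming API, and it simplifies one thing: Spark cuts batches by time interval, while this sketch cuts them by record count. The names `micro_batches` and `process` are hypothetical.

```python
from itertools import islice


def micro_batches(records, batch_size):
    """Chop a (possibly endless) record stream into fixed-size batches."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process(batch):
    """Stand-in for the per-batch job: here it just sums the values."""
    return sum(batch)


if __name__ == "__main__":
    for batch in micro_batches(range(1, 8), batch_size=3):
        print(batch, "->", process(batch))
```

Processing whole batches instead of single records trades some latency for throughput, since the per-job overhead is paid once per batch.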
* The RDD-based streaming API (DStreams) is being replaced by Structured Streaming
Spark Streaming Demo
Spark Structured Streaming Demo
Spark Structured Streaming
• High-level streaming API built on DataFrames
• Catalyst optimizer creates incremental execution plan
• Unifies streaming, interactive and batch queries
• Supports multiple sources and sinks
• E.g. aggregate data in a stream, then serve using JDBC
• “The simplest way to perform streaming analytics is not having to reason about streaming.”
• Probably the coolest thing Spark has
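What "incremental execution" means for a streaming aggregation can be shown with a toy example. This is a plain-Python sketch under the assumption of a simple count-per-key query. The names `RunningCount` and `batch_count` are hypothetical, and Spark's real engine does much more (fault-tolerant state, watermarks, output modes).

```python
from collections import Counter


class RunningCount:
    """Keeps a per-key count that is updated one micro-batch at a time."""

    def __init__(self):
        self.state = Counter()

    def update(self, batch):
        """Fold one batch of keys into the state and return the current totals."""
        self.state.update(batch)
        return dict(self.state)


def batch_count(all_records):
    """The same aggregation computed in one pass over the full dataset."""
    return dict(Counter(all_records))


if __name__ == "__main__":
    stream = [["in", "out"], ["in"], ["out", "in"]]
    agg = RunningCount()
    for batch in stream:
        print(agg.update(batch))
```

After the last batch the running result equals what `batch_count` returns for all records at once, which is the sense in which the same query serves both streaming and batch data.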
Spark Structured Streaming Demo
Spark MLlib
• Best solution for distributed machine learning
• Not all algorithms are implemented (some can’t be)
• Really slow on single node or small datasets compared to established libraries
• APIs are very similar to those in scikit-learn (but can be painful to use with Scala
or Java)
• Two APIs
• RDD based (in “maintenance” mode)
• DataFrame
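The similarity to scikit-learn is mostly the fit/transform pattern. In the DataFrame-based API an estimator's `fit()` returns a separate fitted model, which then offers `transform()`. The toy classes below (`ToyScaler`, `ToyScalerModel`) are hypothetical and written in plain Python on lists, only to show that shape. They are not Spark classes.

```python
from statistics import mean, pstdev


class ToyScalerModel:
    """Fitted transformer: holds the statistics learned by fit()."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def transform(self, values):
        """Standardize values using the stored mean and standard deviation."""
        return [(v - self.mu) / self.sigma for v in values]


class ToyScaler:
    """Estimator: fit() learns from data and returns a fitted model."""

    def fit(self, values):
        sigma = pstdev(values)
        # Guard against constant input to avoid division by zero.
        return ToyScalerModel(mean(values), sigma if sigma > 0 else 1.0)


if __name__ == "__main__":
    model = ToyScaler().fit([1.0, 2.0, 3.0])
    print(model.transform([1.0, 2.0, 3.0]))
```

Keeping the estimator and the fitted model as separate objects makes it easy to chain several steps into a pipeline and to reuse a fitted model on new data.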
Spark MLlib
Key take-aways
• Don’t use Spark if you don’t need to (i.e. unless you really have “big data”)
• The components and APIs have started consolidating and maturing (so what you learn now won’t be outdated in 6 months)
• Lots of resources on the internet are outdated so focus only on Spark 2.0 and above
• Spark is the most popular tool for analysis of Big Data and likely to remain so in the
future
• Future of Hadoop is very “cloudy” as more and more workloads are moving into the
cloud (object storage S3)
Useful resources
• Spark home page: https://spark.apache.org/
• Apache Zeppelin notebook: https://zeppelin.apache.org
• Spark Core (Internals): https://www.youtube.com/watch?v=7ooZ4S7Ay6Y
• Spark: The Definitive Guide (released in 2018): https://www.amazon.com/Spark-Definitive-Guide-Processing-Simple/dp/1491912219
