Spark SQL at Big Data TechCon 2015
Presented by: Elliott Cordo, Chief Architect, Caserta Concepts
About Caserta Concepts
• Award-winning technology innovation consulting with
expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry-recognized workforce
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
Today’s Agenda
General overview of Spark
Spark and Hadoop
Key Concepts
Deployment Options
SQL and Dataframes API
Hands-on Exercises/Demo
About SPARK!
• General Cluster Computing
• Deployment Options
• Open sourced in 2010
• Moved to the Apache Software Foundation in 2013
• Became a top-level Apache project in early 2014
More about Spark
• A Swiss army knife!
• Streaming, batch, and interactive
• RDD – Resilient Distributed Dataset
• Many flexible options for processing data
Plenty of Options
• APIs in Java, Scala, Python, and R
• GraphX
• Streaming
• MLlib
• SQL goodness
Current State of Spark
• Now in version 1.3
• ~175 active contributors
• Most Hadoop distros now support Spark, or are in the process of
integrating it
• Databricks offers commercial support and fully managed
Spark clusters
• A large number of organizations are using Spark
So why talk about Spark?
• Many competing big data processing platforms, query
engines, etc.
• Hadoop Map Reduce is fairly mature
Recent Achievement: Gray Sort
Spark and Hadoop
..about Hadoop Map Reduce
• We can process very large datasets
• split processes across a large number of machines
• High recoverability/high safety → intermediate data is
written to disk
• Efficient and generally fast – move processing to data
• SQL on Hadoop via Hive
But Map Reduce has its downsides
• SLOW – disk-based intermediate steps (local disk and
HDFS)
• Especially inefficient for iterative processing → like
machine learning
• Challenging to conduct interactive analysis → run job –
go get coffee
..about Spark
• In-memory – eliminates intermediate disk based storage
• Performs a generalized form of map-reduce → split
processes across a large number of machines
• Fast enough for interactive analysis
• Fault tolerant via lineage tracking
• SparkSQL!
So do we still need Hadoop?
No, but yes
• Why Hadoop?
• YARN
• HDFS
• Hadoop Map Reduce is mature and will still be appropriate for certain
workloads!
• Other services!
• But you can use other resource managers too:
• Mesos
• Spark Standalone
• And can work with other distributed file systems including:
• S3
• Gluster
• Tachyon
Right Now Hadoop and Spark are friends
• There are other file systems
• There are other resource managers
• Mesos
• Spark Standalone
• In a couple of years Spark and Hadoop may be in
competition
Key Concepts
About RDDs
• Read-only, partitioned collection of data
• In-memory*
• Provide a high level abstraction for interacting with
distributed datasets
(Diagram: Partitions 1–4 of an RDD distributed across Node 1 and Node 2)
Spark Execution
Driver Program: Responsible for coordinating
execution and collecting results
Workers: where the actual work gets done!
Building a Data Pipeline
Basic operations in a Spark Data Pipeline
• Load data to RDD
• Perform Transformations → manipulate and create new datasets
from existing ones
• Actions → return or store data
Spark uses lazy evaluation – no transformations are
applied until there is an action
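As a minimal PySpark sketch of these steps (the file name access.log and the filter condition are purely illustrative assumptions):

from pyspark import SparkContext

sc = SparkContext("local[*]", "pipeline-sketch")

# Load data to an RDD - nothing is read yet, Spark only records the lineage
lines = sc.textFile("access.log")

# Transformations - define new datasets from existing ones (still lazy)
errors = lines.filter(lambda line: "ERROR" in line)
fields = errors.map(lambda line: line.split("\t"))

# Action - only now does Spark read the file and execute the pipeline
print(fields.count())

sc.stop()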
Deployment Options
Deploying Spark
Where:
• On-premise
• Databricks
• AWS EMR and EC2
Resource Manager:
• Local
• Yarn (Hadoop)
• Mesos
• Spark Standalone
SPARK on Elastic Map Reduce
• Not currently a packaged application (coming soon?) →
maybe AWS has other plans for Spark?
• Easily bootstrapped:
• https://github.com/awslabs/emr-bootstrap-actions
aws emr create-cluster --name SparkCluster --ami-version 3.2.1
--instance-type m3.xlarge --instance-count 3
--ec2-attributes KeyName=caserta-1 --applications Name=Hive
--bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v1.2.0.a"]
Spark can be run locally too!
• Easy for development
• Local development is exactly the same as submitting work on a
cluster!
• IPython Notebook, or your favorite IDE (PyCharm)
• Install on your Mac with one command
brew install apache-spark
Spark SQL and Dataframes
Spark SQL
• Spark’s SQL engine
• Brand new – emerged as alpha in 1.0.1, roughly one year old
• Converts SQL into RDD operations
What happened to Shark
• Replaces the Shark query engine
• All new Catalyst optimizer – Shark leveraged the Hive
optimizer
• Hadoop Map Reduce optimization rules were not
applicable
• Writing optimization rules made easy → more community
participation
We love SQL!
• Huge population of highly skilled developers and analysts
• Compatible with existing tooling
• Many operations can easily and efficiently be expressed
in SQL
• Filters
• Joins
• Group by’s
• Aggregates
But sometimes SQL is not the best tool!
• Some operations do not fit SQL well
• Iteration
• Row-by-row processing
• Other operations that are not set-based/SQL oriented
Spark can help!
• Spark API
• MLlib – machine learning
Blend Spark SQL with other code in the same program
How can you leverage SPARK SQL?
• Batch ETL development
• Interactive
• Spark Shell (PySpark)
• Spark SQL CLI
• Thrift Server (JDBC)
• Beeline
• Query Platforms
• BI Tools
SPARK SQL can leverage the Hive
metastore
• Hive Metastore can also be leveraged by a wide array of
applications
• Spark
• Hive
• Impala
• Pig
• Available from HiveContext
Dataframes
• SchemaRDD was renamed DataFrame in version 1.3
• Modeled after R Dataframes and the popular Python
library Pandas
• Another example of making powerful data processing
even more accessible.
API Cheat Sheet and Examples
Spark Configuration
• SparkConf parameters of your application and execution
• Master connection
• Cores and Memory
• Application name
• SparkContext – a connection to the Spark Execution Engine
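A minimal PySpark sketch of this setup; the master URL, memory value, and application name below are placeholder assumptions:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# SparkConf - parameters of the application and its execution
conf = (SparkConf()
        .setMaster("local[4]")               # master connection: local, yarn-client, mesos://..., spark://...
        .setAppName("spark-sql-demo")        # application name
        .set("spark.executor.memory", "2g")  # memory per executor
        .set("spark.cores.max", "4"))        # cores available to the application

# SparkContext - the connection to the Spark execution engine
sc = SparkContext(conf=conf)

# SQLContext - entry point to Spark SQL, reused in the sketches below
sqlContext = SQLContext(sc)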
Loading data
• sc.textFile – loads text files to an RDD, iterator per line of text
file
• sc.wholeTextFiles – loads text files to an RDD (key is name,
value is contents), iterator per text file
• Row – creates a “Row” of data with a schema
• sqlContext.inferSchema – creates a Dataframe from an RDD with
the Row class applied
• sqlContext.jsonFile – loads a JSON file directly to a Dataframe
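A hedged sketch of these calls against the 1.3-era Python API, reusing the sc and sqlContext created above; people.txt (lines like "Alice,34"), the docs/ directory, and people.json are hypothetical inputs:

from pyspark.sql import Row

# one RDD element per line of the text file
lines = sc.textFile("people.txt")

# one (filename, file contents) pair per file
docs = sc.wholeTextFiles("docs/")

# wrap each record in a Row to give it a schema, then infer a Dataframe from the RDD
people_rdd = lines.map(lambda l: l.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
people = sqlContext.inferSchema(people_rdd)

# JSON goes straight to a Dataframe, with the schema inferred from the documents
people_json = sqlContext.jsonFile("people.json")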
SQL Fun
• registerTempTable – register a dataframe as a
temp table for SQL fun
• sqlContext.sql – allows you to execute SQL
statements via Spark
• sqlContext.registerFunction – create a UDF
callable within Spark SQL
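Continuing the sketch (the people Dataframe comes from the loading example above; registerFunction is shown separately in the Python UDF sketch further below):

# register the Dataframe so SQL can refer to it by name
people.registerTempTable("people")

# run SQL through the SQLContext; the result is another Dataframe
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
print(adults.collect())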
Partitioning
• repartition – increase or decrease the number of
partitions
• rdd.getNumPartitions – project dataframe as RDD and
get number of partitions
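A short sketch, again reusing the hypothetical people Dataframe:

# project the Dataframe as an RDD and check how many partitions it has
print(people.rdd.getNumPartitions())

# repartition returns a new RDD spread over more (or fewer) partitions
repartitioned = people.rdd.repartition(8)
print(repartitioned.getNumPartitions())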
Spark SQL from Text File
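The original slide was a code screenshot; the following is a hedged end-to-end reconstruction in PySpark, where sales.txt and its comma-separated layout are assumptions:

from pyspark.sql import Row

raw = sc.textFile("sales.txt")  # e.g. lines like "2015-03-01,NY,129.99"

rows = raw.map(lambda l: l.split(",")) \
          .map(lambda f: Row(sale_date=f[0], state=f[1], amount=float(f[2])))

sales = sqlContext.inferSchema(rows)
sales.registerTempTable("sales")

sqlContext.sql("""
    SELECT state, SUM(amount) AS total
    FROM sales
    GROUP BY state
    ORDER BY total DESC
""").show()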
Spark SQL Loves JSON
Inferring Schema and Querying JSON
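Again a sketch in place of the original screenshot; people.json, with one JSON document per line and a nested address field, is an assumption:

people = sqlContext.jsonFile("people.json")

# the schema, including nested fields, is inferred from the JSON documents
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name, address.city FROM people WHERE age > 21").show()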
Another method – load directly in SQL
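With the data sources API (Spark 1.2+) the same JSON file can be exposed to SQL without any Python loading code; a sketch, assuming the same hypothetical people.json:

sqlContext.sql("""
    CREATE TEMPORARY TABLE people_sql
    USING org.apache.spark.sql.json
    OPTIONS (path 'people.json')
""")

sqlContext.sql("SELECT name, age FROM people_sql").show()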
Spark SQL + Hive
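A sketch of querying a Hive-managed table through the shared metastore; web_logs is a hypothetical table, and this requires a Spark build with Hive support:

from pyspark.sql import HiveContext

# HiveContext reads table definitions from the Hive metastore
hiveContext = HiveContext(sc)

top_pages = hiveContext.sql("SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")
top_pages.show()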
Python UDF
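The original slide showed a screenshot; a minimal reconstruction of sqlContext.registerFunction, where the strLen name and the people table are assumptions:

from pyspark.sql.types import IntegerType

# register a plain Python function as a UDF callable from Spark SQL
sqlContext.registerFunction("strLen", lambda s: len(s), IntegerType())

sqlContext.sql("SELECT name, strLen(name) AS name_length FROM people").show()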
And what about other data sources
Out of the box:
• Parquet
• JDBC
Spark 1.2 brought us a data sources API:
• Much easier to develop new integrations
• New integrations underway → Cassandra, CSV, Avro
Exercise
Where do we think SparkSQL is headed?
• Spark in general will continue to gain momentum
• Increasing number of integrated data stores, file types, etc.
• Optimizer improvements → Catalyst should allow it to
evolve very quickly!
• Subsequent improvements for interactive SQL – better
performance and concurrency
Community
Elliott Cordo
Chief Architect, Caserta Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com / info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
Thank You