Elliott Cordo, Chief Architect at Caserta Concepts, gave this workshop at Big Data TechCon in Boston.
Spark, born in the UC Berkeley AMPLab around 2009, is an ultra-fast, general-purpose big data computing platform that provides very flexible options for processing and accessing data. It was open-sourced in 2010, adopted by the Apache Software Foundation in 2013, and became a top-level Apache project in 2014. Spark SQL is Spark's SQL-based API that lets users process data with Spark's powerful in-memory engine using familiar SQL, unifying access to structured data from a variety of sources.
This class provided a general overview of Spark, including the concept of an RDD (Resilient Distributed Dataset) and the available APIs (Java, Python, etc.). It then did a deep dive into Spark SQL to get comfortable with:
- Schema RDD: The powerful storage concept underlying Spark SQL
- A detailed overview of the Spark SQL API
- Spark SQL for interactive analysis via the shell and tools like IPython Notebook
- Spark’s integration with the Hive Metastore
- Accessing Spark through JDBC
- Building ETL Jobs using Spark: How to use SQL and Code together to get the best of both worlds
- Spark on AWS Elastic Map Reduce (EMR)
This was a hands-on class to learn about this exciting new SQL engine and how it can be used for ETL and interactive queries.
Here is the repository for the in-class exercise:
https://github.com/Caserta-Concepts/spark-techcon15
For more information, visit us at www.casertaconcepts.com
2. About Caserta Concepts
• Award-winning technology innovation consulting with expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry-recognized workforce
3. Today's Agenda
General overview of Spark
Spark and Hadoop
Key Concepts
Deployment Options
SQL and Dataframes API
Hands-on Exercises/Demo
4. About SPARK!
• General Cluster Computing
• Deployment Options
• Open-sourced in 2010
• Joined the Apache Software Foundation in 2013
• Became top level project early in 2014
5. More about Spark
• A Swiss army knife!
• Streaming, batch, and interactive
• RDD – Resilient Distributed Dataset
• Many flexible options for processing data
6. Plenty of Options
• APIs in Java, Scala, Python, R
• GraphX
• Streaming
• MLlib
• SQL goodness
7. Current State of Spark
• Now in version 1.3
• ~175 active contributors
• Most Hadoop distros now support, or are in the process of integrating, Spark
• Databricks is offering commercial support and fully
managed Spark Clusters
• A large number of organizations are using Spark
8. So why talk about Spark?
• Many competing big data processing platforms, query engines, etc.
• Hadoop Map Reduce is fairly mature
11. ..about Hadoop Map Reduce
• We can process very large datasets, splitting work across a large number of machines
• High recoverability/safety – intermediate data is written to disk
• Efficient and generally fast – move processing to data
• SQL on Hadoop via Hive
12. But Map Reduce has its downsides
• SLOW – disk based intermediate steps (local disk and
HDFS)
• Especially inefficient for iterative processing like
machine learning
• Challenging to conduct interactive analysis: run a job, go get coffee
13. ..about Spark
• In-memory – eliminates intermediate disk-based storage
• Performs a generalized form of Map Reduce, splitting work across a large number of machines
• Fast enough for interactive analysis
• Fault tolerant via lineage tracking
• SparkSQL!
14. So do we still need Hadoop?
No, but yes
• Why Hadoop?
• YARN
• HDFS
• Hadoop Map Reduce is mature and will still be appropriate for certain
workloads!
• Other services!
• But you can use other resource managers too:
• Mesos
• Spark Standalone
• And can work with other distributed file systems including:
• S3
• Gluster
• Tachyon
15. Right now, Hadoop and Spark are friends
• There are other file systems
• There are other resource managers
• Mesos
• Spark Standalone
• In a couple of years Spark and Hadoop may be in
competition
17. About RDDs
• Read only partitioned collection of data
• In-memory*
• Provide a high level abstraction for interacting with
distributed datasets
[Diagram: an RDD's partitions (1–4) spread across two worker nodes]
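As a minimal sketch (assuming a running PySpark shell, where sc is the SparkContext), you can create an RDD from a local collection and inspect its partitioning:
rdd = sc.parallelize(range(1000), 4)  # distribute the data over 4 partitions
rdd.getNumPartitions()                # 4
rdd.take(3)                           # [0, 1, 2] - pull a few elements back to the driver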
18. Spark Execution
Driver Program: Responsible for coordinating
execution and collecting results
Workers: where the actual work gets done!
19. Building a Data Pipeline
Basic operations in a Spark Data Pipeline
• Load data to RDD
• Perform transformations – manipulate and create new datasets from existing ones
• Perform actions – return or store data
Spark uses lazy evaluation – no transformations are applied until there is an action, as the sketch below shows.
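A minimal PySpark sketch of the load/transform/action flow (the file path is hypothetical):
lines = sc.textFile("/data/app.log")            # load data to an RDD
errors = lines.filter(lambda l: "ERROR" in l)   # transformation - nothing executes yet
errors.count()                                  # action - triggers evaluation of the whole lineage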
22. SPARK on Elastic Map Reduce
• Not currently a packaged application (coming soon?)
Maybe AWS has other plans for Spark?
• Easily bootstrapped:
• https://github.com/awslabs/emr-bootstrap-actions
aws emr create-cluster --name SparkCluster --ami-version 3.2.1 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=caserta-1 --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v1.2.0.a"]
23. Spark can be run locally too!
• Easy for development
• Local development is exactly the same as submitting work on a
cluster!
• IPython Notebook, or your favorite IDE (PyCharm)
• Install on your Mac with one command
brew install apache-spark
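Then a local interactive shell is one command away (a standard PySpark invocation, here using 4 local cores):
pyspark --master local[4]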
25. Spark SQL
• Spark's SQL engine
• Brand new – emerged as alpha in 1.0.1, roughly one year old
• Converts SQL into RDD operations
26. What happened to Shark?
• Spark SQL replaces the Shark query engine
• All new Catalyst optimizer – Shark leveraged the Hive
optimizer
• Hadoop Map Reduce optimization rules were not
applicable
• Writing optimization rules is now easy, which encourages more community participation
27. We love SQL!
• Huge population of highly skilled developers and analysts
• Compatible with tooling
• Many operations can easily and efficiently be expressed in SQL (see the sketch after this list):
• Filters
• Joins
• Group by’s
• Aggregates
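For illustration, all of these compose naturally in a single statement (the orders and customers tables are hypothetical):
sqlContext.sql("""
    SELECT c.state, COUNT(*) AS order_cnt, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.id
    WHERE o.status = 'shipped'
    GROUP BY c.state
""")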
28. But sometimes SQL is not the best tool!
• Some operations do not fit SQL well
• Iteration
• Row-by-row processing
• Other operations that are not set-based/SQL oriented
Spark can help!
• Spark API
• MLLIB – machine learning
Blend Spark SQL with other code in the same program
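As a sketch of that blend (assuming an events temp table has already been registered):
counts = sqlContext.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
# drop to the RDD API for row-by-row logic that SQL expresses poorly
scored = counts.rdd.map(lambda row: (row.user_id, min(1.0, row.cnt / 100.0)))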
29. How can you leverage SPARK SQL
• Batch ETL development
• Interactive
• Spark Shell (PySpark)
• Spark SQL CLI
• Thrift Server (JDBC)
• Beeline (see the commands after this list)
• Query Platforms
• BI Tools
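For example, the Thrift server ships with Spark, and Beeline can connect to it over JDBC (host and port below assume the defaults):
./sbin/start-thriftserver.sh
beeline -u jdbc:hive2://localhost:10000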
30. SPARK SQL can leverage the Hive Metastore
• Hive Metastore can also be leveraged by a wide array of
applications
• Spark
• Hive
• Impala
• Pig
• Available from HiveContext
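A minimal sketch, assuming Spark was built with Hive support and a hypothetical web_logs table exists in the metastore:
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)            # reads hive-site.xml to find the metastore
logs = hiveCtx.sql("SELECT * FROM web_logs LIMIT 10")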
31. Dataframes
• SchemaRDD was renamed DataFrame in version 1.3
• Modeled after R Dataframes and the popular Python
library Pandas
• Another example of making powerful data processing
even more accessible.
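A small sketch of how accessible that is (the file and column names are hypothetical):
df = sqlContext.jsonFile("people.json")   # DataFrame with an inferred schema
df.printSchema()
df.filter(df.age > 21).groupBy("dept").count().show()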
33. Spark Configuration
• SparkConf – parameters of your application and execution
• Master connection
• Cores and Memory
• Application name
• SparkContext – a connection to the Spark Execution Engine
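A minimal sketch of wiring the two together in Python (the app name and settings are illustrative):
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("local[4]")                # master connection
        .setAppName("my_etl_job")             # application name
        .set("spark.executor.memory", "2g"))  # memory (cores set similarly)
sc = SparkContext(conf=conf)                  # connection to the Spark execution engine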
34. Loading data
• sc.textFile – loads text files to an RDD, iterator per line of text
file
• sc.wholeTextFiles – loads text files to an RDD (key is name,
value is contents), iterator per text file
• Row – creates a “Row” of data with Schema
• sqlContext.inferSchema – creates a DataFrame from an RDD with the Row class applied
• sqlContext.jsonFile – loads a JSON file directly to a DataFrame
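Putting those together, a minimal sketch (file paths and fields are hypothetical):
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
lines = sc.textFile("people.txt")   # one record per line, e.g. "alice,34"
rows = lines.map(lambda l: l.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
people = sqlContext.inferSchema(rows)        # RDD of Rows -> DataFrame
events = sqlContext.jsonFile("events.json")  # JSON straight to a DataFrame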
35. SQL Fun
• registerTempTable – register a DataFrame as a temp table for SQL fun
• sqlContext.sql – allows you to execute SQL
statements via Spark
• sqlContext.registerFunction – create a UDF
callable within Spark SQL
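Continuing the sketch above (the people table and strLen UDF are illustrative):
from pyspark.sql.types import IntegerType
people.registerTempTable("people")
teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
sqlContext.registerFunction("strLen", lambda s: len(s), IntegerType())  # UDF usable inside SQL
sqlContext.sql("SELECT name, strLen(name) FROM people")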
36. Partitioning
• repartition – increase or decrease the number of partitions
• rdd.getNumPartitions – view the DataFrame as an RDD and get its number of partitions
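For example (continuing with the hypothetical people DataFrame):
people8 = people.repartition(8)   # shuffle the data into 8 partitions
people8.rdd.getNumPartitions()    # 8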
43. And what about other data sources?
Out of the box:
• Parquet
• JDBC
Spark 1.2 brought us a data sources API:
• Much easier to develop new integrations
• New integrations underway: Cassandra, CSV, Avro
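For instance, Parquet works out of the box (the paths are hypothetical):
orders = sqlContext.parquetFile("warehouse/orders.parquet")  # read Parquet to a DataFrame
orders.saveAsParquetFile("warehouse/orders_copy.parquet")    # write it back out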
45. Where do we think SparkSQL is headed?
• Spark in general will continue to gain momentum
• An increasing number of integrated data stores, file types, etc.
• Optimizer improvements – Catalyst should allow it to evolve very quickly!
• Subsequent improvements for interactive SQL – better performance and concurrency