Elliott Cordo, Chief Architect at Caserta Concepts, gave this workshop at Big Data TechCon in Boston.
Spark, born in the UC Berkeley AMPLab around 2009, is an ultra-fast, general-purpose big data computing platform that provides very flexible options for processing and accessing data. It was open-sourced in 2010, adopted by the Apache Software Foundation in 2013, and became a top-level Apache project in 2014. Spark SQL is Spark's SQL-based API that lets users process data with Spark's powerful in-memory engine using familiar SQL, unifying access to structured data from a variety of sources.
This class provided a general overview of Spark, including the concept of an RDD (Resilient Distributed Dataset) and the available APIs (Java, Python, etc.). It then did a deep dive into Spark SQL to get comfortable with:
- Schema RDD: The powerful storage concept underlying Spark SQL
- A detailed overview of the Spark SQL API
- Spark SQL for interactive analysis via the shell and tools like IPython Notebook
- Spark’s integration with the Hive Metastore
- Accessing Spark through JDBC
- Building ETL Jobs using Spark: How to use SQL and Code together to get the best of both worlds
- Spark on AWS Elastic Map Reduce (EMR)
This was a hands-on class to learn about this exciting new SQL engine and how it can be used for ETL and interactive queries.
Here is the repository for the in-class exercise:
https://github.com/Caserta-Concepts/spark-techcon15
For more information, visit us at www.casertaconcepts.com
2. About Caserta Concepts
• Award-winning technology innovation consulting with expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry-recognized workforce
3. Today's Agenda
General overview of Spark
Spark and Hadoop
Key Concepts
Deployment Options
SQL and Dataframes API
Hands-on Exercises/Demo
4. About SPARK!
• General Cluster Computing
• Deployment Options
• Open-sourced in 2010
• Joined the Apache Software Foundation in 2013
• Became top level project early in 2014
5. More about Spark
• A Swiss army knife!
• Streaming, batch, and interactive
• RDD – Resilient Distributed Dataset
• Many flexible options for processing data
6. Plenty of Options
• APIs in Java, Scala, Python, R
• GraphX
• Streaming
• MLlib
• SQL goodness
7. Current State of Spark
• Now in version 1.3
• ~175 active contributors
• Most Hadoop distros now support, or are in the process of integrating, Spark
• Databricks is offering commercial support and fully
managed Spark Clusters
• A large number of organizations are using Spark
8. So why talk about Spark?
• Many competing big data processing platforms, query engines, etc.
• Hadoop Map Reduce is fairly mature
11. ..about Hadoop Map Reduce
• We can process very large datasets, splitting work across a large number of machines
• High recoverability/safety – intermediate data is written to disk
• Efficient and generally fast – move processing to data
• SQL on Hadoop via Hive
12. But Map Reduce has its downsides
• SLOW – disk based intermediate steps (local disk and
HDFS)
• Especially inefficient for iterative processing like
machine learning
• Challenging to conduct interactive analysis: run a job, go get coffee
13. ..about Spark
• In-memory – eliminates intermediate disk-based storage
• Performs a generalized form of Map Reduce, splitting work across a large number of machines
• Fast enough for interactive analysis
• Fault tolerant via lineage tracking
• SparkSQL!
14. So do we still need Hadoop?
No, but yes
• Why Hadoop?
• YARN
• HDFS
• Hadoop Map Reduce is mature and will still be appropriate for certain
workloads!
• Other services!
• But you can use other resource managers too:
• Mesos
• Spark Standalone
• And can work with other distributed file systems including:
• S3
• Gluster
• Tachyon
15. Right now, Hadoop and Spark are friends
• There are other file systems
• There are other resource managers
• Mesos
• Spark Standalone
• In a couple of years Spark and Hadoop may be in
competition
17. About RDDs
• Read only partitioned collection of data
• In-memory*
• Provide a high level abstraction for interacting with
distributed datasets
[Diagram: an RDD's partitions (1–4) spread across two worker nodes]
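As a minimal sketch (assuming a running PySpark shell, where sc is the SparkContext), you can create an RDD from a local collection and inspect its partitioning:
rdd = sc.parallelize(range(1000), 4)  # distribute the data over 4 partitions
rdd.getNumPartitions()                # 4
rdd.take(3)                           # [0, 1, 2] - pull a few elements back to the driver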
18. Spark Execution
Driver Program: Responsible for coordinating
execution and collecting results
Workers: where the actual work gets done!
19. Building a Data Pipeline
Basic operations in a Spark Data Pipeline
• Load data to RDD
• Perform transformations – manipulate and create new datasets from existing ones
• Perform actions – return or store data
Spark uses lazy evaluation – no transformations are applied until there is an action, as the sketch below shows.
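A minimal PySpark sketch of the load/transform/action flow (the file path is hypothetical):
lines = sc.textFile("/data/app.log")            # load data to an RDD
errors = lines.filter(lambda l: "ERROR" in l)   # transformation - nothing executes yet
errors.count()                                  # action - triggers evaluation of the whole lineage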
22. SPARK on Elastic Map Reduce
• Not currently a packaged application (coming soon?)
Maybe AWS has other plans for Spark?
• Easily bootstrapped:
• https://github.com/awslabs/emr-bootstrap-actions
aws emr create-cluster --name SparkCluster --ami-version 3.2.1 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=caserta-1 --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v1.2.0.a"]
23. Spark can be run locally too!
• Easy for development
• Local development is exactly the same as submitting work on a
cluster!
• IPython Notebook, or your favorite IDE (PyCharm)
• Install on your Mac with one command
brew install apache-spark
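Then a local interactive shell is one command away (a standard PySpark invocation, here using 4 local cores):
pyspark --master local[4]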
25. Spark SQL
• Spark's SQL engine
• Brand new – emerged as alpha in 1.0.1, roughly one year old
• Converts SQL into RDD operations
26. What happened to Shark?
• Spark SQL replaces the Shark query engine
• All new Catalyst optimizer – Shark leveraged the Hive
optimizer
• Hadoop Map Reduce optimization rules were not
applicable
• Writing optimization rules is now easy, which encourages more community participation
27. We love SQL!
• Huge population of highly skilled developers and analysts
• Compatible with tooling
• Many operations can easily and efficiently be expressed in SQL (see the sketch after this list):
• Filters
• Joins
• Group by’s
• Aggregates
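For illustration, all of these compose naturally in a single statement (the orders and customers tables are hypothetical):
sqlContext.sql("""
    SELECT c.state, COUNT(*) AS order_cnt, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.id
    WHERE o.status = 'shipped'
    GROUP BY c.state
""")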
28. But sometimes SQL is not the best tool!
• Some operations do not fit SQL well
• Iteration
• Row-by-row processing
• Other operations that are not set-based/SQL oriented
Spark can help!
• Spark API
• MLLIB – machine learning
Blend Spark SQL with other code in the same program
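As a sketch of that blend (assuming an events temp table has already been registered):
counts = sqlContext.sql("SELECT user_id, COUNT(*) AS cnt FROM events GROUP BY user_id")
# drop to the RDD API for row-by-row logic that SQL expresses poorly
scored = counts.rdd.map(lambda row: (row.user_id, min(1.0, row.cnt / 100.0)))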
29. How can you leverage SPARK SQL
• Batch ETL development
• Interactive
• Spark Shell (PySpark)
• Spark SQL CLI
• Thrift Server (JDBC)
• Beeline (see the commands after this list)
• Query Platforms
• BI Tools
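For example, the Thrift server ships with Spark, and Beeline can connect to it over JDBC (host and port below assume the defaults):
./sbin/start-thriftserver.sh
beeline -u jdbc:hive2://localhost:10000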
30. SPARK SQL can leverage the Hive Metastore
• Hive Metastore can also be leveraged by a wide array of
applications
• Spark
• Hive
• Impala
• Pig
• Available from HiveContext
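A minimal sketch, assuming Spark was built with Hive support and a hypothetical web_logs table exists in the metastore:
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)            # reads hive-site.xml to find the metastore
logs = hiveCtx.sql("SELECT * FROM web_logs LIMIT 10")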
31. Dataframes
• SchemaRDD was renamed DataFrame in version 1.3
• Modeled after R Dataframes and the popular Python
library Pandas
• Another example of making powerful data processing
even more accessible.
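A small sketch of how accessible that is (the file and column names are hypothetical):
df = sqlContext.jsonFile("people.json")   # DataFrame with an inferred schema
df.printSchema()
df.filter(df.age > 21).groupBy("dept").count().show()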
33. Spark Configuration
• SparkConf – parameters of your application and execution
• Master connection
• Cores and Memory
• Application name
• SparkContext – a connection to the Spark Execution Engine
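A minimal sketch of wiring the two together in Python (the app name and settings are illustrative):
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("local[4]")                # master connection
        .setAppName("my_etl_job")             # application name
        .set("spark.executor.memory", "2g"))  # memory (cores set similarly)
sc = SparkContext(conf=conf)                  # connection to the Spark execution engine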
34. Loading data
• sc.textFile – loads text files to an RDD, iterator per line of text
file
• sc.wholeTextFiles – loads text files to an RDD (key is name,
value is contents), iterator per text file
• Row – creates a “Row” of data with Schema
• sqlContext.inferSchema – creates a DataFrame from an RDD with the Row class applied
• sqlContext.jsonFile – loads a JSON file directly to a DataFrame
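Putting those together, a minimal sketch (file paths and fields are hypothetical):
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
lines = sc.textFile("people.txt")   # one record per line, e.g. "alice,34"
rows = lines.map(lambda l: l.split(",")).map(lambda p: Row(name=p[0], age=int(p[1])))
people = sqlContext.inferSchema(rows)        # RDD of Rows -> DataFrame
events = sqlContext.jsonFile("events.json")  # JSON straight to a DataFrame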
35. SQL Fun
• registerTempTable – register a DataFrame as a temp table for SQL fun
• sqlContext.sql – allows you to execute SQL
statements via Spark
• sqlContext.registerFunction – create a UDF
callable within Spark SQL
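Continuing the sketch above (the people table and strLen UDF are illustrative):
from pyspark.sql.types import IntegerType
people.registerTempTable("people")
teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
sqlContext.registerFunction("strLen", lambda s: len(s), IntegerType())  # UDF usable inside SQL
sqlContext.sql("SELECT name, strLen(name) FROM people")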
36. Partitioning
• repartition – increase or decrease the number of partitions
• rdd.getNumPartitions – view the DataFrame as an RDD and get its number of partitions
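For example (continuing with the hypothetical people DataFrame):
people8 = people.repartition(8)   # shuffle the data into 8 partitions
people8.rdd.getNumPartitions()    # 8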
43. And what about other data sources?
Out of the box:
• Parquet
• JDBC
Spark 1.2 brought us a data sources API:
• Much easier to develop new integrations
• New integrations underway: Cassandra, CSV, Avro
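For instance, Parquet works out of the box (the paths are hypothetical):
orders = sqlContext.parquetFile("warehouse/orders.parquet")  # read Parquet to a DataFrame
orders.saveAsParquetFile("warehouse/orders_copy.parquet")    # write it back out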
45. Where do we think SparkSQL is headed?
• Spark in general will continue to gain momentum
• An increasing number of integrated data stores, file types, etc.
• Optimizer improvements – Catalyst should allow it to evolve very quickly!
• Subsequent improvements for interactive SQL – better performance and concurrency