Intro to Spark with Zeppelin


  1. Robert Hryniewicz, Data Evangelist (@RobHryniewicz): Intro to Spark & Zeppelin
  2. Apache Spark Background (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
  3. What is Spark?
     • Apache open source project, originally developed at AMPLab (University of California, Berkeley)
     • Data processing engine focused on in-memory distributed computing use cases
     • API: Scala, Python, Java and R
  4. Spark Ecosystem
     • Spark Core
     • Spark SQL, Spark Streaming, MLlib, GraphX
  5. Why Spark?
     • Elegant developer APIs
       – Single environment for data munging and machine learning (ML)
     • In-memory computation model
       – Fast!
       – Effective for iterative computations and ML
     • Machine learning
       – Implementation of distributed ML algorithms
       – Pipeline API (Spark ML)
  6. History of Hadoop & Spark
  7. Apache Spark Basics
  8. Spark Context: what is it?
     • Main entry point for Spark functionality
     • Represents a connection to a Spark cluster
     • Represented as sc in your code
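As a minimal sketch of how `sc` comes into existence outside a notebook, the snippet below builds a context by hand. The application name and the `local[*]` master are assumptions made for illustration; in Zeppelin or spark-shell the context already exists and `getOrCreate` simply returns it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local mode and an illustrative application name. On a cluster
// the master is normally supplied by the launcher or the notebook instead.
val conf = new SparkConf()
  .setAppName("intro-to-spark")
  .setMaster("local[*]")

// getOrCreate reuses a running context (such as the one a notebook provides)
// and only builds a new one when none exists yet.
val sc = SparkContext.getOrCreate(conf)

// The context is the handle for everything else, e.g. creating an RDD.
val sample = sc.parallelize(Seq(1, 2, 3))
println(s"master=${sc.master}, elements=${sample.count()}")
```

Using `getOrCreate` rather than `new SparkContext(conf)` keeps the snippet safe to paste into an environment that already has a context running.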
  9. RDD – Resilient Distributed Dataset
     • Primary abstraction in Spark
       – An immutable collection of objects (or records, or elements) that can be operated on in parallel
     • Distributed
       – Collection of elements partitioned across nodes in a cluster
       – Each RDD is composed of one or more partitions
       – User can control the number of partitions
       – More partitions => more parallelism
     • Resilient
       – Recover from node failures
       – An RDD keeps its lineage information, so it can be recreated from parent RDDs
     • Created by starting with a file in Hadoop Distributed File System (HDFS) or an existing collection in the driver program
     • May be persisted in memory for efficient reuse across parallel operations (caching)
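The properties listed for RDDs can be tried out in a few lines. This is a sketch under assumptions: it runs in local mode, and the data, partition count and variable names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

// Create an RDD from a collection in the driver program, asking for 4 partitions.
val numbers = sc.parallelize(1 to 100, numSlices = 4)

// Transformations return a new RDD; the parent RDD is never modified.
val doubled = numbers.map(_ * 2)

// Mark the RDD for in-memory persistence so later actions can reuse it.
doubled.cache()

val partitionCount = doubled.partitions.length // 4, inherited from the parent
val total = doubled.reduce(_ + _)              // 2 * (1 + ... + 100) = 10100

// Lineage: the chain of parent RDDs Spark would replay to rebuild lost partitions.
println(doubled.toDebugString)
```

The output of `toDebugString` varies between Spark versions, so it is printed rather than relied on.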
  10. RDD – Resilient Distributed Dataset
      • Diagram: RDD 1 (Partitions 1–4) and RDD 2 (Partitions 1–3) distributed across the cluster nodes
  11. Spark SQL
  12. Spark SQL Overview
      • Spark module for structured data processing (e.g. DB tables, JSON files)
      • Three ways to manipulate data:
        – DataFrames API
        – SQL queries
        – Datasets API
      • Same execution engine for all three
      • Spark SQL interfaces provide more information about both the structure of the data and the computation being performed than the basic Spark RDD API
  13. DataFrames
      • Conceptually equivalent to a table in a relational DB or a data frame in R/Python
      • API available in Scala, Java, Python, and R
      • Richer optimizations (significantly faster than RDDs)
      • Distributed collection of data organized into named columns
      • Underneath is an RDD
  14. DataFrames: created from various sources
      • Diagram: sources (CSV, Avro, HIVE, Spark SQL, Text) feeding a DataFrame (with RDD underneath) made of rows and columns (Col1, Col2, …, ColN)
      • DataFrames from HIVE
        – Reading and writing HIVE tables, including ORC
      • DataFrames from files
        – Built-in: JSON, JDBC, ORC, Parquet, HDFS
        – External plug-in: CSV, HBASE, Avro
      • DataFrames from existing RDDs
        – with the toDF() function
      • Data is described as a DataFrame with rows, columns and a schema
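Of the sources listed above, the conversion from an existing RDD is the easiest to try without external data. The sketch below assumes the Spark 1.6-era `SQLContext` API used elsewhere in the deck; the three rows and the variable names are invented, with column names borrowed from the flight examples.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("todf-sketch").setMaster("local[*]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._ // brings toDF() into scope for RDDs of tuples

// Hypothetical rows: (Origin, Dest, DepDelay).
val flightsRdd = sc.parallelize(Seq(
  ("IAD", "TPA", 8),
  ("IAD", "TPA", 19),
  ("IND", "BWI", -4)))

// Naming the columns gives the DataFrame its schema; types are inferred.
val flightsDf = flightsRdd.toDF("Origin", "Dest", "DepDelay")

flightsDf.printSchema()
flightsDf.show()
```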
  15. SQL Context and Hive Context
      • SQLContext
        – Entry point into all functionality in Spark SQL
        – All you need is a SparkContext
        – val sqlContext = new SQLContext(sc)
      • HiveContext
        – Superset of the functionality provided by the basic SQLContext
        – Read data from Hive tables
        – Access to Hive functions (UDFs)
        – val hc = new HiveContext(sc)
        – Use when your data resides in Hive
  16. Spark SQL Examples
  17. DataFrame Example: reading data from a table

      val df = sqlContext.table("flightsTbl")
      df.select("Origin", "Dest", "DepDelay").show(5)

      +------+----+--------+
      |Origin|Dest|DepDelay|
      +------+----+--------+
      |   IAD| TPA|       8|
      |   IAD| TPA|      19|
      |   IND| BWI|       8|
      |   IND| BWI|      -4|
      |   IND| BWI|      34|
      +------+----+--------+
  18. DataFrame Example: using the DataFrame API to filter data (show delays of more than 15 min)

      df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

      +------+----+--------+
      |Origin|Dest|DepDelay|
      +------+----+--------+
      |   IAD| TPA|      19|
      |   IND| BWI|      34|
      |   IND| JAX|      25|
      |   IND| LAS|      67|
      |   IND| MCO|      94|
      +------+----+--------+
  19. SQL Example: using SQL to query and filter data (again, show delays of more than 15 min)

      // Register Temporary Table
      df.registerTempTable("flights")

      // Use SQL to Query Dataset
      sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

      +------+----+--------+
      |Origin|Dest|DepDelay|
      +------+----+--------+
      |   IAD| TPA|      19|
      |   IND| BWI|      34|
      |   IND| JAX|      25|
      |   IND| LAS|      67|
      |   IND| MCO|      94|
      +------+----+--------+
  20. RDD vs. DataFrame
  21. RDDs vs. DataFrames
      • RDD
        – Lower-level API (more control)
        – Lots of existing code & users
        – Compile-time type-safety
      • DataFrame
        – Higher-level API (faster development)
        – Faster sorting, hashing, and serialization
        – More opportunities for automatic optimization
        – Lower memory pressure
  22. Data Frames are Intuitive
      • Task: find the average age by department
      • Sample data:

        dept  name      age
        Bio   H Smith   48
        CS    A Turing  54
        Bio   B Jones   43
        Phys  E Witten  61

      • The slide contrasts an RDD example with the equivalent DataFrame example; the code for both is not included in this transcript
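Since the code on this slide did not survive in the transcript, the following is a reconstruction of what such a comparison might look like, not the author's original code. It uses the four sample rows from the slide and assumes a local Spark 1.6-style setup; all variable names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("avg-age-sketch").setMaster("local[*]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// The sample rows from the slide: (dept, name, age).
val people = sc.parallelize(Seq(
  ("Bio", "H Smith", 48),
  ("CS", "A Turing", 54),
  ("Bio", "B Jones", 43),
  ("Phys", "E Witten", 61)))

// RDD version: carry a (sum, count) pair per department, then divide.
val avgByDeptRdd = people
  .map { case (dept, _, age) => (dept, (age, 1)) }
  .reduceByKey { case ((sum1, n1), (sum2, n2)) => (sum1 + sum2, n1 + n2) }
  .mapValues { case (sum, n) => sum.toDouble / n }

// DataFrame version: state the aggregation and let Spark SQL plan it.
val avgByDeptDf = people.toDF("dept", "name", "age")
  .groupBy("dept")
  .agg(avg("age"))

avgByDeptRdd.collect().foreach(println)
avgByDeptDf.show()
```

Both versions compute the same averages (45.5 for Bio, 54.0 for CS, 61.0 for Phys); the DataFrame version says what is wanted and leaves the bookkeeping of sums and counts to the engine.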
  23. Spark SQL Optimizations
      • Spark SQL uses an underlying optimization engine (Catalyst)
        – Catalyst can perform intelligent optimization since it understands the schema
      • Spark SQL does not materialize all the columns (as with RDD), only what’s needed
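One way to see Catalyst at work is to ask a DataFrame for its query plans. The sketch below assumes a local Spark 1.6-style setup and uses invented flight rows; the exact plan text differs between Spark versions, so no output is shown.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("catalyst-sketch").setMaster("local[*]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._ // enables toDF() and the $"column" syntax

// Hypothetical rows: (Origin, Dest, DepDelay).
val flights = sc.parallelize(Seq(
  ("IAD", "TPA", 8),
  ("IAD", "TPA", 19),
  ("IND", "BWI", 34))).toDF("Origin", "Dest", "DepDelay")

// Only two of the three columns are requested after the filter.
val delayed = flights.filter($"DepDelay" > 15).select("Origin", "Dest")

// extended = true prints the logical plans as well as the physical plan.
delayed.explain(true)
```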
  24. Apache Zeppelin & HDP Sandbox
  25. Apache Zeppelin
      • Web-based notebook for interactive analytics
      • Use cases
        – Data exploration and discovery
        – Visualization
        – Interactive snippet-at-a-time experience
        – “Modern Data Science Studio”
      • Features
        – Deeply integrated with Spark and Hadoop
        – Supports multiple language backends
        – Pluggable “Interpreters”
  26. What’s not included with Spark?
      • Diagram labels: Resource Management; Storage; Applications
      • Spark components shown: Spark Core Engine; Scala, Java and Python libraries; MLlib (Machine learning); Spark SQL*; Spark Streaming*
  27. HDP Sandbox: what’s included in the Sandbox?
      • Zeppelin
      • Latest Hortonworks Data Platform (HDP)
        – Spark
        – YARN: resource management
        – HDFS: distributed storage layer
        – And many more components...
      • Diagram: Scala, Java, Python and R APIs; Spark Core Engine with Spark SQL, Spark Streaming, MLlib and GraphX; YARN; HDFS (nodes 1 … N)
  28. Access patterns enabled by YARN
      • Diagram: Batch, Interactive and Real-Time applications on YARN (Data Operating System) over HDFS (Hadoop Distributed File System), nodes 1 … N
      • Batch: needs to happen, but with no timeframe limitations
      • Interactive: needs to happen at human time
      • Real-Time: needs to happen at machine execution time
  29. Why Spark on YARN?
      • Utilize existing HDP cluster infrastructure
      • Resource management
        – Share Spark workloads with other workloads like PIG, HIVE, etc.
      • Scheduling and queues
      • Diagram: a client running the Spark Driver, a Spark Application Master in a YARN container, and three Spark Executors, each in its own YARN container running two tasks
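The layout in the diagram (driver on the client, three executors with two task slots each) can be expressed as Spark configuration. This is a sketch only: it assumes the Spark 1.x `yarn-client` master string (newer Spark releases use `yarn` together with a deploy mode), and the application name, memory size and queue name are illustrative. It builds the configuration without contacting a cluster.

```scala
import org.apache.spark.SparkConf

// Assumption: Spark 1.x-style master string; values mirror the diagram.
val yarnConf = new SparkConf()
  .setAppName("spark-on-yarn-sketch")   // hypothetical application name
  .setMaster("yarn-client")             // driver runs in the client process
  .set("spark.executor.instances", "3") // three executor containers
  .set("spark.executor.cores", "2")     // two concurrent tasks per executor
  .set("spark.executor.memory", "2g")   // illustrative container size
  .set("spark.yarn.queue", "default")   // YARN queue used for scheduling

// Print the resulting key/value pairs for inspection.
println(yarnConf.toDebugString)
```

In practice these settings are more often passed on the command line or set in the notebook interpreter than hard-coded, so the same application can run against different queues and cluster sizes.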
  30. Why HDFS? Fault-tolerant distributed storage
      • Divide files into big blocks and distribute 3 copies randomly across the cluster
      • Processing: data locality
      • Not just storage but computation
      • Diagram: a logical file split into blocks 1–4, with three copies of each block spread across the cluster
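The block-and-replica arithmetic behind the diagram can be worked through in plain Scala. The sketch assumes a 128 MB block size (the Hadoop 2 default) and a replication factor of 3; the 500 MB file size is a made-up figure chosen so that the file splits into four blocks, as in the diagram.

```scala
// Assumptions: 128 MB blocks (Hadoop 2 default), 3 replicas per block,
// and a hypothetical 500 MB file.
val blockSizeBytes = 128L * 1024 * 1024
val replication = 3
val fileSizeBytes = 500L * 1024 * 1024

// Number of blocks: file size divided by block size, rounded up.
val numBlocks = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes

// Every block is stored `replication` times somewhere in the cluster.
val storedReplicas = numBlocks * replication

// The last block only occupies the bytes it holds, so raw usage is
// simply the file size times the replication factor.
val rawStorageBytes = fileSizeBytes * replication

println(s"$numBlocks blocks, $storedReplicas block replicas, " +
  s"${rawStorageBytes / (1024 * 1024)} MB of raw storage")
```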
  31. There’s more to HDP (Hortonworks Data Platform 2.4.x)
      • Governance & Integration
        – Data Lifecycle & Governance: Falcon, Atlas
        – Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
      • Data Access
        – Batch: MapReduce
        – Script: Pig
        – Search: Solr
        – SQL: Hive
        – NoSQL: HBase, Accumulo, Phoenix
        – Stream: Storm
        – In-memory
        – Others: ISV Engines
        – Tez, Slider
      • Data Management
        – YARN: Data Operating System
        – HDFS: Hadoop Distributed File System (nodes 1 … N)
      • Security
        – Administration, Authentication, Authorization, Auditing, Data Protection
        – Ranger, Knox, Atlas, HDFS Encryption
      • Operations
        – Provisioning, Managing & Monitoring: Ambari, Cloudbreak, Zookeeper
        – Scheduling: Oozie
      • Deployment Choice: Linux, Windows, On-Premise, Cloud
  32. Hortonworks Community Connection
  33. community.hortonworks.com
  34. community.hortonworks.com
  35. HCC DS, Analytics, and Spark Related Questions Sample
  36. Lab Preview
  37. Link to Tutorials with Lab Instructions: http://tinyurl.com/hwx-intro-to-spark
  38. Thank you! community.hortonworks.com
