Intro to Spark & Zeppelin - Crash Course - HS16SJ
- 11. 11 © Hortonworks Inc. 2011 –2016. All Rights Reserved
RDD - Resilient Distributed Dataset
• Primary abstraction in Spark
– An immutable collection of objects (or records, or elements) that can be operated on in parallel
• Distributed
– Collection of elements partitioned across nodes in a cluster
– Each RDD is composed of one or more partitions
– User can control the number of partitions
– More partitions => more parallelism
• Resilient
– Recover from node failures
– An RDD keeps its lineage information -> it can be recreated from parent RDDs
• Created by starting with a file in Hadoop Distributed File System (HDFS) or an existing collection in the driver program
• May be persisted in memory for efficient reuse across parallel operations (caching)
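The RDD properties listed on this slide can be tried out with a few lines of Scala. The following is a minimal sketch, not code from the deck: the object name `RddSketch`, the helper `evensTimesTen`, the local master setting, and the sample numbers are all assumptions made for illustration. In Zeppelin or spark-shell a SparkContext named `sc` already exists, so only the body of the helper would be needed there.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RddSketch {
  // Hypothetical helper: builds a small RDD pipeline from a driver-side collection.
  def evensTimesTen(sc: SparkContext, partitions: Int): RDD[Int] = {
    // Create an RDD from an existing collection in the driver program,
    // asking for an explicit number of partitions.
    // (A file would be loaded with sc.textFile(...) instead.)
    val numbers = sc.parallelize(1 to 1000, numSlices = partitions)

    // Transformations return new RDDs; the parent RDD is never modified.
    numbers.filter(_ % 2 == 0).map(_ * 10)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)

    val evens = evensTimesTen(sc, partitions = 8)
    println(s"partitions: ${evens.partitions.length}")

    // The lineage back to the parent RDDs is what lets Spark
    // recompute lost partitions after a node failure.
    println(evens.toDebugString)

    // Persist in memory so the two actions below reuse the computed data.
    evens.cache()
    println(s"count: ${evens.count()}")
    println(s"sum:   ${evens.reduce(_ + _)}")

    sc.stop()
  }
}
```

Calling `cache()` only marks the RDD for persistence; the data is materialized by the first action and reused by the second.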
- 21.
SQL Example
// Register Temporary Table
df.registerTempTable("flights")
// Use SQL to Query Dataset
// (triple quotes are required for a string literal that spans several lines)
sqlContext.sql("""SELECT Origin, Dest, DepDelay
FROM flights
WHERE DepDelay > 15 LIMIT 5""").show()
Using SQL to Query and Filter Data (again, show delays more than 15 min)
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
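The same filter can be written with either the DataFrame API or a SQL string. The sketch below puts both side by side, using the Spark 1.x `SQLContext` API that matches the slide. It is an illustration under stated assumptions: the object name `FlightsSketch`, the helpers `viaApi` and `viaSql`, and the three sample rows are made up, and the deck's actual flights dataset is not reproduced here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object FlightsSketch {
  // DataFrame API: projection, filter, and limit expressed as method calls.
  def viaApi(df: DataFrame): DataFrame =
    df.select("Origin", "Dest", "DepDelay").filter(df("DepDelay") > 15).limit(5)

  // SQL: register the DataFrame as a temporary table, then query it by name.
  def viaSql(sqlContext: SQLContext, df: DataFrame): DataFrame = {
    df.registerTempTable("flights")
    sqlContext.sql(
      """SELECT Origin, Dest, DepDelay
        |FROM flights
        |WHERE DepDelay > 15 LIMIT 5""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a standalone run.
    val sc = new SparkContext(
      new SparkConf().setAppName("flights-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up rows for illustration only (airport codes are placeholders).
    val df = Seq(
      ("AAA", "BBB", 5),
      ("AAA", "CCC", 40),
      ("DDD", "BBB", 16)
    ).toDF("Origin", "Dest", "DepDelay")

    viaApi(df).show()
    viaSql(sqlContext, df).show()

    sc.stop()
  }
}
```

Both calls go through the same query planner, so the choice between them is mostly a matter of which reads better for the query at hand.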
- 38.
Access patterns enabled by YARN
[Diagram: applications (Batch, Interactive, Real-Time) running on YARN: Data Operating System, across cluster nodes 1 to N, on top of HDFS: Hadoop Distributed File System]
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time
- 40.
Why HDFS?
Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation
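The block-and-replica idea from the bullets above can be simulated in plain Scala, without a cluster. This is a toy sketch, not HDFS code: the object name `HdfsBlocksSketch`, the 128 MB block size, the 500 MB file, and the six node names are assumptions chosen for illustration, and the purely random placement follows the slide's wording rather than the placement policy a real NameNode applies.

```scala
import scala.util.Random

object HdfsBlocksSketch {
  // Number of blocks needed to hold a file of the given size (ceiling division).
  def blockCount(fileBytes: Long, blockBytes: Long): Int =
    ((fileBytes + blockBytes - 1) / blockBytes).toInt

  // For each block, pick `replicas` distinct nodes at random.
  def placeBlocks(blocks: Int, nodes: Seq[String], replicas: Int,
                  rng: Random): Map[Int, Seq[String]] = {
    require(replicas <= nodes.size, "need at least as many nodes as replicas")
    (1 to blocks).map(b => b -> rng.shuffle(nodes).take(replicas)).toMap
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    // A 500 MB file with 128 MB blocks needs 4 blocks (the last one partly filled).
    val blocks = blockCount(fileBytes = 500 * mb, blockBytes = 128 * mb)
    val nodes = (1 to 6).map(i => s"node$i")
    val placement = placeBlocks(blocks, nodes, replicas = 3, rng = new Random(42))

    placement.toSeq.sortBy(_._1).foreach { case (block, hosts) =>
      println(s"block $block -> ${hosts.mkString(", ")}")
    }
  }
}
```

Because every block lives on three different nodes, losing one node leaves two copies of each affected block, and a computation can be scheduled on whichever node already holds the block it needs.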
[Figure: a logical file is divided into blocks 1 to 4; each block is stored as 3 copies on different nodes of the cluster]
- 41.
There’s more to HDP
Hortonworks Data Platform 2.4.x
GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
DATA ACCESS
– Batch: MapReduce
– Script: Pig
– Search: Solr
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– In-memory
– Others: ISV Engines
– Tez, Slider
SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection
– Ranger, Knox, Atlas, HDFS Encryption
OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie
DATA MANAGEMENT
– YARN: Data Operating System (cluster nodes 1 to N)
– HDFS: Hadoop Distributed File System
Deployment Choice: Linux, Windows, On-Premise, Cloud