Spark Webinar 
October 2nd, 2014 
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
Vinay Shukla & Ram Venkatesh
Agenda 
• What is Spark? 
• What have we done with Spark so far 
• Tech Previews 
• Brief on Spark 1.1.0 Tech Preview 
• Multi-tenant & multi-workload with YARN 
• Introducing SPARK-3561 
• Get Involved 
Let’s Talk About Apache Spark 
What is Spark? 
• Spark is a general-purpose big data engine with simple APIs that let data scientists and 
engineers familiar with Scala, Python, and Java build ad hoc interactive analytics, iterative 
machine learning, and other use cases suited to interactive, in-memory processing of GB- to 
TB-sized datasets. 
What is Spark? 
[Figure: Spark architecture. Layers: Spark API; Spark Compiler / Optimizer; DAG Runtime; Execution Engine; cluster managers (Spark Cluster, YARN, Mesos). An RDD lineage of (Hadoop|FlatMapped|Filter|MapPartitions|Shuffled)RDD splits into stage0, (Hadoop|FlatMapped|Filter|MapPartitions)RDD, run as ShuffleMapTasks (flatMap | map), and stage1, ShuffledRDD, run as a ResultTask (reduceByKey). Components shown: Client, Cluster, DAGScheduler, ActiveJob, Task, SparkAM.]
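The stage0/stage1 split on this slide can be illustrated without a cluster. The sketch below is plain Scala collections code, not Spark itself: it assumes a word count as the example job, and the names StageSketch, stage0, stage1, and the sample input are hypothetical. stage0 stands in for the pipelined flatMap/map work of the ShuffleMapTasks on each partition, and stage1 stands in for the reduceByKey work of the ResultTask after the shuffle.

```scala
object StageSketch {
  // Hypothetical input: two partitions of text lines.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("spark on yarn", "spark on tez"),
    Seq("yarn and tez")
  )

  // stage0: narrow transformations pipelined per partition (flatMap, then map).
  def stage0(partition: Seq[String]): Seq[(String, Int)] =
    partition.flatMap(_.split(" ")).map(word => (word, 1))

  // stage1: once pairs are grouped by key (the shuffle), sum the counts per key.
  def stage1(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  // Shuffle boundary: the outputs of all stage0 tasks are combined before stage1 runs.
  def run(): Map[String, Int] = stage1(partitions.flatMap(stage0))

  def main(args: Array[String]): Unit =
    run().toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
}
```

In Spark the same shape would typically be written as flatMap, map, and reduceByKey on an RDD, with the scheduler placing the stage boundary at the shuffle.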
Let’s Talk About Apache Spark (cont’d) 
What’s Our Spark Strategy? 
• Hortonworks is focused on enabling Spark for Enterprise Hadoop so users can deploy Spark-based 
applications along with their other Hadoop workloads in a consistent, predictable, and robust way. 
– Leverage the scale and multi-tenancy provided by YARN so that Spark's memory- and CPU-intensive apps run with 
predictable performance 
– Integrate Spark with HDP’s operations, security, governance, scalability, availability, and multi-tenancy capabilities 
Do We Have a Plan to Support Spark? Yes. 
• Spark is available now as a Technology Preview. 
• We are following our standard Tech Preview -> GA process, as we did for Storm, Falcon, etc. 
• Spark will be added to our HDP Enterprise Plus subscription when it is GA ready. 
Spark Timeline 
Break-down 
Spark Roadmap 
2014 timeline: 1.0.1 TP Refresh (July), 1.1.0 TP Refresh (Sept), 1.2.0 GA (Dec) 
• Hive 13 support 
• Limited ORC support 
• Spark on YARN: Deployment Best Practices 
• Ambari Support for Spark Install/Stop/Config 
• Spark on Kerberized Cluster 
• Authentication against LDAP in Spark UI
What’s in Spark 1.1.0 Tech Preview 
• Upgrades Spark to Hive 0.13 
• Provides Hive 0.13 features (new Hive UDFs) in Spark 
• Limited ORC support 
• Ability to manipulate ORC as HadoopRDD 
….. 
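import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable
// Read the ORC-backed Hive table as a HadoopRDD of (NullWritable, OrcStruct) pairs
val inputRead = sc.hadoopFile(
  "hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/orc_table",
  classOf[OrcInputFormat], classOf[NullWritable], classOf[OrcStruct])
// Render each row (the value of the pair) as a string, then bring the rows to the driver
val k = inputRead.map(pair => pair._2.toString)
val c = k.collect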
….. 
Spark Enterprise Readiness 
Enhancements 
Spark Investment Phases 
• Phase 1 
• Hive 0.13 support 
• Limited ORC support 
• Security: Spark certification on Kerberized Cluster 
• Security: Authentication in Spark UI against LDAP/AD 
• Operations: Ambari Stack Definition: Install/Start/Stop/Config/Quick links to Spark UI 
• Phase 2 
• Improve reliability and scale of Spark-on-YARN 
• Enhance ORC support 
• Improve Debug Capabilities 
• Security: Wire Encryption and Authorization with XA/Argus 
• Operations: Spark logs published to YARN Application Timeline Service (ATS) 
• Operations: Enhanced workload management 
Spark on Hadoop 
October 2nd, 2014 
Ram Venkatesh
Spark-on-Hadoop – End User Benefits 
• Developer Productivity 
• Simple, easy to use APIs 
• Direct and elegant representation of the data processing flow 
• Focus on application business logic rather than Hadoop internals 
• Integrated develop-deploy-debug experience through the IDE 
• Multi-tenancy 
• Shared infrastructure across workloads – interactive queries by day, batch ETL at night 
• Better utilization of compute capacity 
• Move the execution to the data tier instead of the other way around 
• Reduced load on distributed filesystem (HDFS) 
• Reduce unnecessary replicated reads and writes 
• Reduced network usage 
• Eliminates the need for data transfer in and out of the cluster 
Spark-on-Hadoop – Design considerations 
• Don’t solve problems that have already been solved. 
– Leverage the discrete, task-based compute model for elasticity, scalability, and fault tolerance 
– Leverage several man-years of work on Hadoop MapReduce data shuffling operations 
– Leverage the proven resource sharing and multi-tenancy model of Hadoop and YARN 
– Leverage the built-in security mechanisms in Hadoop for privacy and isolation 
• Don’t create new problems 
– Preserve the simple developer experience 
– No changes to Spark programs; all programs run unmodified 
– Propose a simple, mainstream, in-the-community extension to the Apache Spark project 
Look to the Future with an eye on the Past
Spark on Hadoop – From service model to app model 
Spark jobs compile down to a Directed Acyclic Graph (DAG). 
• Vertices in the graph represent user logic 
• Edges represent data movement from producers to consumers 
• Spark DAG executed using Apache Tez at runtime 
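The vertex/edge description above can be made concrete with a small data structure. This is a minimal sketch in plain Scala, not the Spark or Tez API: the types Vertex and Edge, the function executionOrder, and the vertex names read, tokenize, and count are all hypothetical. It only shows how a runtime could derive a valid execution order from producer-to-consumer edges.

```scala
object DagSketch {
  // A vertex stands for a piece of user logic; here only its name matters.
  final case class Vertex(name: String)
  // An edge moves data from a producer vertex to a consumer vertex.
  final case class Edge(producer: Vertex, consumer: Vertex)

  // Kahn-style ordering: repeatedly emit the vertices whose producers have all been emitted.
  def executionOrder(vertices: Seq[Vertex], edges: Seq[Edge]): Seq[Vertex] = {
    @annotation.tailrec
    def loop(remaining: Seq[Vertex], left: Seq[Edge], done: Vector[Vertex]): Seq[Vertex] =
      if (remaining.isEmpty) done
      else {
        // Ready vertices have no unsatisfied incoming edge.
        val ready = remaining.filter(v => !left.exists(_.consumer == v))
        require(ready.nonEmpty, "graph has a cycle, so it is not a DAG")
        loop(
          remaining.filterNot(ready.contains),
          left.filterNot(e => ready.contains(e.producer)),
          done ++ ready)
      }
    loop(vertices, edges, Vector.empty)
  }

  // Hypothetical three-vertex job: read -> tokenize -> count.
  val read = Vertex("read")
  val tokenize = Vertex("tokenize")
  val count = Vertex("count")
  val edges = Seq(Edge(read, tokenize), Edge(tokenize, count))

  def main(args: Array[String]): Unit =
    println(executionOrder(Seq(count, read, tokenize), edges).map(_.name).mkString(" -> "))
}
```

Vertices that become ready in the same round have no data dependency on each other, which is what lets a runtime schedule them in parallel.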
[Figure: Distributed Sort as a DAG. Stages: Preprocessor Stage (Sampler), Partition Stage, Aggregate Stage, each running Task-1 and Task-2; edges labeled Samples and Ranges.]
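The Distributed Sort stages on this slide (a Preprocessor stage with a Sampler, a Partition stage, and an Aggregate stage) follow the general pattern of a sample-based range sort. The sketch below shows that pattern in plain Scala collections under stated assumptions: it is not Tez or Spark code, the deterministic every-nth sampler is a simplification, and all names (RangeSortSketch, sample, splitPoints, rangeOf, distributedSort) are hypothetical.

```scala
object RangeSortSketch {
  // Preprocessor stage: sample each input partition (every `step`-th element;
  // a real sampler would likely be randomized).
  def sample(partition: Seq[Int], step: Int): Seq[Int] =
    partition.zipWithIndex.collect { case (x, i) if i % step == 0 => x }

  // Turn the sorted samples into numRanges - 1 split points (the "Ranges").
  def splitPoints(samples: Seq[Int], numRanges: Int): Seq[Int] = {
    val sorted = samples.sorted
    if (sorted.isEmpty) Seq.empty
    else (1 until numRanges).map(i => sorted(i * sorted.size / numRanges))
  }

  // Partition stage: a value goes to the range indexed by how many split points are <= it.
  def rangeOf(x: Int, splits: Seq[Int]): Int = splits.count(_ <= x)

  // Aggregate stage: sort each range locally; concatenating the ranges in order
  // gives a globally sorted result.
  def distributedSort(input: Seq[Seq[Int]], numRanges: Int, step: Int): Seq[Int] = {
    val splits = splitPoints(input.flatMap(sample(_, step)), numRanges)
    val byRange = input.flatten.groupBy(rangeOf(_, splits))
    (0 until numRanges).flatMap(r => byRange.getOrElse(r, Seq.empty).sorted)
  }

  def main(args: Array[String]): Unit =
    println(distributedSort(Seq(Seq(9, 3, 7, 1), Seq(8, 2, 6, 4, 5)), numRanges = 3, step = 2))
}
```

Because the split points come from samples of the actual data, the ranges tend to be balanced even when the key distribution is not known in advance.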
Spark-on-Hadoop – Simplifying Operations 
• No deployments to do. No side effects. Easy and safe to try it out! 
• Completely client side application. 
• Simply upload to any accessible FileSystem and point to the cluster through configuration files. 
• Enables running different versions concurrently. Easy to test new functionality while keeping stable 
versions for production. 
• Leverages YARN local resources. 
[Figure: Client Machines, each running a Spark Client; Node Managers running TezTasks; HDFS holding Spark-v1 and Spark-v2.]
Benefits of native Hadoop execution of Spark DAGs 
• Elastic resource management - dynamic acquisition and release of containers 
• Works with YARN pre-emption, reservation and headroom calculations 
• Auto-parallelism based on sampling – you no longer need to guess the number of reducers 
• Efficient data movement between stages using the Hadoop shuffle 
• Integrates with resource isolation and governance mechanisms in Hadoop 
• Classpath and jarfile management through local resources 
• Detailed job-level metrics through integration with the YARN ATS 
Enables large-scale, multi-tenant batch ETL Spark programs 
Introducing SPARK-3561 
DEMO: SPARK-3561 in action 
SPARK-3561 under the hood 
Example program: 
SPARK-3561 Demo – contd. 
Execute program using spark-submit 
spark-submit --class dev.demo.WordCount \
  --master execution-context:org.apache.spark.tez.TezJobExecutionContext \
  spark-on-hadoop-1.0.jar 1 test.txt 
Execute interactive Spark commands through spark-shell 
spark-shell --master execution-context:org.apache.spark.tez.TezJobExecutionContext 
INFO main spark.SparkContext:59 - Will use custom job execution context org.apache.spark.tez.TezJobExecutionContext 
INFO main adapter.SparkToTezAdapter:59 - Adapting PairRDDFunctions.saveAsNewAPIHadoopDataset for Tez 
INFO main repl.SparkILoop:59 - Created spark context.. 
Spark context available as sc. 
scala> 
SPARK-3561 – feedback requested 
Provide feedback on your ETL/batch scenarios 
Participate in the discussion on the JIRA 
Try it out when it becomes available 
Looking for early adopters to run and validate at scale 
Resources 
• Spark Labs Page : http://hortonworks.com/hadoop/spark/ 
• Spark Roadmap Blog : http://hortonworks.com/blog/extending-spark-yarn-enterprise-hadoop/ 
• Spark 1.1.0 Tech Preview : http://hortonworks.com/kb/spark-1-1-0-technical-preview-hdp-2-1-5/ 
• Public Spark Forums : http://hortonworks.com/community/forums/forum/spark/ 
• SPARK-3561 : https://issues.apache.org/jira/browse/SPARK-3561 
Q&A… 
Discussion 
YARN Ready: Apache Spark

  • 1. Spark Webinar October 2nd, 2014 Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Vinay Shukla & Ram Venkatesh
  • 2. Agenda
      • What is Spark?
      • What have we done with Spark so far
      • Tech Previews
      • Brief on Spark 1.1.0 Tech Preview
      • Multi tenant & multi workload with YARN
      • Introducing SPARK-3561
      • Get Involved
  • 3. Let’s Talk About Apache Spark
      What is Spark?
      • Spark is a general-purpose big data engine that provides simple APIs for data scientists and engineers familiar with Scala, Python and Java to build ad-hoc interactive analytics, iterative machine-learning, and other use cases well-suited to interactive, in-memory data processing of GB to TB sized datasets.
  • 4. What is Spark?
      [Diagram: the Spark stack (Spark API, Spark Compiler / Optimizer, DAG Runtime, Execution Engine) running on a Spark Cluster, YARN or Mesos. Client side: DAGScheduler and ActiveJob. Cluster side: SparkAM and Tasks. An RDD lineage of (Hadoop|FlatMapped|Filter|MapPartitions|Shuffled)RDD is split into two stages:
          stage0: (Hadoop|FlatMapped|Filter|MapPartitions)RDD, run as ShuffleMapTasks (flatMap | map)
          stage1: ShuffledRDD, run as a ResultTask (reduceByKey)]
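The stage split on slide 4 can be illustrated without a cluster. The following is a minimal plain-Scala sketch, not Spark code: it assumes a word count job, models each input partition as a `Seq[String]`, and uses hypothetical names (`StageSketch`, `stage0`, `stage1`, `MapOutput`) that do not exist in Spark. Stage 0 plays the role of the ShuffleMapTasks (flatMap and map, then bucketing by key hash), and stage 1 plays the role of the ResultTask (reduceByKey).

```scala
// Plain-Scala sketch of the two stages on the slide; no Spark dependency.
// StageSketch, stage0, stage1 and MapOutput are hypothetical names for illustration.
object StageSketch {
  // Output of one map-side task: reducer index -> (word, 1) pairs for that reducer.
  type MapOutput = Map[Int, Seq[(String, Int)]]

  // Stage 0 (ShuffleMapTask): flatMap + map inside each input partition, then
  // bucket every (word, 1) pair by the hash of its key, one bucket per reducer.
  def stage0(input: Seq[Seq[String]], numReducers: Int): Seq[MapOutput] =
    input.map { lines =>
      lines
        .flatMap(_.split("\\s+").filter(_.nonEmpty).toSeq) // flatMap: line -> words
        .map(word => (word, 1))                            // map: word -> (word, 1)
        .groupBy { case (word, _) => Math.floorMod(word.hashCode, numReducers) }
    }

  // Stage 1 (ResultTask): each reducer pulls its bucket from every map output
  // and sums the counts per key, as reduceByKey(_ + _) would.
  def stage1(mapOutputs: Seq[MapOutput], numReducers: Int): Map[String, Int] =
    (0 until numReducers).flatMap { reducer =>
      val pulled = mapOutputs.flatMap(_.getOrElse(reducer, Seq.empty))
      pulled.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2).sum }
    }.toMap

  def wordCount(input: Seq[Seq[String]], numReducers: Int): Map[String, Int] =
    stage1(stage0(input, numReducers), numReducers)

  def main(args: Array[String]): Unit =
    println(wordCount(Seq(Seq("a b a", "c"), Seq("b a")), numReducers = 2))
}
```

The boundary between `stage0` and `stage1` is the point where data has to move between tasks, which is why the scheduler cuts the lineage at the ShuffledRDD.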
  • 5. Let’s Talk About Apache Spark (cont’d)
      What’s Our Spark Strategy?
      • Hortonworks is focused on enabling Spark for Enterprise Hadoop so users can deploy Spark-based applications along with their other Hadoop workloads in a consistent, predictable, and robust way.
          – Leverage the scale and multi-tenancy provided by YARN so that Spark’s memory- and CPU-intensive apps can run with predictable performance
          – Integrate Spark with HDP’s operations, security, governance, scalability, availability, and multi-tenancy capabilities
      Do We Have a Plan to Support Spark? Yes.
      • Spark is available now as a Technology Preview.
      • We are working our standard process of Tech Preview -> GA. We did this for Storm, Falcon, etc.
      • Spark will be added to our HDP Enterprise Plus subscription when it’s GA ready
  • 6. Spark Timeline Break-down
  • 7. Spark Roadmap 2014
      [Timeline: JULY: 1.0.1 TP Refresh; SEPT: 1.1.0 TP Refresh; DEC: 1.2.0 GA]
      • Hive 0.13 support
      • Limited ORC support
      • Spark on YARN: Deployment Best Practices
      • Ambari Support for Spark Install/Stop/Config
      • Spark on Kerberized Cluster
      • Authentication against LDAP in Spark UI
  • 8. What’s in Spark 1.1.0 Tech Preview
      • Upgrades Spark to Hive 0.13
      • Provides Hive 0.13 features (new Hive UDFs) in Spark
      • Limited ORC support
      • Ability to manipulate ORC as HadoopRDD:
          …..
          // Read the Hive ORC table as a HadoopRDD of (NullWritable, OrcStruct) pairs
          val inputRead = sc.hadoopFile("hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/orc_table", classOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFormat], classOf[org.apache.hadoop.io.NullWritable], classOf[org.apache.hadoop.hive.ql.io.orc.OrcStruct])
          // Keep only the value of each pair, rendered as a string
          val k = inputRead.map(pair => pair._2.toString)
          // Bring the rows back to the driver
          val c = k.collect
          …..
  • 9. Spark Enterprise Readiness Enhancements
  • 10. Spark Investment Phases
      • Phase 1
          – Hive 0.13 support
          – Limited ORC support
          – Security: Spark certification on Kerberized Cluster
          – Security: Authentication in Spark UI against LDAP/AD
          – Operations: Ambari Stack Definition: Install/Start/Stop/Config/Quick links to Spark UI
      • Phase 2
          – Improve reliability & scale of Spark-on-YARN
          – Enhance ORC support
          – Improve debug capabilities
          – Security: Wire Encryption and Authorization with XA/Argus
          – Operations: Spark logs published to YARN Application Timeline Service (ATS)
          – Operations: Enhanced workload management
  • 11. Spark on Hadoop. October 2nd, 2014. Ram Venkatesh
  • 12. Spark-on-Hadoop – End User Benefits
      • Developer Productivity
      • Simple, easy to use APIs
      • Direct and elegant representation of the data processing flow
      • Focus on application business logic rather than Hadoop internals
      • Integrated develop-deploy-debug experience through the IDE
      • Multi-tenancy
      • Shared infrastructure across workloads – interactive queries by day, batch ETL at night
      • Better utilization of compute capacity
      • Move the execution to the data tier instead of the other way around
      • Reduced load on distributed filesystem (HDFS)
      • Reduce unnecessary replicated reads and writes
      • Reduced network usage
      • Eliminates the need for data transfer in and out of the cluster
  • 13. Spark-on-Hadoop – Design considerations
      Look to the Future with an eye on the Past
      • Don’t solve problems that have already been solved.
          – Leverage discrete task based compute model for elasticity, scalability and fault tolerance
          – Leverage several man-years of work in Hadoop Map-Reduce data shuffling operations
          – Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN
          – Leverage built-in security mechanisms in Hadoop for privacy and isolation
      • Don’t create new problems
          – Preserve the simple developer experience
          – No changes to Spark programs, all programs run unmodified
          – Propose simple, mainstream in-the-community extension to the Apache Spark project
  • 14. Spark on Hadoop – From service model to app model
      Spark jobs compile down to a Directed Acyclic Graph (DAG).
      • Vertices in the graph represent user logic
      • Edges represent data movement from producers to consumers
      • Spark DAG executed using Apache Tez at runtime
      [Diagram: Distributed Sort as a DAG. Preprocessor Stage, Partition Stage and Aggregate Stage each run Task-1 and Task-2, with a Sampler and edges labeled Samples and Ranges.]
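The vertex and edge model on slide 14 maps onto a small data structure. The sketch below is plain Scala and is not the Tez or Spark API: `DagSketch`, `Edge` and `executionOrder` are hypothetical names, and the edges used in `main` are an assumed reading of the distributed-sort diagram. It orders vertices so that every producer runs before its consumers (Kahn's algorithm) and returns `None` when the graph has a cycle and is therefore not a DAG.

```scala
import scala.collection.mutable

// Hypothetical sketch of a job DAG: vertices are user logic, edges are data movement.
object DagSketch {
  final case class Edge(producer: String, consumer: String)

  // Kahn's algorithm: repeatedly emit a vertex whose producers have all been emitted.
  // Every edge endpoint must be listed in `vertices`, otherwise the lookup throws.
  def executionOrder(vertices: Seq[String], edges: Seq[Edge]): Option[Seq[String]] = {
    val inDegree = mutable.Map(vertices.map(v => v -> 0): _*)
    edges.foreach(e => inDegree(e.consumer) += 1)

    val ready = mutable.Queue(vertices.filter(v => inDegree(v) == 0): _*)
    val order = mutable.ArrayBuffer.empty[String]

    while (ready.nonEmpty) {
      val v = ready.dequeue()
      order += v
      for (e <- edges if e.producer == v) {
        inDegree(e.consumer) -= 1
        if (inDegree(e.consumer) == 0) ready.enqueue(e.consumer)
      }
    }
    // If some vertex never became ready, the graph contains a cycle.
    if (order.size == vertices.size) Some(order.toList) else None
  }

  def main(args: Array[String]): Unit = {
    // Assumed wiring of the distributed-sort diagram, for illustration only.
    val vertices = Seq("Aggregate", "Partition", "Sampler", "Preprocessor")
    val edges = Seq(
      Edge("Preprocessor", "Sampler"),   // samples
      Edge("Preprocessor", "Partition"), // full data
      Edge("Sampler", "Partition"),      // ranges
      Edge("Partition", "Aggregate")     // partitioned data
    )
    println(executionOrder(vertices, edges))
  }
}
```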
  • 15. Spark-on-Hadoop – Simplifying Operations
      • No deployments to do. No side effects. Easy and safe to try it out!
      • Completely client side application.
      • Simply upload to any accessible FileSystem and point to the cluster through configuration files.
      • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
      • Leverages YARN local resources.
      [Diagram: Spark Clients on two Client Machines, Spark-v1 and Spark-v2 stored in HDFS, and TezTasks running on two Node Managers.]
  • 16. Benefits of native Hadoop execution of Spark DAGs
      • Elastic resource management - dynamic acquisition and release of containers
      • Works with YARN pre-emption, reservation and headroom calculations
      • Auto-parallelism based on sampling – you no longer need to guess the number of reducers
      • Efficient data movement between stages using the Hadoop shuffle
      • Integrates with resource isolation and governance mechanisms in Hadoop
      • Classpath and jarfile management through local resources
      • Detailed job-level metrics through integration with the YARN ATS
      Enables large-scale, multi-tenant batch ETL Spark programs
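To make "auto-parallelism based on sampling" concrete, here is a minimal plain-Scala sketch of the general idea. It is not the algorithm Tez or SPARK-3561 uses: `SamplingSketch`, its methods and the `targetBytesPerReducer` knob are hypothetical. It estimates the total input size from a sample to pick a reducer count, then derives range boundaries from the sampled keys so that a distributed sort sends roughly equal shares to each reducer.

```scala
// Hypothetical sketch of sampling-driven parallelism; not the Tez implementation.
object SamplingSketch {
  // Scale the sampled volume up to an estimated total, then divide by the
  // volume we would like each reducer to handle (an assumed tuning knob).
  def reducerCount(sampledBytes: Long, sampleFraction: Double, targetBytesPerReducer: Long): Int = {
    require(sampleFraction > 0.0 && sampleFraction <= 1.0, "sampleFraction must be in (0, 1]")
    require(targetBytesPerReducer > 0, "targetBytesPerReducer must be positive")
    val estimatedTotal = sampledBytes / sampleFraction
    math.max(1, math.ceil(estimatedTotal / targetBytesPerReducer).toInt)
  }

  // Pick numPartitions - 1 upper bounds at evenly spaced ranks of the sorted sample.
  def rangeBounds(sample: Seq[Int], numPartitions: Int): Seq[Int] = {
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty || numPartitions <= 1) Seq.empty
    else (1 until numPartitions).map { i =>
      sorted(math.max(0, i * sorted.size / numPartitions - 1))
    }
  }

  // A key goes to the first range whose upper bound is >= the key;
  // keys above every bound go to the last partition.
  def partitionFor(key: Int, bounds: Seq[Int]): Int = {
    val idx = bounds.indexWhere(bound => key <= bound)
    if (idx >= 0) idx else bounds.size
  }

  def main(args: Array[String]): Unit = {
    val n = reducerCount(sampledBytes = 256L, sampleFraction = 0.25, targetBytesPerReducer = 100L)
    val bounds = rangeBounds(Seq(9, 1, 5, 3, 7, 2, 8, 4, 6, 10), numPartitions = 4)
    println(s"reducers=$n bounds=$bounds")
  }
}
```

Because the boundaries come from the observed key distribution rather than a fixed guess, skewed inputs still end up in ranges of similar size, provided the sample is representative.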
  • 17. Introducing SPARK-3561
  • 18. DEMO: SPARK-3561 in action
  • 19. SPARK-3561 under the hood. Example program:
  • 20. SPARK-3561 Demo – contd.
      Execute program using spark-submit:
          spark-submit --class dev.demo.WordCount --master execution-context:org.apache.spark.tez.TezJobExecutionContext spark-on-hadoop-1.0.jar 1 test.txt
      Execute interactive Spark commands through spark-shell:
          spark-shell --master execution-context:org.apache.spark.tez.TezJobExecutionContext
          INFO main spark.SparkContext:59 - Will use custom job execution context org.apache.spark.tez.TezJobExecutionContext
          INFO main adapter.SparkToTezAdapter:59 - Adapting PairRDDFunctions.saveAsNewAPIHadoopDataset for Tez
          INFO main repl.SparkILoop:59 - Created spark context..
          Spark context available as sc.
          scala>
  • 21. SPARK-3561 – feedback requested
      • Provide feedback on your ETL/batch scenarios
      • Participate in the discussion on the JIRA
      • Try it out when it becomes available
      • Looking for early adopters to run and validate at scale
  • 22. Resources
      • Spark Labs Page: http://hortonworks.com/hadoop/spark/
      • Spark Roadmap Blog: http://hortonworks.com/blog/extending-spark-yarn-enterprise-hadoop/
      • Spark 1.1.0 Tech Preview: http://hortonworks.com/kb/spark-1-1-0-technical-preview-hdp-2-1-5/
      • Public Spark Forums: http://hortonworks.com/community/forums/forum/spark/
      • SPARK-3561: https://issues.apache.org/jira/browse/SPARK-3561
  • 23. Q&A… Discussion