Lessons Learned from Deploying Apache Spark
as a Service on IBM Power Systems in the Cloud
Indrajit (I.P) Poddar, STSM, IBM Systems Technical Strategy
Randy Swanberg, DE, IBM Power Systems Software and Solutions
Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole
discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in
making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any
material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
• Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual
throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the
amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Agenda
• Infrastructure considerations for a differentiated cloud data service
• Apache Spark – a popular big data framework
• Lessons learned on an alternative infrastructure with OpenPOWER systems
1. Open source stack for cloud native agile development
2. Management stack for automation and continuous integration
3. Efficient resource allocation and scheduling for multi-tenancy
4. Cloud infrastructure economics
5. Potential of acceleration under the cloud service
• Putting it all together
• Summary / Questions
Infrastructure for a differentiated Cloud Data Service
1. Agile Open Source development
experience
• Dynamic and flexible provisioning and
management
• Automated deployment and continuous
integration
2. Cost effective high performance
server infrastructure
3. Economical cloud storage service
with encryption
Big Data in the Cloud
Example use cases, architectures and components
Big Data Journey
Operations Data
Warehouse
Insight Inspired
Decision Making
Insight Driven
Business
Transformation
Value
Big-Data Maturity
• Cheaper Storage
• Data Lake
• ETL Offload
• Cold Data Offload
• Queryable Archive
• Full Data Analysis (not just samples)
• Extract Value from non-relational data
• View of all enterprise data
• Exploratory Analysis and Discovery
• New Business Models
• Real time risk aware decision making
• Real time fraud and threat detection
• Optimize operations
• Attract and Retain Customers
Most are somewhere here
Demand for Business Value from ALL Data Sources
Transaction and
application data
Machine,
sensor data
Enterprise
content
Image, geospatial,
video
Social data
Third-party data
Deep
Analytics
data zone
EDW and
data mart
zone
New Customer
Insights
Discover
Relationship
Risk Aware
Decisions
Early Warnings
New Business
Opportunities
Fraud Detection
What is Apache Spark?
• Unified Analytics Platform
– Combine streaming, graph, machine learning and SQL analytics on a single platform
– Simplified, multi-language programming model
– Interactive and Batch
• In-Memory Design
– Pipelines multiple iterations on a single copy of data in memory
– Superior Performance
– Natural Successor to MapReduce
Fast and general engine for
large-scale data processing
Spark Core API
R Scala SQL Python Java
Spark SQL Streaming MLlib GraphX
Anatomy of Apache Spark as a
Cloud Service
Stack components, web services, continuous integration
Interactive IPython notebook with Matplotlib GUI
Host interactive analytic apps on an IPython server to ease code sharing and reuse
Architectural components of an Apache Spark Cloud Service
Object Store for data
Platform as a Service
Multi-tenant Spark drivers and executors
Multi-tenant interactive Jupyter IPython notebook servers with Matplotlib GUI
Shared Compute Cluster
Infrastructure as a Service
Bare metal servers, Virtual Machines
Continuous Integration: Build, Deploy, Test
Shared storage: IBM Spectrum Scale (GPFS)
A prototype deployment on POWER systems
Goal: Production deployment
IBM Power System S812LC and S822LC, Tyan OpenPOWER
Development environment
Continuous Integration
Deployment Automation
Bare metals
Virtual Machines
Docker Containers
Docker containers
Lessons Learned
Open source stack, efficient resource allocation and continuous integration, better economics and potential for acceleration on OpenPOWER systems
Analytics open source stack for agile development
OpenStack on POWER for Dev-Test environments
We used IBM Cloud Manager version 4.3 with OpenStack Kilo on PowerKVM 2.1.1
IBM Cloud Orchestrator is another option
Continuous Integration with IBM UrbanCode, OpenStack and Docker targeting POWER systems
Create multiple deployment and development environments and visual deployment processes in IBM UrbanCode Deploy
Run only the UCD agents and relay on POWER VMs or bare-metal servers
Continuous Integration Flows
UrbanCode Deploy Server (x86)
Git Server (x86)
Asset Repository (x86)
Dev-Test Env (OpenPOWER OpenStack VMs)
Build Env (POWER Docker Containers)
Future Production Env (OpenPOWER bare metal)
1. Check in automation code
2. Build artifacts in Docker
3. Store built artifacts in a repository
4. Pull artifacts into deployment automation
5. Deploy artifacts into dev-test env
6. Deploy artifacts into prod env
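The six steps above can be sketched as an ordered pipeline in which each stage runs only if the previous one succeeded. This is a minimal illustration and not the UrbanCode Deploy API: the `Stage` type, `run_pipeline`, `make_stages`, and the placeholder actions are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    action: Callable[[], bool]  # hypothetical hook: returns True on success


# Stage names mirror the six steps of the flow; the actions are placeholders.
STEPS = [
    "check in automation code",
    "build artifacts in Docker",
    "store built artifacts in a repository",
    "pull artifacts into deployment automation",
    "deploy artifacts into dev-test env",
    "deploy artifacts into prod env",
]


def make_stages(failing: str = "") -> List[Stage]:
    """Build the stage list; the stage named `failing` (if any) reports failure."""
    return [Stage(name, (lambda n=name: n != failing)) for name in STEPS]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for stage in stages:
        if not stage.action():
            break
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    for name in run_pipeline(make_stages()):
        print("done:", name)
```

Stopping at the first failed stage keeps a broken build from reaching the dev-test or production environments, which is the usual reason for ordering the flow this way.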
Efficient resource allocation using Platform Symphony
• Share system resources (CPU, memory) with a distributed scheduler
• Platform Symphony with Application Service Controller (ASC) V7.1.1 and the EGO scheduler for Ubuntu 14.04.2 on POWER
• Platform Symphony + ASC features:
– Fine-grained scheduling
– Resource reclaim
– Standby service
– Data locality
• Shared storage through IBM Spectrum Scale™
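The general idea behind sharing resources with reclaim can be illustrated with a toy slot allocator. This is not Platform Symphony's or EGO's actual algorithm; the `allocate` function, the tenant names, and the slot counts are assumptions made up for this sketch. Each tenant owns a share of slots, idle slots are lent to tenants with unmet demand, and recomputing the allocation after an owner's demand rises hands the lent slots back.

```python
from typing import Dict


def allocate(owned: Dict[str, int], demand: Dict[str, int]) -> Dict[str, int]:
    """Toy slot allocation with lending and reclaim.

    Each tenant first receives up to its owned share. Slots that an owner
    is not using are then lent to tenants whose demand exceeds their share.
    """
    # Step 1: every tenant gets what it needs, capped at what it owns.
    alloc = {t: min(owned[t], demand.get(t, 0)) for t in owned}
    # Step 2: lend the unused slots to tenants that still want more.
    spare = sum(owned.values()) - sum(alloc.values())
    for t in sorted(owned):  # fixed order keeps the sketch deterministic
        if spare == 0:
            break
        extra = min(spare, demand.get(t, 0) - alloc[t])
        alloc[t] += extra
        spare -= extra
    return alloc


if __name__ == "__main__":
    owned = {"tenant_a": 4, "tenant_b": 4}
    # tenant_a is mostly idle, so tenant_b borrows its unused slots.
    print(allocate(owned, {"tenant_a": 1, "tenant_b": 7}))
    # tenant_a becomes busy again, so its slots are reclaimed.
    print(allocate(owned, {"tenant_a": 4, "tenant_b": 7}))
```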
Economics of the Server Infrastructure
• Attributes directly influencing the economics of hosting a Cloud Service:
– Number of servers needed to deliver a competitive quality of service and response time
– Cost of the individual servers or rental
– Number of users and jobs that can be hosted concurrently (multi-tenancy)
– End user price charged for the service
• All of which are directly impacted by the performance of the server
Measuring Performance of Spark on POWER
The following charts compare the performance of multiple Spark workloads from SparkBench (https://github.com/SparkTC/spark-bench), using data sizes from 100GB to 10TB, on two clusters:
7-node cluster of Intel Haswell servers
• E5-2620 V3
• 12-core
• 256GB
vs
7-node cluster of POWER servers
• POWER8 S812LC
• 10-core
• 256GB
• Machine Learning (Spark MLlib)
• Matrix Factorization
• Logistic Regression
• Support Vector Machine
• SQL (Spark SQL)
sqlContext.sql("SELECT COUNT(*) FROM orderTab").count()
sqlContext.sql("SELECT COUNT(*) FROM orderTab where bid>5000").count()
sqlContext.sql("SELECT * FROM oitemTab WHERE price>250").count()
sqlContext.sql("SELECT * FROM oitemTab WHERE price>500").count()
sqlContext.sql("SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid").count()
• Graph (Spark GraphX)
• Page Rank
• Triangle Count
• Singular Value Decomp++
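To see what the five SQL queries listed above compute, here is a small sketch that runs the same statements with Python's built-in `sqlite3` module instead of Spark. Only the table names and the columns `oid`, `bid`, and `price` come from the queries; the column types and every row are invented for illustration. The helper counts the rows of each result, mirroring the `.count()` call on a Spark DataFrame, so the two `COUNT(*)` queries each yield a single-row result.

```python
import sqlite3

QUERIES = [
    "SELECT COUNT(*) FROM orderTab",
    "SELECT COUNT(*) FROM orderTab WHERE bid > 5000",
    "SELECT * FROM oitemTab WHERE price > 250",
    "SELECT * FROM oitemTab WHERE price > 500",
    "SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid",
]


def make_db() -> sqlite3.Connection:
    """Create tiny in-memory tables; schema types and rows are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orderTab (oid INTEGER, bid INTEGER)")
    conn.execute("CREATE TABLE oitemTab (oid INTEGER, price REAL)")
    conn.executemany(
        "INSERT INTO orderTab VALUES (?, ?)",
        [(1, 4000), (2, 6000), (3, 9000)],
    )
    conn.executemany(
        "INSERT INTO oitemTab VALUES (?, ?)",
        [(1, 100.0), (1, 300.0), (2, 600.0), (3, 700.0)],
    )
    return conn


def result_row_counts(conn: sqlite3.Connection) -> list:
    """Return the number of result rows for each query, like DataFrame.count()."""
    return [len(conn.execute(q).fetchall()) for q in QUERIES]


if __name__ == "__main__":
    for query, rows in zip(QUERIES, result_row_counts(make_db())):
        print(rows, "row(s):", query)
```

The two scans and the join return result sets whose size depends on the data, which is where most of the work in such a benchmark goes.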
System Performance of Spark on POWER
Raw System Performance
[Chart: system performance of POWER relative to Haswell for Machine Learning, SQL, and Graph workloads; 1.7X on average]
Options for the Service Provider:
• Deliver higher qualities of service
• 70% faster job completion
times on average
• Faster time-to-insight
• Charge higher premium for the service
• Competitive advantage for the service
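A short worked calculation helps relate a relative-performance figure to job completion time. It assumes that 1.7X means the baseline runtime divided by the POWER runtime, which the slides do not state explicitly; the function names are illustrative.

```python
def runtime_fraction(relative_performance: float) -> float:
    """Fraction of the baseline runtime needed at a given relative performance."""
    return 1.0 / relative_performance


def runtime_reduction_pct(relative_performance: float) -> float:
    """Percentage reduction in runtime compared with the baseline."""
    return (1.0 - runtime_fraction(relative_performance)) * 100.0


if __name__ == "__main__":
    ratio = 1.7  # average relative system performance quoted in the deck
    print(f"runtime fraction: {runtime_fraction(ratio):.3f}")
    print(f"runtime reduction: {runtime_reduction_pct(ratio):.1f}%")
```

Under that assumption, a 1.7X system completes 70% more work in the same time, and an individual job finishes in roughly 59% of the baseline time.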
Per Core Performance of Spark on POWER
Per-core Performance View
• 70 POWER8 Cores vs. 84 Intel Cores
• Enables headroom for better system
resource utilization
[Chart: per-core relative performance for Machine Learning, SQL, and Graph workloads; headline figure 2X]
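The per-core figure is consistent with normalizing the 1.7X system-level ratio by the core counts of the two 7-node clusters (7 × 12 = 84 Intel cores, 7 × 10 = 70 POWER8 cores). The deck does not show its derivation, so the sketch below is one plausible reading, with an illustrative function name.

```python
def per_core_ratio(system_ratio: float, baseline_cores: int, candidate_cores: int) -> float:
    """Scale a system-level performance ratio to a per-core ratio.

    Fewer cores delivering the same system-level result means each core
    did proportionally more work.
    """
    return system_ratio * baseline_cores / candidate_cores


if __name__ == "__main__":
    nodes = 7
    intel_cores = nodes * 12   # 84 cores in the Haswell cluster
    power_cores = nodes * 10   # 70 cores in the POWER8 cluster
    print(f"{per_core_ratio(1.7, intel_cores, power_cores):.2f}X per core")
```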
Price Performance of Spark on POWER
Price Performance View*
[Chart: price performance of POWER relative to Haswell for Machine Learning, SQL, and Graph workloads; headline figure 1.5X]
Options for the Service Provider:
• Spend 33% less on infrastructure supporting the same amount of workload
• Spend the same on infrastructure but host 50% more workload
• Lower the price for the service for competitive advantage
* Based on preliminary SoftLayer pricing targets; subject to change
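The 33% and 50% figures follow directly from a 1.5X price-performance ratio. A minimal sketch of that arithmetic, assuming the ratio means workload delivered per unit of spend relative to the baseline (function names are illustrative):

```python
def spend_reduction_pct(price_performance: float) -> float:
    """Percentage less spend needed to support the same workload."""
    return (1.0 - 1.0 / price_performance) * 100.0


def extra_workload_pct(price_performance: float) -> float:
    """Percentage more workload hosted for the same spend."""
    return (price_performance - 1.0) * 100.0


if __name__ == "__main__":
    ratio = 1.5  # price-performance advantage quoted in the deck
    print(f"same workload: {spend_reduction_pct(ratio):.0f}% less spend")
    print(f"same spend: {extra_workload_pct(ratio):.0f}% more workload")
```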
But wait... there's more to the story
[Chart: relative system performance (y-axis 0 to 3, E5-2620 v3 as baseline) by Spark workload: 100GB Mat.Fact., 100GB (in mem) LR, 1TB (in mem) LR, 1TB (50/50) LR, 1TB SVM, 10TB LR, 1TB 5 query, 2TB 5 query, 130GB Page Rank, 1TB Triangle Cnt, 1TB SVD++, AVERAGE]
• A deeper look at the system performance profile for one of the workloads close to our overall average relative performance (1.7X across Machine Learning, SQL, and Graph)
• Machine Learning Logistic Regression on a 1TB data set, which had a relative performance of 1.74X
More efficient use of resources (Spark 1TB Logistic Regression Example)
[Charts: CPU utilization (User%, Sys%, Wait%; 0 to 100) over time for POWER (22:44 to 22:55) and for Haswell (10:14 to 10:28); Network I/O in MB/sec (Total-Read, with Total-Write plotted as negative values) over the same intervals for POWER (axis -1500 to 1000) and for Haswell (axis -600 to 400)]
POWER
• CPU headroom to host higher density
• More data pushed over network due to higher thread density
Haswell
• CPU fully pegged on just this workload
• Underutilizing the network resource
Next Steps - Acceleration in the Cloud
[Chart: "x Degrees of Separation on Spark": Runtime (ms, 0 to 400,000) vs. Total Heap Memory, comparing Disk and CAPI/Flash]
CAPI Flash for RDD Cache = 4X memory reduction at equal performance
RDMA for Spark Shuffle = 30% better response time, lower CPU utilization
• CAPI Flash and RDMA can be leveraged transparently to Spark applications under the Cloud Service
• Coming: HDFS CAPI FPGA erasure code acceleration, CAPI FPGA compression acceleration, ...
Acceleration of Spark with GPUs:
• Adverse Drug Reaction Prediction built on Spark
• 25X speedup for the model-building stage (using Spark MLlib Logistic Regression)
• Again, Transparent to the Spark Application
• Game changer for Personalized Medicine
More efficient, cost effective, balanced cloud resources
• Better quality of service through workload acceleration and real time insights
• Efficient scale out architecture avoiding imbalanced resources
• New controls to balance resource utilization
GPUs and FPGAs for Compute offload, consolidation, specialized acceleration
CAPI Flash for Memory consolidation/expansion, and Storage acceleration
RDMA for better latency, better network utilization and lower CPU utilization
Fostering Acceleration in the Cloud: SuperVessel for OpenPOWER
ptopenlab.com
Putting it All Together….
Object Store for data
Platform as a Service
Multi-tenant Spark drivers and executors
Multi-tenant interactive Jupyter IPython notebook servers with Matplotlib GUI
Shared Compute Cluster
Infrastructure as a Service
Bare metal servers, Virtual Machines
Continuous Integration: Build, Deploy, Test
Shared storage: IBM Spectrum Scale (GPFS)
Docker
Summary
• Big Data solutions in the Cloud demand elasticity and scale
• Real time insights from all sources of data will become the norm
• Try open source Apache Spark with IBM Platform Symphony
• OpenPOWER systems can differentiate your cloud data service through:
• Improved cloud infrastructure economics and cost performance advantage
• An agile open source development experience
• Advanced forms of acceleration in cloud infrastructures will further differentiate services
Notices and Disclaimers
Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission
from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of
initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS
DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE
USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY.
IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers
have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in
which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials
and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or
their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and
interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such
laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
Notices and Disclaimers (Cont'd)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not
tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the
ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT
NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual
property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®,
FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG,
Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®,
PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®,
StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business
Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
Thank You
Your Feedback is Important!
Access the InterConnect 2016 Conference Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.

Weitere ähnliche Inhalte

Was ist angesagt?

Delivering Container-based Apps to IoT Edge devices
Delivering Container-based Apps to IoT Edge devicesDelivering Container-based Apps to IoT Edge devices
Delivering Container-based Apps to IoT Edge devicesAjeet Singh Raina
 
OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...
OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...
OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...NETWAYS
 
DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...
DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...
DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...Docker, Inc.
 
Tesla Accelerated Computing Platform
Tesla Accelerated Computing PlatformTesla Accelerated Computing Platform
Tesla Accelerated Computing Platforminside-BigData.com
 
Building analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernelsBuilding analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernelsLuciano Resende
 
HA Kubernetes on Mesos / Marathon
HA Kubernetes on Mesos / MarathonHA Kubernetes on Mesos / Marathon
HA Kubernetes on Mesos / MarathonCobus Bernard
 
DCSF19 Containers for Beginners
DCSF19 Containers for BeginnersDCSF19 Containers for Beginners
DCSF19 Containers for BeginnersDocker, Inc.
 
Mesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overviewMesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overviewKrishna-Kumar
 
[Open infra] how to calculate the cloud system operating rate
[Open infra] how to calculate the cloud system operating rate[Open infra] how to calculate the cloud system operating rate
[Open infra] how to calculate the cloud system operating rateNalee Jang
 
Scaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloadsScaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloadsLuciano Resende
 
Autoscaling Kubernetes
Autoscaling KubernetesAutoscaling Kubernetes
Autoscaling Kubernetescraigbox
 
Webinar: OpenStack Benefits for VMware
Webinar: OpenStack Benefits for VMwareWebinar: OpenStack Benefits for VMware
Webinar: OpenStack Benefits for VMwarePlatform9
 
Optimizing Cloud Foundry and OpenStack for large scale deployments
Optimizing Cloud Foundry and OpenStack for large scale deploymentsOptimizing Cloud Foundry and OpenStack for large scale deployments
Optimizing Cloud Foundry and OpenStack for large scale deploymentsAnimesh Singh
 
DCEU 18: Automating Docker Enterprise: Hands-off Install and Upgrade
DCEU 18: Automating Docker Enterprise: Hands-off Install and UpgradeDCEU 18: Automating Docker Enterprise: Hands-off Install and Upgrade
DCEU 18: Automating Docker Enterprise: Hands-off Install and UpgradeDocker, Inc.
 
Kubernetes Concepts And Architecture Powerpoint Presentation Slides
Kubernetes Concepts And Architecture Powerpoint Presentation SlidesKubernetes Concepts And Architecture Powerpoint Presentation Slides
Kubernetes Concepts And Architecture Powerpoint Presentation SlidesSlideTeam
 
Using Docker For Development
Using Docker For DevelopmentUsing Docker For Development
Using Docker For DevelopmentLaura Frank Tacho
 

Was ist angesagt? (20)

Delivering Container-based Apps to IoT Edge devices
Delivering Container-based Apps to IoT Edge devicesDelivering Container-based Apps to IoT Edge devices
Delivering Container-based Apps to IoT Edge devices
 
OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...
OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...
OSDC 2018 | Lifecycle of a resource. Codifying infrastructure with Terraform ...
 
DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...
DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...
DCEU 18: Use Cases and Practical Solutions for Docker Container Storage on Sw...
 
Tesla Accelerated Computing Platform
Tesla Accelerated Computing PlatformTesla Accelerated Computing Platform
Tesla Accelerated Computing Platform
 
OpenStack Icehouse Overview
OpenStack Icehouse OverviewOpenStack Icehouse Overview
OpenStack Icehouse Overview
 
Building analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernelsBuilding analytical microservices powered by jupyter kernels
Building analytical microservices powered by jupyter kernels
 
HA Kubernetes on Mesos / Marathon
HA Kubernetes on Mesos / MarathonHA Kubernetes on Mesos / Marathon
HA Kubernetes on Mesos / Marathon
 
DCSF19 Containers for Beginners
DCSF19 Containers for BeginnersDCSF19 Containers for Beginners
DCSF19 Containers for Beginners
 
Mesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overviewMesos and Kubernetes ecosystem overview
Mesos and Kubernetes ecosystem overview
 
[Open infra] how to calculate the cloud system operating rate
[Open infra] how to calculate the cloud system operating rate[Open infra] how to calculate the cloud system operating rate
[Open infra] how to calculate the cloud system operating rate
 
Scaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloadsScaling notebooks for Deep Learning workloads
Scaling notebooks for Deep Learning workloads
 
Autoscaling Kubernetes
Autoscaling KubernetesAutoscaling Kubernetes
Autoscaling Kubernetes
 
Webinar: OpenStack Benefits for VMware
Webinar: OpenStack Benefits for VMwareWebinar: OpenStack Benefits for VMware
Webinar: OpenStack Benefits for VMware
 
TIAD : Automating the modern datacenter
TIAD : Automating the modern datacenterTIAD : Automating the modern datacenter
TIAD : Automating the modern datacenter
 
Optimizing Cloud Foundry and OpenStack for large scale deployments
Optimizing Cloud Foundry and OpenStack for large scale deploymentsOptimizing Cloud Foundry and OpenStack for large scale deployments
Optimizing Cloud Foundry and OpenStack for large scale deployments
 
DCEU 18: Automating Docker Enterprise: Hands-off Install and Upgrade
DCEU 18: Automating Docker Enterprise: Hands-off Install and UpgradeDCEU 18: Automating Docker Enterprise: Hands-off Install and Upgrade
DCEU 18: Automating Docker Enterprise: Hands-off Install and Upgrade
 
Kubernetes Concepts And Architecture Powerpoint Presentation Slides
Kubernetes Concepts And Architecture Powerpoint Presentation SlidesKubernetes Concepts And Architecture Powerpoint Presentation Slides
Kubernetes Concepts And Architecture Powerpoint Presentation Slides
 
Big data and Kubernetes
Big data and KubernetesBig data and Kubernetes
Big data and Kubernetes
 
Enabling NFV features in kubernetes
Enabling NFV features in kubernetesEnabling NFV features in kubernetes
Enabling NFV features in kubernetes
 
Using Docker For Development
Using Docker For DevelopmentUsing Docker For Development
Using Docker For Development
 

Andere mochten auch

BFT223: Chapter 5 selection
BFT223: Chapter 5 selectionBFT223: Chapter 5 selection
BFT223: Chapter 5 selectionsufinozuhaily
 
Chp01 intro
Chp01 introChp01 intro
Chp01 introadm2002
 
How linked in helps you to find a suitable job
How linked in helps you to find a suitable jobHow linked in helps you to find a suitable job
How linked in helps you to find a suitable jobVinish Scaria
 
Tech connect pd data and teaching
Tech connect pd data and teachingTech connect pd data and teaching
Tech connect pd data and teachingLa Shelia Gordon
 
Tungkol kay jose rizal
Tungkol kay jose rizalTungkol kay jose rizal
Tungkol kay jose rizalELISEO4771646
 
Презентація закладу
Презентація закладуПрезентація закладу
Презентація закладуdnz36stan
 
Kanji pict o-graphix
Kanji pict o-graphixKanji pict o-graphix
Kanji pict o-graphixGhjj Ghjj
 

Andere mochten auch (12)

Web API Management
Web API ManagementWeb API Management
Web API Management
 
Mars
MarsMars
Mars
 
บทที่9
บทที่9บทที่9
บทที่9
 
BFT223: Chapter 5 selection
BFT223: Chapter 5 selectionBFT223: Chapter 5 selection
BFT223: Chapter 5 selection
 
Chp01 intro
Chp01 introChp01 intro
Chp01 intro
 
How linked in helps you to find a suitable job
How linked in helps you to find a suitable jobHow linked in helps you to find a suitable job
How linked in helps you to find a suitable job
 
บทที่12
บทที่12บทที่12
บทที่12
 
Tech connect pd data and teaching
Tech connect pd data and teachingTech connect pd data and teaching
Tech connect pd data and teaching
 
Tungkol kay jose rizal
Tungkol kay jose rizalTungkol kay jose rizal
Tungkol kay jose rizal
 
บทที่2
บทที่2บทที่2
บทที่2
 
Презентація закладу
Презентація закладуПрезентація закладу
Презентація закладу
 
Kanji pict o-graphix
Kanji pict o-graphixKanji pict o-graphix
Kanji pict o-graphix
 

Ähnlich wie Deploying Apache Spark as a Service on IBM Power Systems

Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMAlluxio, Inc.
 
AI Scalability for the Next Decade
AI Scalability for the Next DecadeAI Scalability for the Next Decade
AI Scalability for the Next DecadePaula Koziol
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsYong Feng
 
Service quality monitoring system architecture
Service quality monitoring system architectureService quality monitoring system architecture
Service quality monitoring system architectureMatsuo Sawahashi
 
Superior Cloud Economics with Power Systems
Superior Cloud Economics with Power Systems Superior Cloud Economics with Power Systems
Superior Cloud Economics with Power Systems IBM Power Systems
 
Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Enabling a hardware accelerated deep learning data science experience for Apa...
Enabling a hardware accelerated deep learning data science experience for Apa...Indrajit Poddar
 
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...
Scaling out Driverless AI with IBM Spectrum Conductor - Kevin Doyle - H2O AI ...Sri Ambati
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Indrajit Poddar
 
Architecting with power vm
Architecting with power vmArchitecting with power vm
Architecting with power vmCharlie Cler
 
OS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLOS for AI: Elastic Microservices & the Next Gen of ML
OS for AI: Elastic Microservices & the Next Gen of MLNordic APIs
 
Lessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsLessons learned from embedding Cassandra in xPatterns
Lessons learned from embedding Cassandra in xPatternsClaudiu Barbura
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learnJohn D Almon
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Dataconomy Media
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
2689 - Exploring IBM PureApplication System and IBM Workload Deployer Best Pr...
2689 - Exploring IBM PureApplication System and IBM Workload Deployer Best Pr...2689 - Exploring IBM PureApplication System and IBM Workload Deployer Best Pr...
2689 - Exploring IBM PureApplication System and IBM Workload Deployer Best Pr...Hendrik van Run
 
Cloudify 4.6 highlights webinar
Cloudify 4.6 highlights webinarCloudify 4.6 highlights webinar
Cloudify 4.6 highlights webinarCloudify Community
 
Modernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-ArchitectModernizing Testing as Apps Re-Architect
Modernizing Testing as Apps Re-ArchitectDevOps.com
 
Migrating Mission-Critical Workloads to Intel Architecture
Migrating Mission-Critical Workloads to Intel ArchitectureMigrating Mission-Critical Workloads to Intel Architecture
Migrating Mission-Critical Workloads to Intel ArchitectureIntel IT Center
 

Deploying Apache Spark as a Service on IBM Power Systems

  • 1. Lessons Learned from Deploying Apache Spark as a Service on IBM Power Systems in the Cloud
      Indrajit (I.P) Poddar, STSM, IBM Systems Technical Strategy
      Randy Swanberg, DE, IBM Power Systems Software and Solutions
  • 2. Please Note:
      • IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
      • Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
      • The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
      • The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
      • Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  • 3. Agenda
      • Infrastructure considerations for a differentiated cloud data service
      • Apache Spark – a popular big data framework
      • Lessons learned on an alternative infrastructure with OpenPOWER systems
          1. Open source stack for cloud native agile development
          2. Management stack for automation and continuous integration
          3. Efficient resource allocation and scheduling for multi-tenancy
          4. Cloud infrastructure economics
          5. Potential of acceleration under the cloud service
      • Putting it all together
      • Summary / Questions
  • 4. Infrastructure for a differentiated Cloud Data Service
      1. Agile Open Source development experience
          • Dynamic and flexible provisioning and management
          • Automated deployment and continuous integration
      2. Cost effective high performance server infrastructure
      3. Economical cloud storage service with encryption
  • 5. Big Data in the Cloud: example use cases, architectures and components
  • 6. Big Data Journey
      • Figure: Value plotted against Big-Data Maturity. Stage labels: Operations Data Warehouse, Insight Inspired Decision Making, Insight Driven Business Transformation. Annotation: "Most are somewhere here".
      • Cheaper Storage
      • Data Lake
      • ETL Offload
      • Cold Data Offload
      • Queryable Archive
      • Full Data Analysis (not just samples)
      • Extract Value from non-relational data
      • View of all enterprise data
      • Exploratory Analysis and Discovery
      • New Business Models
      • Real time risk aware decision making
      • Real time fraud and threat detection
      • Optimize operations
      • Attract and Retain Customers
  • 7. Demand for Business Value from ALL Data Sources
      • Data sources: Transaction and application data; Machine, sensor data; Enterprise content; Image, geospatial, video; Social data; Third-party data
      • Zones: Deep Analytics data zone; EDW and data mart zone
      • Outcomes: New Customer Insights; Discover Relationship; Risk Aware Decisions; Early Warnings; New Business Opportunities; Fraud Detection
  • 8. What is Apache Spark? Fast and general engine for large-scale data processing
      • Unified Analytics Platform
          – Combine streaming, graph, machine learning and SQL analytics on a single platform
          – Simplified, multi-language programming model
          – Interactive and Batch
      • In-Memory Design
          – Pipelines multiple iterations on single copy of data in memory
          – Superior Performance
          – Natural Successor to MapReduce
      • Stack diagram labels: Spark Core API; languages R, Scala, SQL, Python, Java; libraries Spark SQL, Streaming, MLlib, GraphX
  • 9. Anatomy of Apache Spark as a Cloud Service: stack components, web services, continuous integration
  • 10. Interactive IPython notebook with Matplotlib GUI: host interactive analytic apps on an IPython server to ease code sharing and reuse
  • 11. Architectural components of an Apache Spark Cloud Service
      • Object Store for data
      • Platform as a Service
      • Multi-tenant Spark drivers and executors
      • Multi-tenant interactive Jupyter IPython notebook servers with matplotlib GUI
      • Shared Compute Cluster
      • Infrastructure as a Service
      • Bare metals
      • Virtual Machines
      • Continuous Integration: Build, Deploy, Test
      • Shared storage: IBM Spectrum Scale (GPFS)
  • 12. A prototype deployment on POWER systems
      • Goal: Production deployment
      • IBM Power System S812LC and S822LC, Tyan OpenPOWER
      • Development environment
      • Continuous Integration
      • Deployment Automation
      • Bare metals, Virtual Machines, Docker containers
  • 13. Lessons Learned: open source stack, efficient resource allocation and continuous integration, better economics and potential for acceleration on OpenPOWER systems
  • 14. Analytics open source stack for agile development
  • 15. OpenStack on POWER for Dev-Test environments
      • We used IBM Cloud Manager version 4.3 with OpenStack Kilo on PowerKVM 2.1.1
      • IBM Cloud Orchestrator is another option
  • 16. Continuous Integration with IBM UrbanCode, OpenStack and Docker targeting POWER systems
      • Create multiple deployment and development environments and visual deployment processes in IBM UrbanCode Deploy (UCD)
      • Run only the UCD agents and relay on POWER VMs or bare metals
  • 17. Continuous Integration Flows
      • Components: UrbanCode Deploy Server (x86); Git Server (x86); Asset Repository (x86); Dev-Test Env (OpenPOWER OpenStack VMs); Build Env (POWER Docker Containers); Future Production Env (OpenPOWER Bare metals)
      • Steps:
          1. Check in automation code
          2. Build artifacts in Docker
          3. Store built artifacts in a repository
          4. Pull in artifacts into deployment automation
          5. Deploy artifacts into dev-test env
          6. Deploy artifacts into prod env
  • 18. Efficient resource allocation using Platform Symphony
      • Share system resources (CPU, memory) with a distributed scheduler
      • Platform Symphony with Application Service Controller (ASC) V7.1.1 and the EGO scheduler for Ubuntu 14.04.2 on POWER
      • Platform Symphony + ASC features:
      • Fine-grained scheduling
      • Resource reclaim
      • Standby service
      • Data locality
      • Shared storage through IBM Spectrum Scale™
  • 19. Economics of the Server Infrastructure
      • Attributes directly influencing the economics of hosting a Cloud Service:
          – Number of servers needed to deliver a competitive quality of service and response time
          – Cost of the individual servers or rental
          – Number of users and jobs that can be hosted concurrently (multi-tenancy)
          – End user price charged for the service
      • All of which are directly impacted by the performance of the Server
  • 20. Measuring Performance of Spark on POWER
      • The following charts show performance results comparing multiple Spark workloads from SparkBench, using data sizes from 100GB to 10TB (https://github.com/SparkTC/spark-bench)
      • 7-node cluster of Intel Haswell servers (E5-2620 V3, 12-core, 256GB) vs 7-node cluster of POWER servers (POWER8 S812LC, 10-core, 256GB)
      • Machine Learning (Spark MLlib): Matrix Factorization, Logistic Regression, Support Vector Machine
      • SQL (Spark SQL):
          sqlContext.sql("SELECT COUNT(*) FROM orderTab").count()
          sqlContext.sql("SELECT COUNT(*) FROM orderTab where bid>5000").count()
          sqlContext.sql("SELECT * FROM oitemTab WHERE price>250").count()
          sqlContext.sql("SELECT * FROM oitemTab WHERE price>500").count()
          sqlContext.sql("SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid").count()
      • Graph (Spark GraphX): Page Rank, Triangle Count, Singular Value Decomp++
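The five Spark SQL queries on slide 20 can be tried without a cluster. The sketch below is only a stand-in: it uses Python's built-in sqlite3 module rather than Spark SQL, and the column layouts and rows are invented for illustration. Only the table names, the column names bid, oid and price, and the query text come from the slide.

```python
import sqlite3

# Hypothetical toy schema and rows; the slide gives only table and column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orderTab (oid INTEGER, bid INTEGER);
    CREATE TABLE oitemTab (oid INTEGER, price REAL);
    INSERT INTO orderTab VALUES (1, 4000), (2, 6000), (3, 9000);
    INSERT INTO oitemTab VALUES (1, 100.0), (1, 300.0), (2, 600.0), (3, 700.0);
""")

# The five benchmark queries, text copied from the slide.
QUERIES = [
    "SELECT COUNT(*) FROM orderTab",
    "SELECT COUNT(*) FROM orderTab where bid>5000",
    "SELECT * FROM oitemTab WHERE price>250",
    "SELECT * FROM oitemTab WHERE price>500",
    "SELECT * FROM orderTab r JOIN oitemTab s ON r.oid = s.oid",
]


def run_queries(connection):
    """Run each query and return its fetched rows, in slide order."""
    return [connection.execute(q).fetchall() for q in QUERIES]


results = run_queries(conn)
for query, rows in zip(QUERIES, results):
    print(f"{len(rows):2d} row(s)  {query}")
```

In the slide's Spark form, the trailing `.count()` is the action that forces each lazily defined query to execute; the sqlite3 version executes eagerly, so it simply fetches the rows.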
  • 21. System Performance of Spark on POWER
      • Chart: 1.7X Raw System Performance (Machine Learning, SQL, Graph)
      • Options for the Service Provider:
      • Deliver higher qualities of service
      • 70% faster job completion times on average
      • Faster time-to-insight
      • Charge higher premium for the service
      • Competitive advantage for the service
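As a rough worked check of what the 1.7X figure on slide 21 implies for run time, the sketch below assumes that "performance" means throughput, so that run time scales as the inverse of relative performance. The slide does not state this explicitly, and the function name is illustrative.

```python
def runtime_fraction(relative_performance):
    """Fraction of the baseline run time needed at a given relative performance.

    Assumes performance is throughput, so run time is its inverse.
    """
    return 1.0 / relative_performance


speedup = 1.7  # raw system performance figure from the slide
fraction = runtime_fraction(speedup)
print(f"Run time relative to baseline: {fraction:.2f}")
print(f"Reduction in run time: {1 - fraction:.0%}")
```

Under that assumption a job takes about 59% of the baseline run time, a reduction of about 41%. The slide's "70% faster" therefore reads most naturally as a statement about processing rate rather than a 70% shorter run time.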
  • 22. Per Core Performance of Spark on POWER
      • Chart: 2X Per-core Performance View (Machine Learning, SQL, Graph)
      • 70 POWER8 Cores vs. 84 Intel Cores
      • Enables headroom for better system resource utilization
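The 2X per-core figure can be reproduced from numbers on the earlier slides: the 7-node clusters with 10 and 12 cores per node (slide 20) and the 1.7X system-level result (slide 21). A minimal sketch of that arithmetic, with illustrative names:

```python
NODES = 7
POWER8_CORES_PER_NODE = 10    # S812LC configuration from slide 20
HASWELL_CORES_PER_NODE = 12   # E5-2620 V3 configuration from slide 20
SYSTEM_SPEEDUP = 1.7          # raw system performance from slide 21


def per_core_speedup(system_speedup, cores_a, cores_b):
    """Per-core performance of system A relative to system B."""
    return system_speedup * cores_b / cores_a


power_cores = NODES * POWER8_CORES_PER_NODE
haswell_cores = NODES * HASWELL_CORES_PER_NODE
ratio = per_core_speedup(SYSTEM_SPEEDUP, power_cores, haswell_cores)
print(power_cores, haswell_cores, round(ratio, 2))
```

This gives 70 and 84 cores, matching the slide, and a per-core ratio of about 2.04, which the slide rounds to 2X.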
  • 23. Price Performance of Spark on POWER
      • Chart: 1.5X Price Performance View* (Machine Learning, SQL, Graph)
      • Options for the Service Provider:
          – Spend 33% less on infrastructure supporting the same amount of workload
          – Spend the same on infrastructure but host 50% more workload
          – Lower the price for the service for competitive advantage
      * Based on preliminary SoftLayer pricing targets, subject to change
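The 33% and 50% options both follow from the 1.5X price-performance ratio. The sketch below shows the arithmetic, assuming workload hosted scales linearly with spend; the function names are illustrative.

```python
def cost_saving_same_workload(price_perf):
    """Fraction of spend saved when hosting the same workload."""
    return 1.0 - 1.0 / price_perf


def extra_workload_same_cost(price_perf):
    """Fraction of additional workload hosted for the same spend."""
    return price_perf - 1.0


PRICE_PERF = 1.5  # price-performance ratio from the slide
saving = cost_saving_same_workload(PRICE_PERF)
extra = extra_workload_same_cost(PRICE_PERF)
print(f"Spend {saving:.0%} less for the same workload")
print(f"Host {extra:.0%} more workload for the same spend")
```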
  • 24. But wait... there’s more to the story
      • Chart: Relative System Performance (0 to 3) by Spark workload. Bars: E5-2620 v3, 100GB Mat. Fact., 100GB (in mem) LR, 1TB (in mem) LR, 1TB (50/50) LR, 1TB SVM, 10TB LR, 1TB 5 query, 2TB 5 query, 130GB Page Rank, 1TB Triangle Cnt, 1TB SVD++, AVERAGE. Groups: Machine Learning, SQL, Graph. Callout: 1.7X.
      • A deeper look at the system performance profile for one of the workloads close to our overall average relative performance
      • Machine Learning Logistic Regression on a 1TB data set that had a relative performance of 1.74X
  • 25. More efficient use of resources (Spark 1TB Logistic Regression Example)
      • Charts: CPU utilization (User%, Sys%, Wait%; 0 to 100) over time for POWER (22:44 to 22:55) and for Haswell (10:14 to 10:28); Network I/O in MB/sec (Total-Read, with Total-Write plotted as negative) over the same intervals for POWER and for Haswell
      • POWER
          – CPU headroom to host higher density
          – More data pushed over network due to higher thread density
      • Haswell
          – CPU fully pegged on just this workload
          – Underutilizing the Network Resource
  • 26. Next Steps - Acceleration in the Cloud
      • CAPI Flash for RDD Cache = 4X memory reduction at equal performance (chart: Runtime (ms), 0 to 400000, against Total Heap Memory for Degrees of Separation on Spark; series Disk and CAPI/Flash)
      • RDMA for Spark Shuffle = 30% Better Response Time, Lower CPU Utilization
      • CAPI Flash and RDMA can be leveraged transparently to Spark applications under the Cloud Service
      • Coming: HDFS CAPI FPGA Erasure Code Acceleration, CAPI FPGA Compression Acceleration, ...
  • 27. Acceleration of Spark with GPUs:
      • Adverse Drug Reaction Prediction built on Spark
      • 25X speed up for Building Model stage (using Spark MLlib Logistic Regression)
      • Again, transparent to the Spark Application
      • Game changer for Personalized Medicine
  • 28. More efficient, cost effective, balanced cloud resources
      • Better quality of service through workload acceleration and real time insights
      • Efficient scale out architecture avoiding imbalanced resources
      • New controls to balance resource utilization
      • GPUs and FPGAs for compute offload, consolidation, specialized acceleration
      • CAPI Flash for memory consolidation/expansion, and storage acceleration
      • RDMA for better latency, better network utilization and lower CPU utilization
  • 29. Fostering Acceleration in the Cloud: SuperVessel for OpenPOWER (ptopenlab.com)
  • 30. Putting it All Together
      • Object Store for data
      • Platform as a Service
      • Multi-tenant Spark drivers and executors
      • Multi-tenant interactive Jupyter IPython notebook servers with matplotlib GUI
      • Shared Compute Cluster
      • Infrastructure as a Service
      • Bare metals
      • Virtual Machines
      • Docker
      • Continuous Integration: Build, Deploy, Test
      • Shared storage: IBM Spectrum Scale (GPFS)
  • 31. Summary
      • Big Data solutions in the Cloud demand elasticity and scale
      • Real time insights from all sources of data will become the norm
      • Try open source Apache Spark with IBM Platform Symphony
      • OpenPOWER systems can differentiate your cloud data service through:
          – Improved cloud infrastructure economics and cost performance advantage
          – An agile open source development experience
      • Advanced forms of acceleration in cloud infrastructures will further differentiate services
  • 32. Notices and Disclaimers. Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM.
All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law
  • 33. Notices and Disclaimers Cont’d. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services®, Global Technology Services®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
  • 34. Thank You. Your Feedback is Important! Access the InterConnect 2016 Conference Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.