Data analytics, Spark, Hadoop and AI have become fundamental tools for driving digital transformation. A critical challenge is moving from isolated experiments to an organizational or enterprise production infrastructure. In this talk, we break apart the modern data analytics workflow to focus on the data challenges across the different phases of the analytics and AI life cycle. By adopting a unified approach to data storage for AI and analytics, organizations can reduce costs, modernize their data strategy and build a sustainable enterprise data lake. By anticipating how Hadoop, Spark, TensorFlow, Caffe and traditional analytics such as SAS and HPC can share data, IT departments and data science practitioners can not only co-exist but speed time to insight. We will present the tangible benefits of a reference architecture using real-world installations that span proprietary and open-source frameworks. Using intelligent software-defined shared storage, users are able to eliminate silos, reduce multiple data copies, and improve time to insight.

PALLAVI GALGALI, Offering Manager, IBM, and DOUGLAS O'FLAHERTY, Portfolio Product Manager, IBM
2. Agenda
• IBM Software Defined Storage for Analytics & AI
• IBM AI Infrastructure Reference Architecture
• Why customers are choosing IBM Spectrum Scale storage for Hadoop
• Popular analytics use cases with IBM Spectrum Scale storage
3. IBM Spectrum Scale is flexible and scalable software-defined file storage
(Diagram: a global namespace powered by IBM Spectrum Scale.)
• Access protocols: NFS, SMB, POSIX, HDFS, Object
• Workloads: HPC, genomics, traditional applications, new-generation applications
• Automated data placement and data migration across Flash, Disk, Tape, Shared-Nothing Cluster, Transparent Cloud Tier, JBOD/JBOF and Spectrum Scale RAID
• Enterprise-class functionality: encryption, compression, synchronous replication, asynchronous replication, backup, disaster recovery, audit logging
• 4000+ clients
IBM Spectrum Scale supports file systems with sizes of tens of petabytes that contain billions of files and can be accessed by thousands of nodes in a cluster.
4. IBM Spectrum Scale – Deployment models
• Software: install the software on your own choice of industry-standard x86/POWER servers
• Pre-built systems: Elastic Storage Server (ESS) with Spectrum Scale SW RAID
• Cloud services: Spectrum Scale can be deployed on IBM Cloud and Amazon Web Services (AWS)
(Diagram: example building blocks – ESS 5U84 storage units, EXP3524 expansion enclosures and System x3650 M4 servers running Spectrum Scale.)
5. The Power of One: one enterprise end-to-end solution for big data
Hortonworks:
• #1 pure open-source Hadoop distribution
• 1300+ customers and 2100+ ecosystem partners
• Employs the original architects, developers and operators of Hadoop from Yahoo!
• Best-in-class 24x7 customer support
• Leading professional services and training
IBM:
• #1 SQL engine for complex analytical workloads
• #1 data science platform (source: Gartner)
• Leader in on-premises and hybrid cloud solutions
• OpenPOWER performance leadership
• Software-defined storage with unmatched scalability
Together: the #1 open-source Hadoop platform + IBM's leading value-adds.
7. June 19th Announcement
IBM Systems is announcing IBM PowerAI Enterprise and an AI Infrastructure Reference Architecture for on-premises AI deployments.
IBM Systems is addressing the challenges organizations face when experimenting with PoCs, growing into multitenant production systems, and then expanding to enterprise scale, all while integrating into an organization's existing IT infrastructure.
With a set of easy-to-use, integrated software tools built on optimized, accelerated hardware, the architecture enables organizations to jump-start AI and deep learning projects, speeds time to model accuracy and provides enterprise-grade security, interoperability and support.
8. AI Examples in Every Industry
• Autonomous driving and accident avoidance
• Location-based advertising
• Sentiment analysis of what's hot and what the problems are
• Market prediction and fraud/risk analysis
• Experiment sensor analysis
• Drilling exploration sensor analysis
• Consumer sentiment analysis
• Sensor analysis for optimal traffic flows
• Smart-meter analysis for network capacity
• Threat analysis: social media monitoring, video surveillance
• Clinical trials, drug discovery and genomics
• People and career matching
• Patient sensors and medical image interpretation
• Captioning, search and real-time translation
• Manufacturing quality and warranty analysis
9. Data Science Is a Team Sport – and Iterative
Workflow: extract data → prepare data → build models → train models → evaluate → deploy → use models → monetize ($$$) → monitor, then iterate.
Building cognitive apps using deep learning requires multiple skill sets: business analyst, data engineer, data scientist, app developer and DevOps. It requires connected infrastructure for data, development and iteration; a common data platform and workflow is crucial for enterprise success.
IT supports and services the complete workflow.
10. This is not easy…
91% of I&O leaders across inquiries cited "data" as a main inhibitor of AI initiatives.
Source: Gartner, "AI State of the Market – and Where HPC Intersects"
11. Workflow and data flow are complex
Data sources – traditional business systems, IoT & sensors, collaboration partners, mobile apps & social media, and legacy systems – supply both new data and years of historical data.
• Data preparation: data cleansing & pre-processing produce the training and testing datasets. Weeks and months of heavy I/O.
• Build, train, optimize models: AI deep learning frameworks (TensorFlow and Caffe) with monitor & advise instrumentation, distributed & elastic deep learning, and parallel hyper-parameter search & optimization over network models and hyper-parameters. Days and weeks, iterating.
• Inference: deploy the trained model in production. Seconds to results.
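The parallel hyper-parameter search named above is straightforward to picture: trials are independent, so they can fan out across workers. A minimal random-search sketch, with a deterministic toy loss surface standing in for a real training run (the parameter names and ranges are illustrative only, not tied to any framework):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def validation_loss(lr: float, batch_size: int) -> float:
    # Stand-in for a real training run: a toy loss surface with a
    # minimum near lr=0.01, batch_size=64.
    return (lr - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64.0

def sample_config(rng: random.Random) -> dict:
    return {
        "lr": 10 ** rng.uniform(-4, -1),  # log-uniform learning rate
        "batch_size": rng.choice([16, 32, 64, 128, 256]),
    }

def random_search(n_trials: int = 32, seed: int = 0) -> tuple[dict, float]:
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(n_trials)]
    # Trials are independent, so evaluate them in parallel across workers.
    with ThreadPoolExecutor(max_workers=8) as pool:
        losses = list(pool.map(lambda c: validation_loss(**c), configs))
    best = min(range(n_trials), key=losses.__getitem__)
    return configs[best], losses[best]

best_cfg, best_loss = random_search()
print(best_cfg, best_loss)
```

In a cluster, the worker pool would be GPU nodes scheduled by the elastic deep learning fabric rather than threads, but the fan-out/fan-in shape is the same.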
12. Data Source → Model Training → Inference
• Data sources: production data, sensor data, data from collaboration partners, data from mobile apps and social media, and legacy data – new data plus years of data.
• Data preparation: pre-processing into training and testing datasets. Hours of preparation.
• Model training: AI deep learning frameworks (TensorFlow and IBM Caffe), monitor & advise instrumentation, distributed & elastic deep learning (fabric), and parallel hyper-parameter search & optimization, iterating over network models and hyper-parameters. Weeks and months of training.
• Inference: deploy in production using the trained model. Seconds to results.
13. Data requirements vary significantly
• Data variety, data quantity, data quality and data efficiency
• Geo-dispersed, on-prem & cloud; data gravity
• HDFS/Spark; workflow integration; model velocity
• Data access density; data velocity (low latency, high throughput); data caching
• Data security, governance and resilience
14. AI Adoption Cycle
Experimentation:
– Single node
– Single user/tenant
– Small-scale data
– Algorithm prototyping, hyperparameter optimization
Production:
– Expanding use cases
– Multi-node cluster
– Medium-scale data
– Security
Expansion:
– Data science shared service
– Multitenant
– Upstream data pipeline
– Model iteration
– Scalable inference
15. AI Data Journey
Experimentation:
– Single node
– Single user/tenant
– Small-scale data
– Algorithm prototyping, hyperparameter optimization
Production:
– Expanding use cases
– Multi-node cluster
– Medium-scale data
– Security
Expansion:
– Data science shared service
– Multitenant
– Upstream data pipeline
– Model iteration
– Scalable inference
Hadoop and Spark are the choice for the data pipeline.
17. Why customers are choosing Spectrum Scale storage for Hadoop
Faster ingest, unmatched scalability and up to 60% less storage footprint for Hadoop workloads.
1. Reduce datacenter footprint and get faster ingest with in-place analytics: access the data using any of the industry-standard protocols (NFS, SMB, POSIX, Object or the HDFS API), with no need to maintain separate copies for different applications.
2. Grow storage independent of compute with the best data protection technology: with the pre-integrated ESS system, eliminate the need for 3 copies of data through SW RAID, get faster disk rebuilds and avoid data corruption.
3. Extreme scalability with a parallel file system architecture: every node serves data + metadata, so there is no centralized metadata node bottleneck; scale to billions of files.
4. A global namespace that spans geographies: stretched clusters and active–active replicas of data enable real-time global collaboration.
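The in-place analytics point rests on a simple idea: with one shared file system behind every interface, an HDFS path and a POSIX path name the same bytes, so "sharing" is a path translation rather than a data copy. A hypothetical sketch (the hdfs:// authority and the /gpfs/datalake mount point are made-up examples, not defaults of any product):

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mount point where the shared file system is visible as POSIX.
GPFS_MOUNT = "/gpfs/datalake"

def hdfs_uri_to_posix(uri: str, mount: str = GPFS_MOUNT) -> str:
    """Map an HDFS-style URI onto the POSIX path of the same file.

    With one file system behind both interfaces, the HDFS path component
    is just a path relative to the file system root, so no copy is
    needed -- only a path translation.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {uri}")
    rel = PurePosixPath(parsed.path).relative_to("/")
    return str(PurePosixPath(mount) / rel)

print(hdfs_uri_to_posix("hdfs://nn1:8020/user/alice/clicks.parquet"))
# -> /gpfs/datalake/user/alice/clicks.parquet
```

An MPI or SAS job can then open that POSIX path directly while a Spark job reads the hdfs:// URI, both touching the same file.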
18. Data Lake: Up to 60% less storage footprint
(Ingest via file, object or direct POSIX access; analyze raw data in place.)
Less hardware:
• HDFS shared-nothing: 15 PB of physical storage for 5 PB usable
• Spectrum Scale on ESS: 6.5 PB of physical storage for 5 PB usable
Analytics in place:
• No need to maintain copies of data for traditional applications and analytics applications
Multi-purpose shared data lake:
• Shared by Hadoop and many other use cases
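The "up to 60%" figure follows from simple overhead arithmetic, spelled out below: 3.0x physical-to-usable for HDFS 3-way replication versus roughly 1.3x for software RAID / erasure coding with the ~30% overhead quoted for ESS.

```python
def physical_footprint(usable_pb: float, overhead: float) -> float:
    """Physical capacity needed to deliver usable_pb of usable data.

    overhead is the physical-to-usable ratio: 3.0 for HDFS 3-way
    replication, ~1.3 for SW RAID / erasure coding with ~30% overhead.
    """
    return usable_pb * overhead

usable = 5.0                                # PB of usable capacity
hdfs = physical_footprint(usable, 3.0)      # 15 PB physical
ess = physical_footprint(usable, 1.3)       # 6.5 PB physical
savings = 1 - ess / hdfs                    # ~0.57, i.e. "up to ~60% less"
print(hdfs, ess, savings)
```

The same arithmetic explains the talk-track claim later in the notes that ESS achieves full resiliency with only ~30% overhead instead of 3x replication.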
19. HDP on Power with Elastic Storage Server
• Improve TCO: up to 3X reduction of storage and compute infrastructure moving to Power Systems and Elastic Storage Server vs. commodity scale-out x86. Less infrastructure means reduced costs in many areas (energy, cooling, server administration, floor space, SW licensing).
• Position for future growth and avoid hitting the data center wall with cluster sprawl: separating storage from compute enables selection of the best compute node for the workload, and Power has the greatest range of options.
(Diagram: IBM Power nodes running HDP services and the Spectrum Scale client + HDFS connector, connected over InfiniBand (RDMA), 40 GigE or 10 GigE to the Elastic Storage Server, powered by Spectrum Scale.)
21. Challenges…
• Expensive EDW (enterprise data warehouse) setups
• Silos of infrastructure for various analytics workflows
• Multiple copies of the same data
• Time-consuming data ingest cycles
• Unmanageable analytics cluster sprawl
22. Popular use cases that help eliminate analytics silos
I. EDW Optimization – optimize the data warehouse by shifting the right workloads to Hadoop. Reduce cost & improve efficiency.
II. Integrated HPC and Hadoop – efficiently transform data into insights with a single data lake for HPC & Hadoop. Faster & better insights.
III. Hadoop Storage Tiering – disaggregate storage and compute for better utilization. Reduce cluster sprawl.
IV. Unified Analytics Workflows – a single data lake for Hadoop and non-Hadoop analytics. Improve data governance.
23. I. EDW Optimization
Optimize the data warehouse by shifting the right workloads to Hadoop.
Archive data away from the EDW:
- Move cold or rarely used data to Hadoop as an active archive
- Store more of the data, longer
Offload costly ETL processing:
- Free your EDW to perform high-value functions like analytics & operations, not ETL
- Use Hadoop for advanced ETL
Optimize the value of your EDW:
- Use Hadoop to refine new data sources, such as web and machine data, for new analytical context
Reduce migration effort & the skill-set gap:
- Use existing investments in Oracle/DB2/Netezza skills
- BigSQL allows you to migrate applications without major code rewrites or additional SQL development
Control cluster sprawl:
- Grow storage independent of compute with ESS
- POWER servers deliver 1.7x the throughput of Hortonworks on x86
- Up to 60% less storage footprint
(Diagram: the enterprise data warehouse – DB2 / Dashdb / Oracle / Netezza / Teradata etc. – keeps hot data on ESS for speed; HDP on Power keeps cold data, archive data and new sources on an ESS data lake; BigSQL on Power provides the SQL interface; analytics software such as SAS Grid and SAP HANA sits on top; HDF on Power ingests new streaming/IoT data sources; everything runs on Spectrum Scale.)
A financial services company in Europe is optimizing their DB2 warehouse using the HDP, BigSQL, Power and ESS combination.
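The "archive data away from the EDW" step can be sketched in a few lines. Below, sqlite3 stands in for the warehouse and a directory of CSV files stands in for the Hadoop data lake; the table and column names (sales, event_date) are invented for the example, and a real deployment would land the data in HDP and query it back through BigSQL:

```python
import csv
import sqlite3
from pathlib import Path

def offload_cold_rows(conn: sqlite3.Connection, table: str,
                      cutoff: str, lake_dir: Path) -> int:
    """Copy rows older than cutoff to a CSV in the data lake, then delete them.

    The table name comes from trusted config (f-string interpolation of an
    identifier); row values are always parameterized.
    """
    cur = conn.execute(f"SELECT * FROM {table} WHERE event_date < ?", (cutoff,))
    rows = cur.fetchall()
    if rows:
        lake_dir.mkdir(parents=True, exist_ok=True)
        out = lake_dir / f"{table}_pre_{cutoff}.csv"
        with out.open("w", newline="") as fh:
            w = csv.writer(fh)
            w.writerow([d[0] for d in cur.description])  # header row
            w.writerows(rows)
        conn.execute(f"DELETE FROM {table} WHERE event_date < ?", (cutoff,))
        conn.commit()
    return len(rows)
```

The warehouse keeps only hot rows while the full history stays queryable from the lake, which is the cost argument the slide makes.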
24. II. Integrated HPC and Hadoop
Efficiently transform data into insights with a single data lake for HPC & Hadoop.
Extend HPC to add modern analytics capabilities:
- Efficient movement of data between modern and traditional applications with a common namespace
- Spectrum Scale in-place analytics enables accessing the same data via NFS/SMB/Object/POSIX/HDFS without any modifications to the data
- Improve data reliability and governance with a single data lake
Ingest fast and improve time to insight:
- The POSIX interface combined with ESS flash storage gives extremely fast ingest
- A common namespace enables running some edge analytics at the ingest layer as well
Control cluster sprawl:
- Grow storage independent of compute with ESS
- Up to 60% less storage footprint
- POWER servers deliver 1.7x the throughput of Hortonworks on x86
(Diagram: traditional HPC – open/read/write, MPI, C code, Python etc. – uses the POSIX interface; Hadoop on HDP on Power – MapReduce, Spark, ML/DL etc. – uses the HDFS interface; Spectrum Scale protocol nodes provide NFS/SMB/Object; ESS for speed handles fast POSIX ingest and ESS serves the data lake, all on Spectrum Scale.)
NASA and a healthcare company from the Middle East are using a common Spectrum Scale data lake to efficiently derive insights from traditional HPC and Hadoop analytics.
25. III. Hadoop Storage Tiering
Disaggregate storage and compute for better utilization.
Use ESS as an ingest tier for an existing Hadoop setup:
- Get extremely fast ingest with POSIX and flash storage
- Run in-place analytics directly on tier-1 storage
Use ESS as a secondary tier for an existing Hadoop setup:
- Grow storage independent of compute
- Reduce cluster sprawl
- Share data between old & new Hadoop setups
- Avoid copying data between the two clusters with a common data lake
- Introduce new IBM Power-based HDP clusters for demanding next-gen analytics workflows on the same data lake
(Diagram: ESS for speed handles fast POSIX ingest; the existing Hadoop cluster on native HDFS storage and a new HDP-on-Power Hadoop cluster both reach the ESS data lake through the HDFS interface.)
An Indian conglomerate is implementing an ESS-based ingest tier for their existing Hadoop data lake.
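A toy illustration of the demotion half of tiering: move files whose data has gone cold from the fast ingest tier to the capacity tier, preserving the directory layout. The mount points are hypothetical, and a real Spectrum Scale setup would express this as ILM migration policies inside the file system rather than an external script:

```python
import shutil
import time
from pathlib import Path

def demote_cold_files(src: Path, dst: Path, max_age_s: float) -> list[Path]:
    """Move files not modified within max_age_s from the fast tier (src)
    to the capacity tier (dst), keeping relative paths intact."""
    now = time.time()
    moved = []
    # Materialize the listing first so moves don't disturb the scan.
    for f in [p for p in src.rglob("*") if p.is_file()]:
        if now - f.stat().st_mtime > max_age_s:
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(target))
            moved.append(target)
    return moved

# Hypothetical tier mount points in the same namespace:
# demote_cold_files(Path("/gpfs/ingest"), Path("/gpfs/datalake"), 7 * 86400)
```

Because both tiers sit in one namespace, consumers keep a stable view of the data even as blocks move between flash and capacity storage.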
26. IV. Unified Analytics Workflows
A single data lake for Hadoop and non-Hadoop analytics.
All analytics workflows on common storage:
- Improve data reliability and governance with a single data lake for Hadoop and non-Hadoop analytics setups
- Build ML/DL workflows that use multiple analytics platforms
- Share data across analytics workflows as appropriate
Ingest fast and improve time to insight:
- The POSIX interface combined with ESS flash storage gives extremely fast ingest
Control cluster sprawl:
- Grow storage independent of compute with ESS
- Up to 60% less storage footprint
- POWER servers deliver 1.7x the throughput of Hortonworks on x86
(Diagram: other analytics platforms – SAS Grid, SAP HANA/Vora, ML/DL, Conductor with Spark etc. – use the POSIX interface, while Hadoop – MapReduce, Spark, ML/DL etc. on HDP on Power – uses the HDFS interface; ESS for speed handles fast POSIX ingest and ESS serves the data lake, all on Spectrum Scale.)
A bank in South Africa is implementing HDP and SAS Grid software on a common ESS-based infrastructure.
Here is a snapshot of what Spectrum Scale has to offer:
It supports accessing data through various access protocols (POSIX, NFS, SMB, HDFS, etc.) and can therefore serve as a data lake that consolidates all of your organization's data. This strengthens HDP use cases such as EDW offload, active archive and a single view of the customer.
In the background, Spectrum Scale offers automated data placement on any storage medium (flash, disk, tape, cloud, etc.), which helps with storage utilization and cost optimization.
We already have 4000+ enterprise customers using Spectrum Scale today as their data store.
TALK TRACK
Together, we are better able to address the changing dynamics we've just outlined, solve the associated challenges and create valuable outcomes. By joining forces, we have brought together Hortonworks' deep expertise in data with IBM's data science platform, a leader in the Gartner Magic Quadrant, and the best SQL engine, broadening our toolset to help you accelerate your business. Additionally, we have added IBM Systems' differentiated Power and SDS offerings to improve the ROI on these investments.
The IBM Reference Architecture for AI Infrastructure is intended to be used as a reference by data scientists and IT professionals who are defining, deploying and integrating AI/ML/DL solutions into an organization. It describes an architecture that facilitates a productive proof of concept (PoC) and allows growth into a multitenant production system with sustained growth to enterprise scale, while integrating the solution into an organization's existing IT infrastructure.
Every one of these is a use case IBM Systems has worked on this year.
Not only does the AI workflow involve multiple team members working on complex, often manual tasks; each step in the pipeline can take weeks or months depending on a variety of factors. Data scientists are key contributors to successful model building and training; however, those with experience orchestrating the ML/DL workflow and emerging AI frameworks and applications are hard to find. It requires coordination and cooperation.
Even experienced data scientists can be challenged by the distributed data ingest and preparation required for large, complex AI data sets, which can consume 80% of the time spent on an AI project.
Talking points
Data Sources
Data Preparation
High data quality is critical to the success of any AI initiative, and the very large, diverse data sets needed for AI (typically 8–10X larger than those used for traditional analytics, per IDC) create a data integration, transformation and labeling challenge that consumes significant time, human effort and infrastructure resources. Data sensitivity requires multi-layered security across the AI data pipeline.
Model Build, Train, Optimize
AI is built on a complex mix of emerging, rapidly changing technologies and requires an accelerated, high-performance computing environment. Steep data-scientist learning curves and open-source framework complexity mean it can take weeks to get up and running.
Building accurate AI models is a time-intensive, often manual process of experimentation and optimization across complex combinations of features and parameters. Training models requires massive amounts of data used across millions of jobs to make a model intelligent. Accessing distributed resources is often a manual, rigid process, resulting in fixed, inflexible processing schemas and training runs that can last weeks or months.
Will add talking points: "No one is on the same curve…"
TALK TRACK
Hortonworks Powers the Future of Data: data-in-motion, data-at-rest, and Modern Data Applications.
[NEXT SLIDE]
TALK TRACK
Spectrum Scale has its roots in HPC and runs on a number of the world's supercomputers. Customers have now started adopting it as the SDS behind Hadoop/Spark-based data lakes as well. Apart from the standard shared-storage advantages of growing storage independent of compute and eliminating the 3-way replication needed by standard HDFS, Spectrum Scale is also being adopted for its unmatched scalability and faster ingest.
- Reduce datacenter footprint with the industry's best in-place analytics: no need to maintain copies of the data for applications requiring different access methods.
- True software-defined storage that can be purchased as software only or as a pre-integrated system: you can start small with the SW-only option and still get enterprise storage system benefits from day 1. ESS brings the advantages of software RAID and eliminates the need for 3-way replication for data protection.
- Extreme scalability with a parallel file system architecture: grow your Hadoop environment as your data grows, without system-imposed limitations. Scales up to billions of files and thousands of nodes, whereas HDFS scales to about 350 million files due to centralized NameNode limitations.
- A global namespace that can span geographies: global and international organizations can form data lakes across the globe.
POSIX support is one of the key differentiators Spectrum Scale brings to the table, making the Hortonworks Data Platform stronger against MapR.
TALK TRACK
As Hadoop clusters grow, it is quite typical for compute nodes to become under-utilized as they are added primarily to increase storage capacity. A cluster design in which compute and storage are locked together in a common building block removes a great deal of flexibility and can result in cluster sprawl and mounting TCO for data center space, power, SW licensing, and administration and management.
The IBM Elastic Storage Server, which includes the IBM Spectrum Scale file system, is 100% HDFS-compatible and separates the storage into a high-performance, resilient storage appliance (ESS). The compute nodes can then be right-sized for the demands of the workload, including mixing in workload-optimized nodes such as GPUs, and Power has the most performant compute nodes.
This approach has the significant advantage of a single data storage plane, and a single version of the data, shared between Hadoop and traditional analytics workloads. There is no need to copy data between your POSIX and Hadoop environments, and no need for 3X replication, as is typical in local-storage Hadoop models, because ESS includes native SW RAID for complete resiliency with only ~30% overhead.