Organisations of all kinds recognise that they must rapidly digitise their businesses to remain competitive in the face of massive technological change. They must develop new business models and new routes to customer and partner engagement using the power of digital. That's why over 80% of the CEOs of large European companies have digital transformation (DX) at the centre of their corporate strategy. As part of this shift, forward-thinking companies are investing heavily in becoming data-driven organisations, building an evidence-based culture that expands their capacity to collect, analyse and monetise data in areas such as enhancing customer experiences, empowering the workforce and rethinking business models.
19. Data Center Group
Intel® Xeon® Scalable Processors
- Begin your AI journey today using existing, familiar infrastructure
- DL training in hours, not days, with up to 113x² performance vs. the prior generation (2.2x excluding optimized SW¹)
- Robust support for the full range of AI deployments
- Scalable performance for the widest variety of AI and other data-center workloads, including deep learning
¹,² Configuration details on slides 4, 5, 6.
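The two headline numbers can be related to each other. If, as the footnote framing suggests, the 113x total gain and the 2.2x hardware-only gain compose multiplicatively, the contribution of software optimization can be backed out. This is an illustrative model, not an Intel-published figure:

```python
# Back out the software-optimization factor from the slide's two claims,
# assuming the gains compose multiplicatively (an illustrative assumption).
total_speedup = 113.0  # optimized software, Xeon Scalable vs. prior gen
hw_only = 2.2          # same comparison excluding software optimizations
sw_factor = total_speedup / hw_only
print(round(sw_factor, 1))  # → 51.4
```

In other words, under this reading roughly 51x of the claimed gain comes from the optimized software stack rather than the silicon generation alone.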
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult
other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For
more complete information visit: http://www.intel.com/performance Source: Intel measured as of November 2016
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These
optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to
Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction
sets covered by this notice.
Notice Revision #20110804
The AI you need, on the chip you know:
- Built-in ROI
- Potent Performance
- Production Ready
20. Data Center Group
Intel’s Role in Accelerating Analytics & AI
Holistic Strategy from Edge-Cloud to the Enterprise
¥Note: Intel® Data Analytics Acceleration Library, Intel® Math Kernel Library, Intel® Math Kernel Library for Deep Neural Networks, BigDL: Distributed Deep Learning on Apache Spark*, MLlib: Apache Spark's scalable machine learning library
*Other names and brands may be claimed as the property of others.
Co-Optimizing Applications
Optimized Libraries: Intel® MKL¥, Intel® MKL-DNN¥, Intel® DAAL¥, Intel® Distribution for Python*; open source: Intel® Nervana™ Graph, Movidius MvTensor Library, MLlib*, BigDL
Enabling Hardware/Software: Compute, Memory & Storage, Networking, Lake Crest
Artificial Intelligence Solutions
21. Data Center Group
BigDL – DL On Your Existing Infrastructure, Now
Make deep learning more accessible to the big data and data science communities
- Continue using familiar SW tools and HW infrastructure to build deep learning applications
- Analyze "big data" using deep learning on the same Apache Hadoop*/Spark* cluster where the data are stored
- Add deep learning functionality to existing Big Data (Spark) programs and/or workflows
- Leverage existing Hadoop/Spark clusters to run deep learning applications
- Dynamically share the cluster with other workloads (e.g., ETL, data warehouse, feature engineering, statistical machine learning, graph analytics)
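To make "run deep learning on your existing cluster" concrete, a BigDL job is typically launched as an ordinary Spark application. The following is a hedged sketch only: the BIGDL_HOME path, the jar/zip file names and version strings, and the training-script name are all illustrative assumptions that depend on your BigDL release and cluster layout.

```shell
# Hypothetical BigDL Python job submission on an existing Hadoop/Spark cluster.
# All paths and version strings below are illustrative assumptions.
BIGDL_HOME=/opt/bigdl
spark-submit \
  --master yarn --deploy-mode client \
  --num-executors 4 --executor-cores 8 \
  --properties-file "${BIGDL_HOME}/conf/spark-bigdl.conf" \
  --jars "${BIGDL_HOME}/lib/bigdl-SPARK_2.2-0.3.0-jar-with-dependencies.jar" \
  --py-files "${BIGDL_HOME}/lib/bigdl-0.3.0-python-api.zip" \
  train_model.py   # your BigDL training script (hypothetical name)
```

Because the job goes through spark-submit like any other Spark workload, the cluster scheduler (YARN here) can share resources between the DL job and the ETL or analytics jobs mentioned above.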
22. Data Center Group
BigDL Industry Support – Start Today!
Technology | Cloud Service Providers | End Users
23. Data Center Group
More Resources…
www.intel.com/bigdata
www.intel.com/ai
www.intel.com/software
Thank You!
32. Data Center Group
Notices and Disclaimers
Slide 23, "Potent Performance", footnote #1 (2.2x performance):
2.2x higher deep learning training and inference performance than the prior generation.
New platform: 2S Intel® Xeon® Platinum 8180 CPU @ 2.50GHz (28 cores), HT disabled, turbo disabled, scaling governor set to "performance" via the intel_pstate driver, 384GB DDR4-2666 ECC RAM. CentOS Linux* release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC). Performance measured with environment variables KMP_AFFINITY='granularity=fine, compact', OMP_NUM_THREADS=56; CPU frequency set with cpupower frequency-set -d 2.5G -u 3.8G -g performance.
Prior-generation platform: 2S Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz (22 cores), HT enabled, turbo disabled, scaling governor set to "performance" via the acpi-cpufreq driver, 256GB DDR4-2133 ECC RAM. CentOS Linux release 7.3.1611 (Core), Linux kernel 3.10.0-514.10.2.el7.x86_64. SSD: Intel® SSD DC S3500 Series (480GB, 2.5in SATA 6Gb/s, 20nm, MLC). Performance measured with environment variables KMP_AFFINITY='granularity=fine, compact,1,0', OMP_NUM_THREADS=44; CPU frequency set with cpupower frequency-set -d 2.2G -u 2.2G -g performance.
Neon: ZP/MKL_CHWN branch, commit id 52bd02acb947a2adabb8a227166a7da5d9123b6d. Dummy data was used. The main.py script was used for benchmarking, in mkl mode. ICC version 17.0.3 20170404; Intel® MKL small libraries version 2018.0.20170425. Inference and training throughput use FP32 instructions.
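For convenience, the Skylake-side run environment from the footnote above can be expressed as shell commands. The values are copied verbatim from the configuration text; the frequency step requires root and the cpupower tool, so it is shown commented out.

```shell
# Skylake (Xeon Platinum 8180) run environment from the footnote above.
# Values copied from the configuration text.
export KMP_AFFINITY='granularity=fine, compact'
export OMP_NUM_THREADS=56
# sudo cpupower frequency-set -d 2.5G -u 3.8G -g performance
```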
33. Data Center Group
Notices and Disclaimers
Slide 23, "Potent Performance", footnote #2 (113x):
https://www.intel.com/content/www/us/en/benchmarks/server/xeon-scalable/xeon-scalable-artificial-intelligence.html
Parameter | 2S Intel® Xeon® Platinum 8180 | 2S Intel® Xeon® E5-2699 v4
Platform | CPU @ 2.50GHz (28 cores) | CPU @ 2.20GHz (22 cores)
Hyper-Threading | Disabled | Enabled
Turbo | Disabled | Disabled
Driver | Scaling governor set to "performance" via intel_pstate | Scaling governor set to "performance" via acpi-cpufreq
Memory | 384GB DDR4-2666 ECC RAM | 256GB DDR4-2133 ECC RAM
OS | CentOS* Linux release 7.3.1611 (Core) | CentOS* Linux release 7.3.1611 (Core)
Kernel | Linux 3.10.0-514.10.2.el7.x86_64 | Linux 3.10.0-514.10.2.el7.x86_64
SSD | Intel® SSD DC S3700 Series (800GB, 2.5in SATA 6Gb/s, 25nm, MLC) | Intel® SSD DC S3500 Series (480GB, 2.5in SATA 6Gb/s, 20nm, MLC)
Measurement variables | KMP_AFFINITY='granularity=fine, compact', OMP_NUM_THREADS=56; CPU freq set with cpupower frequency-set -d 2.5G -u 3.8G -g performance | KMP_AFFINITY='granularity=fine, compact,1,0', OMP_NUM_THREADS=44; CPU freq set with cpupower frequency-set -d 2.2G -u 2.2G -g performance
Caffe revision | http://github.com/intel/caffe/, revision f96b759f71b2281835f690af267158b82b150b5c | (same)
Other arguments | Training measured with the "caffe time" command; Caffe run with "numactl -l" | Training measured with the "caffe time" command
Dataset | Dummy dataset for "ConvNet" topologies; for other topologies, data stored on local storage and cached in memory before training | (same)
Topologies | Specs from https://github.com/intel/caffe/tree/master/models/intel_optimized_models (GoogLeNet v1) | (same)
Compiler | Intel C++ compiler 17.0.2 20170213 | GCC 4.8.5
Library | Intel® MKL small libraries 2018.0.20170425 | Intel® MKL small libraries 2017.0.2.20170110
34. Data Center Group
Decision Support Workload Performance Comparison: Hardware Configuration
Processors | Intel® Xeon® Platinum 8160 | Intel® Xeon® E5-2699 v4
Nodes in cluster | 4 (1 master + 3 workers) | 4 (1 master + 3 workers)
Sockets per node | 2 | 2
Cores per node | 48 cores / 96 threads | 44 cores / 88 threads
Clock | 2.1 GHz (3.70 GHz max) | 2.2 GHz (3.60 GHz max)
Cache | 33 MB L3 cache | 55 MB Smart Cache
Memory | 384GB DDR4 (12 x 32GB, 2666 MT/s) | 384GB DDR4 (24 x 16GB, 2133 MT/s)
Storage | 8 x 800GB SATA SSD | 8 x 800GB SATA SSD
Network | 10 Gigabit | 10 Gigabit
Notices and Disclaimers
35. Data Center Group
BIOS Knob | SKX | BDX
BIOS version | SE5C620.86B.01.00.0470.040720170855 | SE5C610.86B.01.01.0018.072020161249
Hyper-Threading | Enabled | Enabled
Other options | Default | Default
Decision Support Workload Performance Comparison
Notices and Disclaimers
36. Data Center Group
Decision Support Workload Performance Comparison
* Software Stack A – old software stack, with older software component versions
** Software Stack B – new software stack, with upgraded software component versions (more software optimizations included, such as Hive Parquet vectorization)

Software Stack A* (identical on SKX and BDX):
OS | CentOS 7.3
Kernel | 3.10.0-514.el7.x86_64
Java | Oracle JDK 1.8.0_121
Hadoop | 2.7.3
File system | HDFS
Hive | 2.0.0
Spark | 1.6.3

Software Stack B** (identical on SKX and BDX):
OS | CentOS 7.3
Kernel | 3.10.0-514.el7.x86_64
Java | Oracle JDK 1.8.0_121
Hadoop | 2.7.3
File system | HDFS
Hive | 3.0.0-SNAPSHOT (commit id: 3330403)
Spark | 2.0.2
Notices and Disclaimers
37. Data Center Group
Hardware Configuration (each data node)
Processors | E5-2697 v4 (BDX) | Intel® Xeon® Platinum 8168 (SKX)
Nodes | 8 | 8
Sockets | 2 | 2
Cores per socket | 18 cores / 36 threads | 24 cores / 48 threads
Clock | 2.3 GHz | 2.7 GHz
L3 cache | 45 MB | 33 MB
Memory | 768 GB (24 x 32GB Samsung DIMMs @ 2133/2400 MT/s) | 768 GB (12 x 64GB Micron DIMMs @ 2400 MT/s)
Data storage (SATA3 SSDs) | 2 x 2TB + 2 x 1TB | 2 x 2TB + 2 x 1TB
Network | 1 x 10 Gbps Ethernet | 1 x 10 Gbps Ethernet
TPCx-BB and HiBench System Configuration: Hardware
Notices and Disclaimers
38. Data Center Group
BigBench and HiBench System Configuration: Software
Software Configuration
OS CentOS release 7.3
Kernel 3.10.0-514.el7.x86_64
Java 1.8.0_131
Python 2.7.5
Hadoop 2.7.3
File System HDFS
Spark 2.2.0
Notices and Disclaimers
39. Data Center Group
Data Scientists: Libraries, Frameworks & Tools

Intel® MKL
- Overview: High-performance math primitives granting a low level of control
- Primary audience: Developers of higher-level libraries and applications
- Example usage: Framework developers call matrix multiplication and convolution functions

Intel® MKL-DNN
- Overview: Free, open-source DNN functions for high-velocity integration with deep learning frameworks
- Primary audience: Developers of the next generation of deep learning frameworks
- Example usage: A new framework with functions developers call for maximum CPU performance

Intel® MLSL
- Overview: Primitive communication building blocks to scale deep learning framework performance over a cluster
- Primary audience: Deep learning framework developers and optimizers
- Example usage: A framework developer calls functions to distribute Caffe training compute across an Intel® Xeon Phi™ cluster

Intel® Data Analytics Acceleration Library (DAAL)
- Overview: Broad, object-oriented data analytics acceleration library supporting distributed ML at the algorithm level
- Primary audience: Wider data analytics and ML audience; algorithm-level development for all stages of data analytics
- Example usage: Call the distributed alternating least squares algorithm for a recommendation system

Intel® Distribution for Python
- Overview: The most popular and fastest-growing language for machine learning
- Primary audience: Application developers and data scientists
- Example usage: Call scikit-learn's k-means function for credit card fraud detection

Open Source Frameworks
- Overview: Toolkits driven by academia and industry for training machine learning algorithms
- Primary audience: Machine learning app developers, researchers and data scientists
- Example usage: Script and train a convolutional neural network for image recognition

Intel® Deep Learning SDK
- Overview: Accelerate deep learning model design, training and deployment
- Primary audience: Application developers and data scientists
- Example usage: Deep learning training and model creation, with optimization for deployment on constrained end devices

Intel® Computer Vision SDK
- Overview: Toolkit to develop and deploy vision-oriented solutions that harness the full performance of Intel CPUs and SoC accelerators
- Primary audience: Developers who create vision-oriented solutions
- Example usage: Use deep learning to do pedestrian detection
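To give the "scikit-learn k-means for fraud detection" usage some substance, here is a minimal pure-stdlib sketch of the k-means idea on hypothetical 2-D transaction features. The data, the naive first-k initialization, and the fraud framing are all illustrative assumptions; a real pipeline would use scikit-learn's KMeans on engineered features.

```python
# Minimal k-means sketch (stdlib only) on hypothetical 2-D points:
# a "normal" blob near (0, 0) and an anomalous blob near (10, 10).
def kmeans(points, k, iters=10):
    # Naive init: take the first k points as starting centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                        (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean.
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Interleaved so each blob seeds one initial centroid.
pts = [(0.1, 0.2), (10.1, 9.9), (0.2, 0.1), (9.8, 10.2), (0.0, 0.0), (10.0, 10.0)]
centroids, clusters = kmeans(pts, k=2)
print([len(c) for c in clusters])  # → [3, 3]
```

In a fraud setting, points landing in a small, distant cluster (or far from every centroid) are the ones flagged for review.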
Find out more at software.intel.com/ai