Extracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general-purpose solutions that work out of the box don't exist yet, and Hadoop is no exception. And most companies lack Big Data specialists. The key to unlocking real value lies in mapping the business requirements smartly against the emerging and imperfect ecosystem of technology and vendor choices.
There is a long list of crucial questions to think about. How fast is the data flying at you? Are your Big Data analyses tightly integrated with existing systems, or parallel and complex? Can you tolerate a minute of latency? Can you live with occasional data loss or generous SLAs? Is imperfect security good enough?
The answer to Big Data ROI lies somewhere between the herd and nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes.
This talk will illustrate how Deutsche Telekom follows this segmentation approach to make sure every individual use case drives architecture design and technology selection.
Don't be Hadooped when looking for Big Data ROI
1. Capturing Big Value in Big Data – How Use Case Segmentation Drives Solution Design and Technology Selection at Deutsche Telekom
Jürgen Urbanski
Vice President Cloud & Big Data Architectures & Technologies, T-Systems
Cloud Leadership Team, Deutsche Telekom
Board Member, BITKOM Big Data & Analytics Working Group
2. Inserting Hadoop in your organization – value proposition by buying center / stakeholder
[Chart: potential value (lower to higher) plotted against time to value (shorter to longer), by stakeholder: IT Infrastructure, IT Applications, LOB, CXO. Value propositions, roughly in order of rising value and lengthening time to value: lower storage cost, better enterprise data warehouse, lower cost, better quality, lower product development cost, lower churn, lower fraud, faster customer acquisition, new business models, etc.]
3. Waves of adoption – crossing the chasm
Wave 1 – Batch Orientation: mainstream today, 70% of organizations. Example use cases: enterprise log file analysis, ETL offload, active archive, clickstream analytics. Response time: hour(s). Data characteristic: volume. Architectural characteristic: EDW / RDBMS talk to Hadoop.
Wave 2 – Interactive Orientation: early adopters, 20% of organizations. Example use cases: forensic analysis, analytic modeling, BI user focus. Response time: minutes. Architectural characteristic: analytic apps talk directly to Hadoop.
Wave 3 – Real-Time Orientation: bleeding edge, 10% of organizations. Example use cases: sensor analysis, "Twitterscraping", telematics, process optimization, fraud detection. Response time: seconds. Data characteristic: velocity. Architectural characteristic: derived data also stored in Hadoop.
4. Data warehouse and ETL offload are promising use cases with immediate ROI
Data Warehouse Offload
– Legacy data warehouses are costly, so often only one year of data can be kept
– Older data is stored but "dark": you cannot swim around in it and explore it
– With HDFS you can explore it (active archive)
– A "data refinery" for cases where the massively parallel processing (MPP) solution is saturated performance-wise
ETL Offload
– An ETL pipeline may have more than a dozen steps
– Many of them can be offloaded to a Hadoop cluster
Mainframe Offload
– May have potential
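To make the ETL offload concrete, here is a minimal sketch (not Deutsche Telekom's implementation) of one pipeline step rewritten as a map-only Hadoop job: parse raw log lines, drop malformed records, and write the cleansed fields back to HDFS. The semicolon-separated field layout and the class names are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// One ETL step offloaded to Hadoop: parse raw log lines, drop malformed
// records, and write the cleansed fields back to HDFS. Map-only, so no
// shuffle or reduce phase is needed.
public class LogCleanseJob {

    public static class CleanseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Hypothetical layout: timestamp;msisdn;event;payload
            String[] fields = line.toString().split(";");
            if (fields.length == 4 && !fields[0].isEmpty()) {
                ctx.write(NullWritable.get(),
                          new Text(fields[0] + "\t" + fields[2])); // keep timestamp + event
            }
            // Malformed lines are simply dropped (a counter could track them).
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-cleanse");
        job.setJarByClass(LogCleanseJob.class);
        job.setMapperClass(CleanseMapper.class);
        job.setNumReduceTasks(0); // map-only ETL step
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run with `hadoop jar <jar> LogCleanseJob <input> <output>`; each such job would replace one stage of the legacy ETL chain.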
5. Big Data is about new application landscapes
New apps taking advantage of Big Data
– Rapid app development
– Bridges back to legacy systems (wrapping with an API, or data integration via federation or data transport)
New data fabrics for a new IT
– More data, more sources, more types, in ONE place
– NoSQL databases
Fast data
– In real time, in context (what, when, who, where)
– Telemetry / sensor based (serving humans or machines, where you need to reason over data as it comes in, in real time)
These 3 areas need to come together in a platform:
– Cloud abstraction (so it can run on any private or public cloud, no lock-in)
– Automated deployment and monitoring (rolling upgrades, no patching)
– Various deployment form factors (on-premise as software, on-premise as appliance, in the cloud)
6. Example application landscape
– Real-time streams (social, sensors) feed Real-Time Processing (S4, Storm, Spark)
– Machine Learning (Mahout, etc.)
– Data Visualization (Excel, Tableau)
– ETL (Informatica, Talend, Spring Integration)
– Real-Time Database (GemFire, HBase, Cassandra)
– Interactive Analytics (Impala, Shark, Greenplum, AsterData, Netezza, …)
– Batch Processing (Map-Reduce), Hive
– Structured and Unstructured Data (HDFS, MapR)
– Cloud Infrastructure: Compute, Storage, Networking
Source: VMware
7. Reference architecture – high-level view
Layers: Presentation; Application; Data Processing; Data Management; Infrastructure.
Cross-cutting: Data Integration; Operations; Security.
8. Reference architecture – component view
– Presentation: Data Visualization and Reporting, Clients
– Application: Analytics Apps, Transactional Apps, Analytics Middleware
– Data Processing: Real Time/Stream Processing, Batch Processing, Search and Indexing
– Data Management: Distributed Storage (HDFS), Distributed Processing, Non-relational DB, Structured In-Memory, Metadata Services
– Infrastructure: Virtualization, Compute / Storage / Network
– Data Integration (cross-cutting): Real Time Ingestion, Batch Ingestion, Connectors, Workflow and Scheduling
– Operations (cross-cutting): Management and Monitoring
– Security (cross-cutting): Access Management, Data Isolation, Data Encryption
9. Questions to ask in designing a solution for a particular business use case
– What physical infrastructure best fits your needs?
– What are your data placement requirements (service provider data centers or on-premise, jurisdiction)?
Innovation: cheaper storage, but not just storage…
Illustrative acquisition cost:
– SAN Storage: 3-5 €/GB (based on HDS SAN Storage)
– NAS Filers: 1-3 €/GB (based on NetApp FAS-Series)
– Enterprise Class Hadoop Storage: ??? €/GB (based on NetApp E-Series (NOSH))
– White Box DAS¹: 0.50-1.00 €/GB (hardware can be self-assembled)
– Data Cloud¹: 0.10-0.30 €/GB (based on large-scale object storage interfaces)
1) Hadoop offers storage + compute (incl. search). Data Cloud offers Amazon S3 and native storage functions.
10. Questions to ask in designing a solution for a particular business use case
[Chart (source: NetApp): Hadoop cluster profiles positioned by compute power vs. storage capacity.]
– White Box Hadoop: values associated with early adopters of Hadoop; social media space; contributors to Apache; strong bias to JBOD; skeptical of ALL vendors.
– Enterprise Class Hadoop (compute / memory intensive cluster): compute-intensive applications; tick data analysis; extremely tight service level expectations; severe financial consequences if the analytic run is late.
– Enterprise Class Hadoop (packaged ready-to-deploy modular Hadoop cluster): the data has intrinsic value $$$; usable capacity must expand faster than compute; higher storage performance; real human consequences if the system fails (threats, treatments, financial losses); system has to allow for asymmetric growth.
– Enterprise Class Hadoop (bounded compute algorithm / memory intensive cluster): compute-intensive applications where additional CPUs do not improve run time; extremely tight service level expectations; severe financial consequences if the analytic run is late; need for deeper storage per datanode.
11. Questions to ask in designing a solution for a particular business use case
Do you run your Hadoop cluster bare-metal or virtualized? Most run bare-metal today, but virtualization helps with:
– Different failure domains
– Different hardware pools
– Development vs. production
Three big types of isolation are required for mixing workloads:
Resource Isolation (the greedy neighbor)
– Control the greedy neighbor
– Reserve resources to meet needs
Version Isolation (the reckless neighbor)
– Allow concurrent OS, app, and distro versions
– For instance, test/dev vs. production, high performance vs. low cost
Security Isolation (the nosy neighbor)
– Provide privacy between users/groups
– Runtime and data privacy required
Adapted from: VMware, see Apache Hadoop on vSphere http://www.vmware.com/de/hadoop/serengeti.html
12. Questions to ask in designing a solution for a particular business use case
Which distribution is right for your needs today vs. tomorrow? Which distribution will ensure you stay on the main path of open source innovation, vs. trap you in proprietary forks?
– Cloudera: widely adopted, mature distribution; GTM partners include Oracle, HP, Dell, IBM.
– Hortonworks: fully open source distribution (incl. management tools); reputation for cost-effective licensing; strong developer ecosystem momentum; GTM partners include Microsoft, Teradata, Informatica, Talend.
– MapR: more proprietary distribution, with features that appeal to some business-critical use cases; GTM partner AWS (M3 and M5 versions only).
– Pivotal HD: just announced by EMC, very early stage; differentiator is HAWQ – claims 600x query speed improvement, full SQL instruction set.
Note: Distributions include more than just the Data Management layer but are discussed at this point in the presentation. Not shown: Intel, Fujitsu and other distributions.
13. Questions to ask in designing a solution for a particular business use case
– What data sources could be of value (internal vs. external, people- vs. machine-generated)? Observe data privacy for people-generated data.
– How much data volume do you have (entry-barrier discussion), and of what type (structured, semi-structured, unstructured)?
– What are your data latency requirements (measured in minutes)?
Access paths: Hadoop APIs for Hadoop applications; NFS for file-based applications; REST APIs for internet access; ODBC (JDBC) for SQL-based applications.
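Of the four access paths above, the first – native Hadoop APIs – looks like this from application code. A minimal sketch using the standard Hadoop FileSystem API; the namenode address and the path are hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS round trip through the Hadoop FileSystem API:
// write a file, then read it back line by line.
public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; normally taken from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/landing/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite
            out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```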
14. Questions to ask in designing a solution for a particular business use case
– What type of analytics is required (machine learning, statistical analysis)?
– How fast do decisions need to be made (decision latency)?
– Is multi-stage data processing a requirement (before data gets stored)?
– Do you need stream computing and complex event processing (CEP)? If so, do you have strict time-based SLAs? Is data loss acceptable?
– How often does data get updated and queried (real time vs. batch)?
– How tightly coupled is your Hadoop data with existing relational data sets?
– Which non-relational DB suits your needs? HBase and Cassandra work natively on HDFS, while Couchbase and MongoDB work on copies of the data.
Stay focused on what is possible quickly.
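On the last question, a minimal sketch of what "working natively on HDFS" looks like from application code, using the HBase Java client (the newer Connection/Table API; the table name and column family are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Write one cell to an HBase table and read it back. HBase stores its
// data in HDFS, so no separate copy of the data is maintained.
public class HBaseQuickCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer_events"))) {

            Put put = new Put(Bytes.toBytes("row-0001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("event"),
                          Bytes.toBytes("login"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row-0001")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("event"));
            System.out.println("event = " + Bytes.toString(value));
        }
    }
}
```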
15. Innovations: store first, ask questions later; parallel processing (scale out)
– Legacy BI: business problem: backward-looking analysis, using data out of business applications. Selected vendors: SAP Business Objects, IBM Cognos, MicroStrategy. Data type/scalability: structured; limited (2-3 TB in RAM). [Legacy vendor definition of big data]
– High Performance BI: business problem: quasi-real-time analysis, using data out of business applications. Selected vendors: Oracle Exadata, SAP HANA. Data type/scalability: structured; limited (2-8 TB in RAM). [Legacy vendor definition of big data]
– "Hadoop" Ecosystem: business problem: forward-looking predictive analysis; questions defined in the moment, using data from many sources. Selected vendors: Hadoop distributions. Data type/scalability: structured or unstructured; unlimited (20-30 PB). ["True" big data]
16. Questions to ask in designing a solution for a particular business use case
– Is backup and recovery critical (number of copies in the HDFS cluster; see the sketch after this list)?
– Do you need disaster recovery on the raw data?
– How do you optimize TCO over the lifetime of a cluster?
– How do you ensure the cluster remains balanced and performing well as the underlying hardware pool becomes heterogeneous?
– What are the implications of a migration between different distributions, or between versions of one distribution? Can you do rolling upgrades to minimize disruption?
– What level of multi-tenancy do you implement? Even within the enterprise, one general-purpose Hadoop cluster might serve different legal entities / BUs.
– How do you bring along existing talent? E.g., train developers on Pig, database admins on Hive, IT operations on the platform.
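On the first question, the number of copies kept in HDFS is tunable per file as well as cluster-wide. A minimal sketch using the standard FileSystem API; the path and the choice of five replicas are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Inspect and change the HDFS replication factor (number of copies)
// for a single file; the cluster-wide default lives in hdfs-site.xml
// (dfs.replication, 3 by default).
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/critical/transactions.tsv");

        short current = fs.getFileStatus(file).getReplication();
        System.out.println("current replication: " + current);

        // Keep more copies of business-critical raw data.
        fs.setReplication(file, (short) 5);
    }
}
```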
18. Do you really need Hadoop?
– Is your data structured and less than 10 TB?
– Is your data structured, less than 100 TB, but tightly integrated with your existing data?
– Is your data structured, more than 100 TB, but processing has to occur in real time with less than a minute of latency?*
Then you could stay with legacy BI landscapes including RDBMS, MPP DB and EDW.
Otherwise: come and join us on a journey into Hadoop-based solutions!
* Hadoop is making rapid progress in the real-time arena
19. Use Hadoop for VOLUME (illustrative, not exhaustive)
– You require parallel / complex data processing power, and you can live with minutes or more of latency to derive reports
– You need data storage and indexing for analytic applications
Components: Platform; Data Transformation (MapReduce)
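As a minimal sketch of the pattern this slide names – batch report derivation over a full data set with MapReduce – the hypothetical job below counts events per type; the tab-separated input layout and the class names are assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Batch report over a full data set: count events per type.
// Assumes tab-separated input with the event type in the second column.
public class EventCountReport {

    public static class EventMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text eventType = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length >= 2) {
                eventType.set(fields[1]);
                ctx.write(eventType, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "event-count-report");
        job.setJarByClass(EventCountReport.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```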
20. Use Hadoop for VARIETY (illustrative, not exhaustive)
– Your data is multi-structured
– You want to derive reports in batch on full data sets
– You have complex data flows or multi-stage data pipelines (see the pipeline sketch below)
Components: Workflow Mgt.; Data Transformation (MapReduce); Data Visualization and Reporting; Low Latency Data Access*
* HBase and Cassandra work natively on HDFS, while Couchbase and MongoDB work on copies of the data
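For multi-stage pipelines, Hadoop ships a simple scheduler, JobControl, that runs jobs in dependency order. A minimal sketch of a hypothetical two-stage pipeline (mapper/reducer configuration elided); production setups would more likely use a dedicated workflow manager such as Oozie:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// A two-stage pipeline: a cleansing job feeds an aggregation job.
// Stage 2 only starts once stage 1 has completed successfully.
public class TwoStagePipeline {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path raw = new Path(args[0]);
        Path cleansed = new Path(args[1]);
        Path report = new Path(args[2]);

        Job stage1 = Job.getInstance(conf, "cleanse");
        FileInputFormat.addInputPath(stage1, raw);
        FileOutputFormat.setOutputPath(stage1, cleansed);
        // ... set mapper/reducer classes for stage 1 here ...

        Job stage2 = Job.getInstance(conf, "report");
        FileInputFormat.addInputPath(stage2, cleansed);
        FileOutputFormat.setOutputPath(stage2, report);
        // ... set mapper/reducer classes for stage 2 here ...

        ControlledJob c1 = new ControlledJob(stage1, null);
        ControlledJob c2 = new ControlledJob(stage2, null);
        c2.addDependingJob(c1); // express the data dependency

        JobControl pipeline = new JobControl("two-stage-pipeline");
        pipeline.addJob(c1);
        pipeline.addJob(c2);

        new Thread(pipeline).start();     // JobControl is a Runnable
        while (!pipeline.allFinished()) { // poll until both stages are done
            Thread.sleep(1000);
        }
        pipeline.stop();
        System.exit(pipeline.getFailedJobList().isEmpty() ? 0 : 1);
    }
}
```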
21. Use Hadoop for VELOCITY (illustrative, not exhaustive)
– You are inundated with a flood of real-time data: numerous live feeds from multiple data sources like machines, business systems or Internet sources
– You want to derive reports in (near) real time on a sample or on full data sets
Components: Data Ingestion (Apache Kafka); Data Visualization and Reporting; Fast Analytics* (Shark)
* May also use an MPP database
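A minimal sketch of the ingestion side using Kafka's Java producer client; the broker address, topic name and message layout are hypothetical. Downstream, a stream processor (Storm, Spark, …) consumes the topic:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Push sensor readings into a Kafka topic as they arrive.
public class SensorFeedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // trade latency for durability if data loss is unacceptable

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                String reading = System.currentTimeMillis() + ";cell-042;" + (20 + i);
                // Key by sensor id so readings from one sensor stay ordered.
                producer.send(new ProducerRecord<>("sensor-readings", "cell-042", reading));
            }
            producer.flush();
        }
    }
}
```

The "acks" setting is where the slide's data-loss question becomes a concrete design choice: acknowledgement from all replicas costs latency but protects against loss.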
22. Where to start inserting Hadoop in your company? A call to action…
For IT Infrastructure and IT Applications – accelerating implementation:
– Solution design driven by target use cases
– Reference architecture
– Technology selection and POC
– Implementation lessons learnt
For LOB and CXO – understanding Big Data:
– Definition
– Benefits over adjacent and legacy technologies
– Current mode vs. future mode for analytics
And assessing the economic potential:
– Target use cases by function and industry
– Best approach to adoption
Puddles, pools – AVOID: systems separated by workload type due to contention.
Lakes, oceans – GOAL: a platform that natively supports mixed workloads, offered as a shared service.
Editor's notes
Automated deployment and monitoring: the cloud infrastructure has to provide 10 "verbs" so that the apps don't have to know anything about the infrastructure. The philosophy is no patching and rolling upgrades; the platform constantly compares what the app needs with what the cloud provides.
Component glossary across the layers (Presentation, Application, Data Processing, Infrastructure, Data Ingestion, Security, Management & Monitoring):
– Ambari: Apache Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters. Hadoop clusters require many inter-related components that must be installed, configured, and managed across the entire cluster.
– ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper is utilized significantly by many distributed applications such as HBase.
– HBase: HBase is the distributed Hadoop database, scalable and able to collect and store big data volumes on HDFS. This class of database is often categorized as NoSQL (Not only SQL).
– Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
– Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
– HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop; this provides deep integration into enterprise data warehouses (e.g. Teradata) and with data integration tools such as Talend.
– MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes.
– HDFS: The Hadoop Distributed File System is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid parallel computations.
– Talend Open Studio for Big Data: 100% open source code generator with a graphical user interface used for Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT), for data movement and cleansing in and out of Hadoop.
– Data Integration Services: HDP integrates Talend Open Studio for Big Data, the leading open source data integration platform for Apache Hadoop. Included is a visual development environment and hundreds of pre-built connectors to leading applications that allow you to connect to any data source without writing code.
– Centralized Metadata Services: HDP includes HCatalog, a metadata and table management system that simplifies data sharing both between Hadoop applications running on the platform and between Hadoop and other enterprise data systems. HDP's open metadata infrastructure also enables deep integration with third-party tools.