The Evolution of Data Architecture

Wei-Chiu Chuang
2017.10 @ NCKU

Data Value Chain
[Figure labels: AI, Machine Learning, Data Science, Analytics, Big Data; Decision making, Insight, Automated decision making; Hype (?)]

Data is the new Oil
https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource

Fastest way to transmit 5MB of data in 1956

Fast forward 60 years… transmit 100PB of data in 2016

Once upon a time, processors doubled in speed every 18 months …
• That scaling, popularly credited to “Moore’s Law”, stalled about 10 years ago.
• Per-core CPU speed has barely improved since, and RAM and disk speeds have not kept pace either.

Processor speed has been stagnant
• But data is being generated at an ever-increasing rate.
• Hardware improvement cannot keep up with data generation.
• Multi-threaded and distributed systems are a must.

Distributed Systems are hard
Programmability
Scalability
Consistency
Availability
Partition Tolerance
Fault Tolerance

Big Data / Parallel Computing / Distributed Systems
[Figure labels: HPC, Big Data, Cloud, Distributed Systems]

Modern Data Architecture
How do you:
• transmit
• collect
• store
• compute
Petabyte+ storage on 1000+ compute nodes?
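To get a feel for why moving petabytes is hard, the arithmetic can be sketched in a few lines of Python. This is a back-of-envelope estimate, not a measurement: it assumes a decimal petabyte (10^15 bytes) and links running at full line rate with no protocol overhead, and the name `transfer_days` is made up for illustration.

```python
def transfer_days(num_bytes: float, link_gbps: float) -> float:
    """Days needed to push num_bytes through one link at full line rate."""
    bits = num_bytes * 8
    seconds = bits / (link_gbps * 1e9)
    return seconds / 86_400  # seconds per day


PETABYTE = 10**15  # assumption: decimal petabyte

# One machine pulling the whole petabyte through a single link.
for gbps in (1, 10):
    print(f"1 PB over one {gbps} Gbps link: {transfer_days(PETABYTE, gbps):.1f} days")

# The same petabyte split evenly across 1000 nodes, each on its own 1 Gbps link.
hours = transfer_days(PETABYTE / 1000, 1) * 24
print(f"1 PB across 1000 nodes at 1 Gbps each: {hours:.1f} hours")
```

Under these assumptions a single 10 Gbps link needs a little over nine days for 1 PB, while 1000 nodes each pulling their share over a 1 Gbps link finish in a couple of hours. That gap is the case for spreading storage and compute across many machines.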

Modern Data Center
[Figure: data center network tree. Labels: Internet, Core, AR, Aggr, ToR, Server1 to Server10 under each ToR; link speeds 10Gbps, 10Gbps, 1Gbps.]

GFS
• Master-slave architecture
• Separation of control plane and data plane
• Low-cost, commodity hardware
• Failures are the norm, rather than the exception
• Balance availability and network partition tolerance
[Figure: a GFS client sends control messages to the GFS Master (e.g. for /foo/bar) and exchanges data messages with the GFS chunkservers.]
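The separation of control plane and data plane can be illustrated with a toy model. This is a minimal sketch, not the GFS protocol: the classes `Master` and `ChunkServer`, the function `read_file`, the server names and the chunk handles are all hypothetical, and everything lives in one process. The point is only that the master hands out locations while the bytes come from the chunkservers, and that a dead replica is handled as a routine event.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple


@dataclass
class ChunkServer:
    """Data plane: holds the actual chunk bytes, keyed by chunk handle."""
    name: str
    chunks: Dict[str, bytes] = field(default_factory=dict)

    def read(self, handle: str) -> bytes:
        return self.chunks[handle]


@dataclass
class Master:
    """Control plane: holds metadata only, never file contents."""
    # file path -> ordered list of (chunk handle, names of replica servers)
    files: Dict[str, List[Tuple[str, List[str]]]] = field(default_factory=dict)

    def locate(self, path: str) -> List[Tuple[str, List[str]]]:
        return self.files[path]


def read_file(path: str, master: Master, servers: Dict[str, ChunkServer],
              down: FrozenSet[str] = frozenset()) -> bytes:
    """Ask the master where the chunks live, then fetch bytes from chunkservers."""
    data = b""
    for handle, replicas in master.locate(path):        # control message
        live = [name for name in replicas if name not in down]
        if not live:
            raise IOError(f"no live replica for chunk {handle}")
        data += servers[live[0]].read(handle)           # data message
    return data


# Two chunks, each replicated on two of three commodity servers.
servers = {name: ChunkServer(name) for name in ("cs1", "cs2", "cs3")}
servers["cs1"].chunks["c0"] = servers["cs2"].chunks["c0"] = b"hello, "
servers["cs2"].chunks["c1"] = servers["cs3"].chunks["c1"] = b"world"
master = Master({"/foo/bar": [("c0", ["cs1", "cs2"]), ("c1", ["cs2", "cs3"])]})

print(read_file("/foo/bar", master, servers))
# Losing cs1 is routine: the read falls back to the replica on cs2.
print(read_file("/foo/bar", master, servers, down=frozenset({"cs1"})))
```

Because the master only answers small metadata queries, file contents never pass through it, which keeps it from becoming the bandwidth bottleneck.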

MapReduce
• A very simple yet powerful distributed programming model
• Shared-nothing architecture
• Programmability
• Data locality:
  • ship compute to data, rather than shipping data to compute
• Fault tolerance:
  • Intermediate state is stored in storage.
  • Failed tasks can be restarted easily.
[Figure: the master assigns map and reduce tasks to workers. Map phase: workers read input files (Split 0, Split 1, Split 2) and write intermediate files. Reduce phase: workers read the intermediate files and write output files (Output 0, Output 1).]
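The programming model can be sketched with the classic word count. This is a single-process illustration under stated assumptions, not Hadoop's API: `run_mapreduce`, `map_words` and `reduce_counts` are hypothetical names, and the "intermediate files" are plain in-memory lists. Each split is mapped independently, which is what lets a real system run map tasks on different workers and rerun only the task that failed.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(split: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> int:
    """Reduce: combine all values emitted for one key."""
    return sum(counts)


def run_mapreduce(splits: Iterable[str],
                  mapper: Callable[[str], Iterable[Tuple[str, int]]],
                  reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    # Map phase: each split is processed on its own, sharing nothing.
    intermediate = [list(mapper(split)) for split in splits]

    # Shuffle: group intermediate values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in intermediate:
        for key, value in pairs:
            groups[key].append(value)

    # Reduce phase: one reducer call per key.
    return {key: reducer(key, values) for key, values in groups.items()}


splits = ["big data big", "data is the new oil", "big oil"]
counts = run_mapreduce(splits, map_words, reduce_counts)
print(counts)
```

The user writes only the two small functions; splitting, grouping and scheduling belong to the framework, which is the programmability point made above.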

Hadoop
• GFS and MapReduce inspired Hadoop
• Early development driven largely by Yahoo!
• Released in 2006.
• Used by most large enterprises
• Hadoop 3.0 beta 1!

Evolution of the Hadoop Platform
• The stack is continually evolving and growing!
Components added each year (every year's stack also includes all earlier components):
2006: Core Hadoop (HDFS, MapReduce)
2007: Solr, Pig
2008: HBase, ZooKeeper
2009: Hive, Mahout
2010: Sqoop, Avro
2011: Flume, Bigtop, Oozie, HCatalog, Hue, YARN
2012: Spark, Tez, Impala, Kafka, Drill
2013: Parquet, Sentry
2014: Knox, Flink
2015: Kudu, RecordService, Ibis, Falcon

Mix and match
Resource Management: YARN, Mesos, Kubernetes
Storage: HDFS, HBase, Kudu, S3, ADLS
Compute: MapReduce, Hive, Impala, Spark, Presto, Pig, Drill, Solr, Storm
Ingest: Kafka, Flume, Beam, Samza

Open source in infra & platform

Why open source?
• It’s free ($$$)
• No vendor lock-in.
• Faster development and faster adoption.
• A new approach to foster collaboration.
• Open source software is becoming the standard.

Sell open source software, really?
• Water is free, but bottled water is not.
• Cloudera sells the “bottle”
  • Cloudera’s Distribution of Hadoop.
  • The integration of software.
  • The support and services.
• The management software is proprietary. The OSS is free of charge.

Market for open source software?
[Chart: revenue (million USD) of Hortonworks, Cloudera and MongoDB for FY2015, FY2016, FY2017 and FY2018 (f); y-axis from 0 to 400.]

Open Source Business Model
• Dual licensing: MySQL
• Support + services: Red Hat, Hortonworks
• Open core: Java EE, Qt
• Software as a Service: Databricks, Amazon AWS, Microsoft Azure
• Advertising-supported: Google Chrome, Android
• Hybrid Open Source Software: Cloudera, Confluent, MongoDB

“Big Data” finds many applications across many industries
IT, Healthcare, Transportation, Retail, Utilities, Telecom, Public sector, Manufacturing

Applications and Use cases
• Realtime database for serving internet traffic
  • Internet services: Facebook Messenger, Twitter, Uber, Airbnb …
• Data analytics
  • Assist in the development of new drugs by analyzing millions of medical records
• Data science / Machine learning
  • Fraud detection
  • Anti-money laundering
  • Cybersecurity

Fraud Detection System using Hadoop

The Cloudera Platform for IoT – Data Mgmt. Value Chain
Data Sources → Data Ingest → Data Storage & Processing → Serving, Analytics & Machine Learning
Data sources: Connected Things / Data Sources, Other Data Sources
ENTERPRISE DATA HUB
• Apache Kafka: stream or batch ingestion of IoT data
• Apache Sqoop: ingestion of data from relational sources
• Apache Hadoop: storage (HDFS) & deep batch processing
• Apache Kudu: storage & serving for fast-changing data
• Apache HBase: NoSQL data store for real-time applications
• Apache Spark: stream & iterative processing, ML
• Apache Impala: MPP SQL for fast analytics
• Cloudera Search: real-time search
Security, Scalability & Easy Management
Deployment Flexibility: Datacenter, Cloud

IoT Use Case 1:
Predictive Maintenance

Predictive Maintenance on Thousands of Industrial Machines in Real Time
Challenge:
• Collect and analyze data from thousands of diverse manufacturing systems in real time
Solution:
• iTrak application using Cloudera in the cloud to monitor the performance of individual manufacturing systems in real time
• Predictive maintenance: proactively identifying and fixing issues before machines break
MANUFACTURING
» INDUSTRIAL IoT
» PREDICTIVE MAINTENANCE
» IMPROVED EFFICIENCIES
Industrial IoT – Predictive Maintenance
DATA-DRIVEN PROCESS
CASE STUDY
DATA-DRIVEN PRODUCTS

Use Case 2:
Connected Vehicles

Using Predictive Maintenance to Improve Performance and Reduce Fleet Downtime
Challenge:
• Monitor the health of 180,000+ trucks in real time in order to minimize downtime
Solution:
• OnCommand Connection collects telematics and geolocation data across thousands of trucks
• Identify and correct engine problems early, and increase fleet uptime
• Reduced maintenance costs to $.03 per mile from $.12-$.15 per mile
Connected Vehicles & Telematics
TRANSPORTATION
» PREDICTIVE MAINTENANCE
» TELEMETRY
» LOWER TCO

Use Case 3:
Smart Cities & Smart Infrastructure

Enabling the State of Kentucky to manage snow and ice events in real time
Challenge:
• Kentucky Transportation Cabinet (KYTC) oversees the state’s transportation system, which includes 27,000 miles of highways, 230 airports and heliports, and more than three million drivers.
• Needed a more efficient approach to inclement-weather road management
Solution:
• KYTC has built a real-time weather response system that incorporates real-time data from Waze, HERE, ESRI’s GeoEvent processor, and Automatic Vehicle Locations (providing sensor data from salt trucks).
• KYTC aggregates 15-20 million records every day and processes more than a million records per second.
Data Driven Dept. of Transportation
Source: http://www.routefifty.com/2016/09/data-drives-government/131821/
2016 Data Impact Award Winner: State of Kentucky Department of Transportation

Use Case 4:
Connected Healthcare

Improve Parkinson's Disease Monitoring and Treatment through IoT
Challenge:
• Collect and analyze data from wearables (more than 300 readings per second) from thousands of patients in real time
Solution:
• Cloudera on Intel architecture to detect patterns in patient data streaming from wearables
• Continuously monitor the patients and symptoms to understand the progression of the disease objectively
HEALTHCARE
» WEARABLES
» PREDICTIVE ANALYTICS
» IMPROVED CARE
Connected Healthcare

Building a Holistic Picture of the US
Securities Market From 50 Billion Daily
Events
• Saving $10-20M in operational
efficiencies annually
• 90-minute queries run in 10 seconds
• Supporting future market growth and a
dynamic regulatory environment.
CUSTOMER 360

Using Big Data to Help Consumers Save
Hundreds of Millions in Utility Bills
• Relevant insight into household energy
use improves energy consciousness
• 2.7+ TWh (terawatt-hours) saved to date
• Motivated consumers to save enough
energy to power every household in Salt
Lake City and St. Louis for a year
CUSTOMER 360
ENERGY & UTILITIES
» PRODUCT INNOVATION
» SERVICE IMPROVEMENT
» IOT

Saving Lives by Detecting Sepsis Early
Enough for Successful Treatment
• Builds a more complete picture of
patients, conditions, and trends
• Has already saved hundreds of lives
• Reduces hospital readmissions
• 2PB+ in multi-tenant environment
supporting 100s of clients
• Secure yet explorable
HEALTHCARE
» 360° CUSTOMER VIEW
» IMPROVED SERVICE

Improving Pediatric Care and Outcomes
• Quantifying effect of ambient noise on
children’s vital signs
• Identifying cancerous genome variants
in 20 minutes (vs. days before)
• Performing fewer CT scans and higher
quality surgeries
CUSTOMER 360
HEALTHCARE
» MACHINE LEARNING
» IOT
» 360° CUSTOMER VIEW

Government Revenue Service
Increasing Customer Convenience
• Provides view of the complete taxpayer
journey
• Creates ability to pre-populate tax
returns for increased ease of use
• Supports move to near-real-time
oversight of operations and faster
response
CUSTOMER 360
GOVERNMENT
» SERVICE IMPROVEMENT
» PROCESS IMPROVEMENT

Driving Growth and Innovation
• Combines 80+ years’ data spanning all
business units and 50 states
• Expedites holistic analysis and reports
by 500X
• Enables more accurate and detailed predictive models to customize offers, optimize pricing, and minimize risk
CUSTOMER 360
INSURANCE
» FRAUD DETECTION

Re-Platformed 1,600 Operational
Databases & Systems onto a Cloudera EDH
• Business & consumer data was spread
over a dozen different customer
databases
• One daily ETL job (processing 1 billion customer records) used to take 24 hours
• Increased data velocity by 15x (5 times the data in 1/3 of the time); the job now completes in 1 ½ hours
• BT now has access to the most up-to-
date and centralized data for all their
customers
CUSTOMER
360
TELECOMMUNICATIONS
» IMPROVED SERVICE
» PROCESS IMPROVEMENT
» IT COST REDUCTION

Future
• Hardware evolution:
  • Cloud
  • 40Gbps, 100Gbps networks
  • GPU, TPU
  • Flash disk
• Application-driven:
  • Machine learning, deep learning
  • Realtime data stream processing (IoT)

Future
How to scale by an order of
magnitude in 5 years?
We are here today
In 10 years?

Taiwan Data Engineering Association (台灣資料工程協會)

Taiwanese contributors to Apache
葉祐欣, 謝良奇, 蔡東邦, 陳恩平, 戴資力, 莊偉赳, 蔡嘉平

Apache Contributor Talent Development Competition (育才賽)

Takeaway
If you only remember 3 things from this talk:
1. Data is the new oil
2. Open source is the standard
3. Think big! Remember GFS: failures are the norm rather than the exception!

Thank you
jojochuang@gmail.com / weichiu@apache.org / weichiu@cloudera.com
