When working with structured, semi-structured, and unstructured data, there is often a tendency to force one tool - either Hadoop or a traditional DBMS - to do all the work. At Vertica, we've found that there are reasons to use Hadoop for some analytics projects and Vertica for others, and the magic lies in knowing when to use which tool and how the two can work together. Join us as we walk through customer use cases that pair Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.
3. Hadoop for Big Data Analytics
• Scalable
• Flexible
• Low cost to try out
• Strong community
• But…
–Batch-oriented jobs
–Less efficient storage
–“Programmer friendly” (improving)
HP Confidential
4. Survey of Big Data Tools
[Diagram: big data tool landscape - stats programs, Hadoop, and CEP engines]
5. Vertica Analytics RDBMS Platform
Real-time Big Data
SPEED SCALABILITY SIMPLICITY
• Relational DBMS with ACID
• Real-time analytic reporting with SQL
• 50–1000x faster than traditional DBs
• High scalability, elasticity and full parallelism
• Simple install/use with auto setup and tuning
• Industry standard x86 hardware
• Advanced in-database analytics
• Extensible analytics framework
6. We have a Lot in Common …
• Purpose-built from scratch for analytics
• Commodity hardware
• MPP infrastructure, scaling to 100s of nodes and multiple PBs
• Robust
• Diverse use cases with strong market traction
7. … And We Have Differences
• Interface
• Tool chain / ecosystem
• Storage management
• Run time optimization
• Automatic performance tuning
9. Column Store – Sort and Encode for Speed
Student_ID Name Gender Class Score Grade
1256678 Cappiello, Emilia F Sophomore 62 D
1254038 Dalal, Alana F Senior 92 A
1278858 Orner, Katy F Junior 76 C
1230807 Frigo, Avis M Senior 64 D
1210466 Stober, Saundra F Junior 90 A
1249290 Borba, Milagros F Freshman 96 A
1244262 Sosnowski, Hillary F Junior 68 D
1252490 Nibert, Emilia F Sophomore 59 F
1267170 Popovic, Tanisha F Freshman 95 A
1248100 Schreckengost, Max M Senior 76 C
1243483 Porcelli, Darren M Junior 67 D
1230382 Sinko, Erik M Freshman 91 A
1240224 Tarvin, Julio M Sophomore 85 B
1222781 Lessig, Elnora F Junior 63 D
1231806 Thon, Max M Sophomore 82 B
1246648 Trembley, Allyson F Junior 100 A
10. Column Store – Sort and Encode for Speed
Gender Class Grade Score Name Student_ID
F Sophomore D 62 Cappiello, Emilia 1256678
F Senior A 92 Dalal, Alana 1254038
F Junior C 76 Orner, Katy 1278858
M Senior D 64 Frigo, Avis 1230807
F Junior A 90 Stober, Saundra 1210466
F Freshman A 96 Borba, Milagros 1249290
F Junior D 68 Sosnowski, Hillary 1244262
F Sophomore F 59 Nibert, Emilia 1252490
F Freshman A 95 Popovic, Tanisha 1267170
M Senior C 76 Schreckengost, Max 1248100
M Junior D 67 Porcelli, Darren 1243483
M Freshman A 91 Sinko, Erik 1230382
M Sophomore B 85 Tarvin, Julio 1240224
F Junior D 63 Lessig, Elnora 1222781
M Sophomore B 82 Thon, Max 1231806
F Junior A 100 Trembley, Allyson 1246648
Columns used in predicates; correlated values “indexed” by preceding column values
11. Column Store – Sort and Encode for Speed
Gender Class Grade Score Name Student_ID
F Freshman A 95 Popovic, Tanisha 1267170
F Freshman A 96 Borba, Milagros 1249290
F Junior A 90 Stober, Saundra 1210466
F Junior A 100 Trembley, Allyson 1246648
F Junior C 76 Orner, Katy 1278858
F Junior D 63 Lessig, Elnora 1222781
F Junior D 68 Sosnowski, Hillary 1244262
F Senior A 92 Dalal, Alana 1254038
F Sophomore D 62 Cappiello, Emilia 1256678
F Sophomore F 59 Nibert, Emilia 1252490
M Freshman A 91 Sinko, Erik 1230382
M Junior D 67 Porcelli, Darren 1243483
M Sophomore B 82 Thon, Max 1231806
M Sophomore B 85 Tarvin, Julio 1240224
M Senior C 76 Schreckengost, Max 1248100
M Senior D 64 Frigo, Avis 1230807
Columns used in predicates; correlated values “indexed” by preceding column values
12. Column Store – Sort and Encode for Speed
Gender Class Grade Score Name Student_ID
F Freshman A 95 Popovic, Tanisha 1267170
F Freshman offset A offset 96 Borba, Milagros 1249290
F Junior A 90 Stober, Saundra 1210466
F Junior A 100 Trembley, Allyson 1246648
F Junior C 76 Orner, Katy 1278858
F Junior D 63 Lessig, Elnora 1222781
F Junior D 68 Sosnowski, Hillary 1244262
F Senior A 92 Dalal, Alana 1254038
F Sophomore D 62 Cappiello, Emilia 1256678
F Sophomore F 59 Nibert, Emilia 1252490
M Freshman A 91 Sinko, Erik 1230382
M Junior D 67 Porcelli, Darren 1243483
M Sophomore B 82 Thon, Max 1231806
M Sophomore B 85 Tarvin, Julio 1240224
M Senior C 76 Schreckengost, Max 1248100
M Senior D 64 Frigo, Avis 1230807
Example query: select avg( Score ) from example
where Class = ‘Junior’ and Gender = ‘F’ and Grade = ‘A’
The 1st I/O reads the entire Gender column; the matching row offsets then drive the 2nd I/O (Class), 3rd I/O (Grade) and 4th I/O (Score), so only the needed ranges of each later column are read.
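The sorted, column-oriented layout above can be sketched in a few lines of Python. The data comes from the slide; the `bisect`-based range lookup is a simplified stand-in for Vertica's actual offset-driven column reads, shown only to illustrate why sorting on predicate columns avoids a full scan.

```python
import bisect

# Rows from the slide, already sorted on (Gender, Class, Grade) as in the
# column store; only the columns the query touches are kept here.
rows = [
    ("F", "Freshman",  "A",  95), ("F", "Freshman",  "A",  96),
    ("F", "Junior",    "A",  90), ("F", "Junior",    "A", 100),
    ("F", "Junior",    "C",  76), ("F", "Junior",    "D",  63),
    ("F", "Junior",    "D",  68), ("F", "Senior",    "A",  92),
    ("F", "Sophomore", "D",  62), ("F", "Sophomore", "F",  59),
    ("M", "Freshman",  "A",  91), ("M", "Junior",    "D",  67),
    ("M", "Sophomore", "B",  82), ("M", "Sophomore", "B",  85),
    ("M", "Senior",    "C",  76), ("M", "Senior",    "D",  64),
]

# Because the data is sorted on the predicate columns, all matching rows
# are contiguous and can be found with two binary searches instead of a scan.
keys = [(g, c, gr) for g, c, gr, _ in rows]
lo = bisect.bisect_left(keys, ("F", "Junior", "A"))
hi = bisect.bisect_right(keys, ("F", "Junior", "A"))

# select avg( Score ) from example
#   where Class = 'Junior' and Gender = 'F' and Grade = 'A'
scores = [rows[i][3] for i in range(lo, hi)]
avg_score = sum(scores) / len(scores)
print(avg_score)  # 95.0 (scores 90 and 100)
```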
13. Column Store – Column Based Compression
Data set       Compression ratio
Clickstream    10:1
Audit          10:1
Trading         5:1
SNMP           20:1
Network Logs   60:1
Marketing      20:1
Consumer       30:1
CDR             8:1
[Chart: encoded vs. raw data size per data set, 0–100%]
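Run-length encoding is one of the column-based compression schemes that makes sorted, low-cardinality columns so compact. A minimal sketch (the column data is illustrative, not the slide's measured ratios):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

# A sorted, low-cardinality column collapses to a handful of runs.
gender = ["F"] * 10 + ["M"] * 6
encoded = rle_encode(gender)
print(encoded)                                  # [('F', 10), ('M', 6)]
print(len(gender), "entries ->", len(encoded), "runs")
```

Real ratios depend on sort order, cardinality, and the encoding chosen per column, which is why the measured results above range from 5:1 to 60:1.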
14. Query-Driven Data Segmentation and HA
[Diagram: segments 1…N distributed and replicated across nodes, with a client-facing network and a separate cluster network]
• RAID-like functionality within DB
• Smart K-safety
• Always-on loads & queries
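A hedged sketch of how K-safe segmentation can work in principle: each row hashes to a primary segment, and K buddy copies land on other nodes so the cluster survives K simultaneous failures. The ring placement and modulo "hash" below are illustrative only, not Vertica's actual algorithm.

```python
NODES = 4  # illustrative cluster size
K = 1      # K-safety: survive K simultaneous node failures

def placements(row_key: int, k: int = K, nodes: int = NODES) -> list:
    """Return the nodes holding a row's primary copy and its k buddy copies."""
    primary = row_key % nodes  # stand-in for a real segmentation hash
    # Buddy copies go to the next k nodes around the ring, so no node
    # holds two copies of the same row.
    return [(primary + i) % nodes for i in range(k + 1)]

# With K=1, every row lives on two distinct nodes, so losing any single
# node still leaves a full copy of the data available for queries.
print(placements(1256678))  # [2, 3]
```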
15. Automatic Performance Tuning
• Optimal data layout (physical schema) → optimal performance
• User provides
–Logical schema
–Sample data set
–Typical queries
• Database Designer generates data layout proposals which:
–Optimize query performance
–Optimize data loading throughput
–Minimize storage footprint
• Workload Analyzer
16. Database Designer Case Studies
• Financial Services (vs manual design)
–Queries 4x faster
–Storage: 50% less
–Design cost: 4 minutes vs months
• Marketing & advertising
–All queries fully optimized; storage 10% of raw data
• Retail (vs manual design)
–Queries 2x faster; storage 33% less
• News media (vs manual design)
–Queries comparable; storage 25% less
20. Combining the Strengths
• Hadoop for exploratory analysis
–Especially with existing MR, Pig scripts
• Vertica for stylized, interactive analysis
–For shared features, often faster than Hadoop with a fraction of the hardware resources
• Vertica’s Hadoop connector
21. Hadoop + Vertica Use Case Example
Extract → Transform → Load
[Diagram: data flows from other sources into Hadoop via the Flume and SQOOP connectors, lands in HDFS, and moves into HP Vertica through the Vertica Hadoop Connector]
22. More Joint Use Cases
• Parallel import /export to HDFS
• MR for data transformation, Vertica for optimized storage & retrieval
–Apache log parsing
–Convert JSON into relational tuples
–Sentiment analysis
• Advanced analytics
–Filter, join and aggregation in Vertica
–Intermediate result fed into an MR job
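The “convert JSON into relational tuples” step above can be sketched as a simple MR-style mapper. The field names (`user_id`, `event`, `ts`) are hypothetical placeholders for whatever the log records actually contain:

```python
import json

def to_tuple(line: str) -> tuple:
    """Flatten one JSON log record into a tuple ready for bulk load."""
    rec = json.loads(line)
    # Field names are illustrative; a real mapper would match the log schema.
    return (rec["user_id"], rec["event"], rec["ts"])

lines = [
    '{"user_id": 42, "event": "click", "ts": "2012-05-01T10:00:00"}',
    '{"user_id": 7,  "event": "view",  "ts": "2012-05-01T10:00:03"}',
]
tuples = [to_tuple(line) for line in lines]
print(tuples[0])  # (42, 'click', '2012-05-01T10:00:00')
```

In the joint pattern above, this flattening runs as an MR job, and the resulting tuples are bulk-loaded into Vertica for optimized storage and retrieval.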
23. Vertica Extensible Analytics SDK
• A framework for user-defined Functions and Transformations
–C++ based extensible framework
–Flexible: express a wide range of analytic computation
–In-process, fully parallel execution
24. Vertica Community Edition
• Join the community: http://www.vertica.com/community
–Fully featured, 1 TB + 3 nodes (unlimited academic use)
–Open source analytic packages on github
25. HP Hadoop Reference Architecture
End-to-End Scalable Information Management Solution
[Diagram: layered architecture with numbered callouts]
1. HP ProLiant and HP Networking - scale-out ProLiant x86 hardware with large amounts of DAS storage to store and process data, running an operating system such as RedHat, SUSE or CentOS
2. Hadoop Core - Cloudera Distribution of Apache Hadoop (Cloudera Enterprise): execution engine and distributed file system to run massively parallel processing tasks (Map/Reduce and HDFS)
3. Analytics tooling (BI/Tableau, Datameer, Karmasphere) to enable users to create and run analytics jobs on unstructured data
4. HP CMU systems management - real-time monitoring and push-button scale-out
5. Connectors (Cloudera Flume, Cloudera SQOOP, Vertica Hadoop Connector) to move subsets of data in and out of Hadoop and structured Big Data sources (RDBMS, SAP, logs, etc.) into HP Vertica
26. HP Hadoop Reference Architecture: Basic Concepts
Starter Kit (Development/POC, non-production environment)
• 6 nodes (2 mgmt, 4 worker), 1 switch
• Optimized for low cost
• Configurations generally not fully redundant (single NW/switch)
• Same hardware as production cluster

Scaled Deployment (modest-scale production; typical configurations are up to two racks)
• Add redundant network/switch (second switch)
• Move management nodes (Management Node, Job Tracker Node, Name Node, Secondary Name Node) to separate racks
• Optimized for scale and resiliency
• Same hardware as starter kit

At Larger Scale - “Hundreds of nodes” (typical configurations are beyond two racks)
• Upgrade switches (better congestion management)
• Add additional management nodes
• Separate name nodes (become very busy, need lots of memory)
27. Visualizing Cluster/Hadoop Performance: Basic Concepts
Key system statistics per node:

Node     CPU   Disk Reads   # Map Tasks
Node 1   75%   300          8
…
Node N   65%   315          7

Each node’s statistics (CPU, Disk Reads, Map/Reduce Tasks) are displayed as 2-dimensional gauges and visualized as “tubes”, where the z-axis is time.
28. Visualizing Cluster/Hadoop: Normal Run – No Problems
100 Nodes - TeraSort processing on Hadoop
[Annotated timeline:]
• Data Read - 7823 tasks (1 per block)
• Processing (CPU intensive) - 700 tasks (2 per core), many short-lived tasks
• Sort
• Shuffle (move intermediate results between nodes) - long-lived tasks
• Write Results (1 copy only)
29. Visualizing Cluster/Hadoop: With Network Problems
100 Nodes - TeraSort processing on Hadoop
• A failing switch caused many network retries
• Some tasks take a long time to finish (speculation kicks in)
• Job stalls at 90% waiting for remaining tasks
30. In Closing…
• Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges.
• Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data.
• Getting them to work together is easy.
31. Conclusion
• Join the community: http://www.vertica.com/community
• Join the core team: http://www.vertica.com/about/careers/