When working with structured, semi-structured, and unstructured data, there is often a tendency to force one tool - either Hadoop or a traditional DBMS - to do all the work. At Vertica, we've found that there are reasons to use Hadoop for some analytics projects and Vertica for others, and the magic lies in knowing when to use which tool and how the two can work together. Join us as we walk through customer use cases that pair Hadoop with a purpose-built analytics platform for an effective, combined analytics solution.
3. Hadoop for Big Data Analytics
• Scalable
• Flexible
• Low cost to try out
• Strong community
• But…
–Batch-oriented jobs
–Less efficient storage
–“Programmer friendly” (improving)
HP Confidential
4. Survey of Big Data Tools
[Diagram: big data tool landscape - stats programs, Hadoop, and CEP engines]
5. Vertica Analytics RDBMS Platform
Real-time Big Data
SPEED SCALABILITY SIMPLICITY
• Relational DBMS with ACID
• Real-time analytic reporting with SQL
• 50–1000x faster than traditional DBs
• High scalability, elasticity and full parallelism
• Simple install/use with auto setup and tuning
• Industry standard x86 hardware
• Advanced in-database analytics
• Extensible analytics framework
6. We have a Lot in Common …
• Purpose-built from scratch for analytics
• Commodity hardware
• MPP infrastructure, scaling to 100s of nodes and multiple PBs
• Robust
• Diverse use cases with strong market traction
7. … And We Have Differences
• Interface
• Tool chain / ecosystem
• Storage management
• Run time optimization
• Automatic performance tuning
9. Column Store – Sort and Encode for Speed
Student_ID Name Gender Class Score Grade
1256678 Cappiello, Emilia F Sophomore 62 D
1254038 Dalal, Alana F Senior 92 A
1278858 Orner, Katy F Junior 76 C
1230807 Frigo, Avis M Senior 64 D
1210466 Stober, Saundra F Junior 90 A
1249290 Borba, Milagros F Freshman 96 A
1244262 Sosnowski, Hillary F Junior 68 D
1252490 Nibert, Emilia F Sophomore 59 F
1267170 Popovic, Tanisha F Freshman 95 A
1248100 Schreckengost, Max M Senior 76 C
1243483 Porcelli, Darren M Junior 67 D
1230382 Sinko, Erik M Freshman 91 A
1240224 Tarvin, Julio M Sophomore 85 B
1222781 Lessig, Elnora F Junior 63 D
1231806 Thon, Max M Sophomore 82 B
1246648 Trembley, Allyson F Junior 100 A
10. Column Store – Sort and Encode for Speed
Gender Class Grade Score Name Student_ID
F Sophomore D 62 Cappiello, Emilia 1256678
F Senior A 92 Dalal, Alana 1254038
F Junior C 76 Orner, Katy 1278858
M Senior D 64 Frigo, Avis 1230807
F Junior A 90 Stober, Saundra 1210466
F Freshman A 96 Borba, Milagros 1249290
F Junior D 68 Sosnowski, Hillary 1244262
F Sophomore F 59 Nibert, Emilia 1252490
F Freshman A 95 Popovic, Tanisha 1267170
M Senior C 76 Schreckengost, Max 1248100
M Junior D 67 Porcelli, Darren 1243483
M Freshman A 91 Sinko, Erik 1230382
M Sophomore B 85 Tarvin, Julio 1240224
F Junior D 63 Lessig, Elnora 1222781
M Sophomore B 82 Thon, Max 1231806
F Junior A 100 Trembley, Allyson 1246648
Columns used in predicates; correlated values “indexed” by preceding column values
11. Column Store – Sort and Encode for Speed
Gender Class Grade Score Name Student_ID
F Freshman A 95 Popovic, Tanisha 1267170
F Freshman A 96 Borba, Milagros 1249290
F Junior A 90 Stober, Saundra 1210466
F Junior A 100 Trembley, Allyson 1246648
F Junior C 76 Orner, Katy 1278858
F Junior D 63 Lessig, Elnora 1222781
F Junior D 68 Sosnowski, Hillary 1244262
F Senior A 92 Dalal, Alana 1254038
F Sophomore D 62 Cappiello, Emilia 1256678
F Sophomore F 59 Nibert, Emilia 1252490
M Freshman A 91 Sinko, Erik 1230382
M Junior D 67 Porcelli, Darren 1243483
M Sophomore B 82 Thon, Max 1231806
M Sophomore B 85 Tarvin, Julio 1240224
M Senior C 76 Schreckengost, Max 1248100
M Senior D 64 Frigo, Avis 1230807
Columns used in predicates; correlated values “indexed” by preceding column values
12. Column Store – Sort and Encode for Speed
Gender Class Grade Score Name Student_ID
F Freshman A 95 Popovic, Tanisha 1267170
F Freshman offset A offset 96 Borba, Milagros 1249290
F Junior A 90 Stober, Saundra 1210466
F Junior A 100 Trembley, Allyson 1246648
F Junior C 76 Orner, Katy 1278858
F Junior D 63 Lessig, Elnora 1222781
F Junior D 68 Sosnowski, Hillary 1244262
F Senior A 92 Dalal, Alana 1254038
F Sophomore D 62 Cappiello, Emilia 1256678
F Sophomore F 59 Nibert, Emilia 1252490
M Freshman A 91 Sinko, Erik 1230382
M Junior D 67 Porcelli, Darren 1243483
M Sophomore B 82 Thon, Max 1231806
M Sophomore B 85 Tarvin, Julio 1240224
M Senior C 76 Schreckengost, Max 1248100
M Senior D 64 Frigo, Avis 1230807
Example query: select avg( Score ) from example
where Class = ‘Junior’ and Gender = ‘F’ and Grade = ‘A’
The 1st I/O reads the entire Gender column; the matching row offsets then drive the 2nd I/O (Class), 3rd I/O (Grade) and 4th I/O (Score), so only the needed ranges of each later column are read.
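The sorted, column-oriented layout above can be sketched in a few lines of Python. The data comes from the slide; the `bisect`-based range lookup is a simplified stand-in for Vertica's actual offset-driven column reads, shown only to illustrate why sorting on predicate columns avoids a full scan.

```python
import bisect

# Rows from the slide, already sorted on (Gender, Class, Grade) as in the
# column store; only the columns the query touches are kept here.
rows = [
    ("F", "Freshman",  "A",  95), ("F", "Freshman",  "A",  96),
    ("F", "Junior",    "A",  90), ("F", "Junior",    "A", 100),
    ("F", "Junior",    "C",  76), ("F", "Junior",    "D",  63),
    ("F", "Junior",    "D",  68), ("F", "Senior",    "A",  92),
    ("F", "Sophomore", "D",  62), ("F", "Sophomore", "F",  59),
    ("M", "Freshman",  "A",  91), ("M", "Junior",    "D",  67),
    ("M", "Sophomore", "B",  82), ("M", "Sophomore", "B",  85),
    ("M", "Senior",    "C",  76), ("M", "Senior",    "D",  64),
]

# Because the data is sorted on the predicate columns, all matching rows
# are contiguous and can be found with two binary searches instead of a scan.
keys = [(g, c, gr) for g, c, gr, _ in rows]
lo = bisect.bisect_left(keys, ("F", "Junior", "A"))
hi = bisect.bisect_right(keys, ("F", "Junior", "A"))

# select avg( Score ) from example
#   where Class = 'Junior' and Gender = 'F' and Grade = 'A'
scores = [rows[i][3] for i in range(lo, hi)]
avg_score = sum(scores) / len(scores)
print(avg_score)  # 95.0 (scores 90 and 100)
```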
13. Column Store – Column Based Compression
Data set       Compression ratio
Clickstream    10:1
Audit          10:1
Trading         5:1
SNMP           20:1
Network Logs   60:1
Marketing      20:1
Consumer       30:1
CDR             8:1
[Chart: encoded vs. raw data size per data set, 0–100%]
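Run-length encoding is one of the column-based compression schemes that makes sorted, low-cardinality columns so compact. A minimal sketch (the column data is illustrative, not the slide's measured ratios):

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

# A sorted, low-cardinality column collapses to a handful of runs.
gender = ["F"] * 10 + ["M"] * 6
encoded = rle_encode(gender)
print(encoded)                                  # [('F', 10), ('M', 6)]
print(len(gender), "entries ->", len(encoded), "runs")
```

Real ratios depend on sort order, cardinality, and the encoding chosen per column, which is why the measured results above range from 5:1 to 60:1.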
14. Query-Driven Data Segmentation and HA
[Diagram: segments 1…N distributed and replicated across nodes, with a client-facing network and a separate cluster network]
• RAID-like functionality within DB
• Smart K-safety
• Always-on loads & queries
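A hedged sketch of how K-safe segmentation can work in principle: each row hashes to a primary segment, and K buddy copies land on other nodes so the cluster survives K simultaneous failures. The ring placement and modulo "hash" below are illustrative only, not Vertica's actual algorithm.

```python
NODES = 4  # illustrative cluster size
K = 1      # K-safety: survive K simultaneous node failures

def placements(row_key: int, k: int = K, nodes: int = NODES) -> list:
    """Return the nodes holding a row's primary copy and its k buddy copies."""
    primary = row_key % nodes  # stand-in for a real segmentation hash
    # Buddy copies go to the next k nodes around the ring, so no node
    # holds two copies of the same row.
    return [(primary + i) % nodes for i in range(k + 1)]

# With K=1, every row lives on two distinct nodes, so losing any single
# node still leaves a full copy of the data available for queries.
print(placements(1256678))  # [2, 3]
```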
15. Automatic Performance Tuning
• Optimal data layout (physical schema) → optimal performance
• User provides
–Logical schema
–Sample data set
–Typical queries
• Database Designer generates data layout proposals which:
–Optimize query performance
–Optimize data loading throughput
–Minimize storage footprint
• Workload Analyzer
16. Database Designer Case Studies
• Financial Services (vs manual design)
–Queries 4x faster
–Storage: 50% less
–Design cost: 4 minutes vs months
• Marketing & advertising
–All queries fully optimized; storage 10% of raw data
• Retail (vs manual design)
–Queries 2x faster; storage 33% less
• News media (vs manual design)
–Queries comparable; storage 25% less
20. Combining the Strengths
• Hadoop for exploratory analysis
–Especially with existing MR, Pig scripts
• Vertica for stylized, interactive analysis
–For shared features, often faster than Hadoop with a fraction of the hardware resources
• Vertica’s Hadoop connector
21. Hadoop + Vertica Use Case Example
Extract → Transform → Load
[Diagram: data flows from other sources into Hadoop via the Flume and SQOOP connectors, lands in HDFS, and moves into HP Vertica through the Vertica Hadoop Connector]
22. More Joint Use Cases
• Parallel import /export to HDFS
• MR for data transformation, Vertica for optimized storage & retrieval
–Apache log parsing
–Convert JSON into relational tuples
–Sentiment analysis
• Advanced analytics
–Filter, join and aggregation in Vertica
–Intermediate result fed into an MR job
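The “convert JSON into relational tuples” step above can be sketched as a simple MR-style mapper. The field names (`user_id`, `event`, `ts`) are hypothetical placeholders for whatever the log records actually contain:

```python
import json

def to_tuple(line: str) -> tuple:
    """Flatten one JSON log record into a tuple ready for bulk load."""
    rec = json.loads(line)
    # Field names are illustrative; a real mapper would match the log schema.
    return (rec["user_id"], rec["event"], rec["ts"])

lines = [
    '{"user_id": 42, "event": "click", "ts": "2012-05-01T10:00:00"}',
    '{"user_id": 7,  "event": "view",  "ts": "2012-05-01T10:00:03"}',
]
tuples = [to_tuple(line) for line in lines]
print(tuples[0])  # (42, 'click', '2012-05-01T10:00:00')
```

In the joint pattern above, this flattening runs as an MR job, and the resulting tuples are bulk-loaded into Vertica for optimized storage and retrieval.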
23. Vertica Extensible Analytics SDK
• A framework for user-defined Functions and Transformations
–C++ based extensible framework
–Flexible: express a wide range of analytic computation
–In-process, fully parallel execution
24. Vertica Community Edition
• Join the community: http://www.vertica.com/community
–Fully featured, 1 TB + 3 nodes (unlimited academic use)
–Open source analytic packages on github
25. HP Hadoop Reference Architecture
End-to-End Scalable Information Management Solution
[Diagram: layered architecture with numbered callouts]
1. HP ProLiant and HP Networking - scale-out ProLiant x86 hardware with large amounts of DAS storage to store and process data, running an operating system such as RedHat, SUSE or CentOS
2. Hadoop Core - Cloudera Distribution of Apache Hadoop (Cloudera Enterprise): execution engine and distributed file system to run massively parallel processing tasks (Map/Reduce and HDFS)
3. Analytics tooling (BI/Tableau, Datameer, Karmasphere) to enable users to create and run analytics jobs on unstructured data
4. HP CMU systems management - real-time monitoring and push-button scale-out
5. Connectors (Cloudera Flume, Cloudera SQOOP, Vertica Hadoop Connector) to move subsets of data in and out of Hadoop and structured Big Data sources (RDBMS, SAP, logs, etc.) into HP Vertica
26. HP Hadoop Reference Architecture: Basic Concepts
Starter Kit (Development/POC, non-production environment)
• 6 nodes (2 mgmt, 4 worker), 1 switch
• Optimized for low cost
• Configurations generally not fully redundant (single NW/switch)
• Same hardware as production cluster

Scaled Deployment (modest-scale production; typical configurations are up to two racks)
• Add redundant network/switch (second switch)
• Move management nodes (Management Node, Job Tracker Node, Name Node, Secondary Name Node) to separate racks
• Optimized for scale and resiliency
• Same hardware as starter kit

At Larger Scale - “Hundreds of nodes” (typical configurations are beyond two racks)
• Upgrade switches (better congestion management)
• Add additional management nodes
• Separate name nodes (become very busy, need lots of memory)
27. Visualizing Cluster/Hadoop Performance: Basic Concepts
Key system statistics per node:

Node     CPU   Disk Reads   # Map Tasks
Node 1   75%   300          8
…
Node N   65%   315          7

Each node’s statistics (CPU, Disk Reads, Map/Reduce Tasks) are displayed as 2-dimensional gauges and visualized as “tubes”, where the z-axis is time.
28. Visualizing Cluster/Hadoop: Normal Run – No Problems
100 Nodes - TeraSort processing on Hadoop
[Annotated timeline:]
• Data Read - 7823 tasks (1 per block)
• Processing (CPU intensive) - 700 tasks (2 per core), many short-lived tasks
• Sort
• Shuffle (move intermediate results between nodes) - long-lived tasks
• Write Results (1 copy only)
29. Visualizing Cluster/Hadoop: With Network Problems
100 Nodes - TeraSort processing on Hadoop
• A failing switch caused many network retries
• Some tasks take a long time to finish (speculation kicks in)
• Job stalls at 90% waiting for remaining tasks
30. In Closing…
• Solutions leveraging Vertica in conjunction with Hadoop are capable of solving a tremendous range of analytical challenges.
• Hadoop is great for dealing with unstructured data, while Vertica is a superior platform for working with structured/relational data.
• Getting them to work together is easy.
31. Conclusion
• Join the community: http://www.vertica.com/community
• Join the core team: http://www.vertica.com/about/careers/