SQL In/On/Around Hadoop
1. SQL In/On/Around Hadoop
Hadoop Summit, 2015
Chris Twogood, Vice President Product and Services Marketing
Fawad Qureshi, Principal Consultant, Big Data
2. Over 12 SQL Interfaces for Hadoop
© 2014 Teradata
• Apache Drill
• Apache Hive
• Apache Phoenix
• Apache Spark SQL
• Apache Tajo
• Cloudera Impala
• IBM BigSQL
• Oracle Big Data SQL
• Pivotal Hawq
• Presto
• Splice Machine
• SQLstream
• Teradata QueryGrid
Source: Gartner Market Guide for Hadoop Distributions, 06 January 2015
3. Query Processing on Hadoop
(Diagram: five approaches, each showing how Hadoop/HDFS, Map Reduce, a query engine, an RDBMS, or a data virtualization layer fit together)
• Raw Map Reduce
• RDBMS On Top Of Hadoop
• Query Engine Using HDFS Files
• RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive
• Virtualization Layer Over All Data Sources
4. Query Processing on Hadoop – Raw Map Reduce
(Diagram: Hadoop with Map Reduce; RDBMS)
• Native Map Reduce processing
• Direct commands to Hadoop and HDFS
• "Data Manipulation" more than "query processing"
• Programming and Map Reduce skills required
• Batch processing focused
• Full flexibility to operate on any data in HDFS
5. Query Processing on Hadoop – RDBMS On Top of Hadoop
(Diagram: RDBMS on HDFS)
• RDBMS on Hadoop cluster
• Proprietary data dictionary/metadata
• Proprietary data format within HDFS files
• Data types may be limited
• SQL query engine
• SQL language, but standards compatibility varies
• Query engine maturity varies
• Data not portable; cannot be read by other systems/engines
• Example: Pivotal HAWQ
6. Query Processing on Hadoop – Query Engine Using HDFS Files
(Diagram: query engine on HDFS)
• SQL query engine on Hadoop cluster
• Standard data dictionary/metadata (e.g., Hive)
• Standard data format within HDFS files (e.g., ORC files)
• Data types may be limited
• SQL query engine
• SQL language, but standards compatibility varies
• Query engine maturity varies
• Data "portable" and can be read by other systems/engines
• Examples: IBM Big SQL, Cloudera Impala
7. Query Processing on Hadoop – RDBMS Orchestrating Queries With Remote Access to Hadoop/Hive
(Diagram: Hadoop; RDBMS)
• External RDBMS sends (part of) queries to engine on Hadoop
• Standard data dictionary/metadata within Hadoop cluster (e.g., Hive)
• Standard data format within HDFS files (e.g., ORC)
• Data types may be limited by engine on Hadoop and external RDBMS
• SQL query engine capabilities are a combination of external and internal Hadoop engines
• Combines data and analytics in two systems
• SQL language, standards compatibility generally high
• Query engine generally mature
• Data in Hadoop "portable" and can be read by other systems/engines
• Example: Teradata QueryGrid
8. Query Processing on Hadoop – Virtualization Layer Over All Data Sources
(Diagram: data virtualization layer over Hadoop and RDBMS)
• External virtualization software sends (part of) queries to engine on Hadoop
• Standard data dictionary/metadata within Hadoop cluster (e.g., Hive)
• Standard data format within HDFS files (e.g., ORC)
• Data types may be limited by engine on Hadoop and external virtualization software
• SQL query engine capabilities are a combination of external and Hadoop engines and virtualization layer limitations
• Combines data and analytics in two systems
• Extra layer and/or data movement
• SQL language, standards compatibility generally high
• Query engine maturity and utilization of engines varies
• Data in Hadoop "portable", can be read by other engines
• Example: Cisco Data Virtualization Platform (formerly Composite Software)
9. Shift from a Single Platform to an Ecosystem
"We will abandon the old models based on the desire to implement for high-value analytic applications."
"Logical" Data Warehouse
10. Data Fabric Vision Enabled by QueryGrid
Analytic Flexibility to meet your business needs
• Pick Your Best-of-Breed Technology:
– Data types
– Analytic engines
– Economic options
– File systems
– Operating systems
• With Different Characteristics:
– CPU centric
– I/O centric
– Data volume centric
– Workload characteristics and volume
– Availability/DR
– Service Level Agreements
Users direct their queries to a single cohesive data fabric
Focus on data and business questions, not integrating separate systems
11. Customer Value Based on Social Influence
Use Case
(Diagram: Hadoop; Teradata Aster Database; Teradata Database)
• Determine high value customers based on history
• Determine customer value based on social influence
• Determine customer sentiment
• Determine customer sphere of influence
12. Teradata QueryGrid™
• Automated and optimized work distribution through "push-down" processing across platforms
– Minimize data movement, process data where it resides
– Minimize data duplication
– Transparently automate analytic processing and data movement between systems
– Bi-directional data movement
• Run the right analytic on the right platform
– Take advantage of specialized processing engines while operating as a cohesive analytic environment
• Integrated processing; within and outside the UDA
• Easy access to data and analytics through existing SQL skills and tools
Optimize, simplify, and orchestrate processing across and beyond the Teradata UDA
13. Teradata Database 15 – Teradata QueryGrid
Leverage analytic resources, reduce data movement
(Diagram: Integrated Data Warehouse on Teradata Database; Data Platform on Hadoop)
• Parallel bi-directional data transfer
• Push-down processing
• Native analytics on target system
• Easy configuration of server connections
• Simplified server grammar
• Adaptive Optimizer
14. Deep History – QueryGrid Teradata 15.00
Use Case
SELECT Trans.Trans_ID
,Trans.Trans_Amount
FROM TD_Transactions Trans
WHERE Trans_Amount > 5000
UNION
SELECT *
FROM FOREIGN TABLE
(SELECT Trans_ID
,Trans_Amount
FROM Transaction_Hist
WHERE Trans_Amount > 5000)@Hadoop Hist;
(Diagram: Hadoop holds years 5-10 of history; Teradata Database holds years 1-5)
– Push "Foreign Table" Select to Hive to execute the query
– Provides import to Teradata of just the required columns.
– Allows predicate processing of conditions on non-partitioned columns.
– The Hadoop cluster resources are used for data qualification.
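The effect of pushing the foreign-table SELECT down to the remote system can be sketched in Python. This is an illustration under stated assumptions, not QueryGrid itself: the two lists stand in for the Teradata and Hadoop tables, the extra Store column is invented to show projection, and select_where and union are hypothetical helper names.

```python
# Hypothetical in-memory stand-ins for the two systems on the slide.
TD_TRANSACTIONS = [  # recent rows held in the warehouse
    {"Trans_ID": 1, "Trans_Amount": 7500, "Store": "A"},
    {"Trans_ID": 2, "Trans_Amount": 1200, "Store": "B"},
]
TRANSACTION_HIST = [  # older rows held on the Hadoop side
    {"Trans_ID": 90, "Trans_Amount": 9000, "Store": "A"},
    {"Trans_ID": 91, "Trans_Amount": 300, "Store": "C"},
    {"Trans_ID": 92, "Trans_Amount": 6100, "Store": "B"},
]


def select_where(table, columns, predicate):
    """Filter rows and project columns where the data lives."""
    return [tuple(row[c] for c in columns) for row in table if predicate(row)]


def union(left, right):
    """SQL UNION semantics: combine both inputs and drop duplicate rows."""
    return sorted(set(left) | set(right))


columns = ("Trans_ID", "Trans_Amount")


def over_5000(row):
    return row["Trans_Amount"] > 5000


local_rows = select_where(TD_TRANSACTIONS, columns, over_5000)
# Only qualifying rows and the two required columns leave the remote side.
shipped_rows = select_where(TRANSACTION_HIST, columns, over_5000)
result = union(local_rows, shipped_rows)
print(len(shipped_rows), "of", len(TRANSACTION_HIST), "history rows shipped")
print(result)
```

Without push-down, all three history rows and all three columns would cross the network before filtering; here the remote side does the data qualification and ships only what the UNION needs.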
15. Adaptive Optimizer
Incremental planning & execution of smaller query fragments
• Most efficient overall query plan derived from reliable statistics
– Statistics dynamically collected from foreign data
• Incremental query plans generated for single and multi-system queries
– Consistent Optimizer approach for queries within and between systems
– Teradata systems "transfer" query plans between systems
• A fully automatic optimizer feature – users don't have to change anything
(Diagram labels: Better Query Plan; Foreign and Sub-Queries)
Why?
Unreliable statistics can result in less-than-optimal query plans
Some analytic systems, like Hadoop, don't keep data statistics
Statistics not designed for compatibility between databases
How?
Pulls out remote server requests and single-row and scalar non-correlated sub-queries from a main query
Plans and executes them
Plugs the results into the main query
Plans and executes the main query
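The four "How?" steps can be walked through by hand with Python's built-in sqlite3 module. This is only a sketch of the sequence (pull out the scalar sub-query, execute it, plug the result in, run the main query); it does not reproduce the Teradata optimizer, and the table and column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (trans_id INTEGER, trans_amount REAL);
    INSERT INTO transactions VALUES (1, 100), (2, 900), (3, 500);
""")

# Main query containing a scalar, non-correlated sub-query.
single_statement = """
    SELECT trans_id FROM transactions
    WHERE trans_amount > (SELECT AVG(trans_amount) FROM transactions)
    ORDER BY trans_id
"""

# Steps 1 and 2: pull the sub-query out and execute it on its own.
(avg_amount,) = conn.execute(
    "SELECT AVG(trans_amount) FROM transactions"
).fetchone()

# Steps 3 and 4: plug the now-known value into the main query and run it.
incremental_rows = conn.execute(
    "SELECT trans_id FROM transactions WHERE trans_amount > ? ORDER BY trans_id",
    (avg_amount,),
).fetchall()

# Both routes return the same rows; only the order of planning differs.
direct_rows = conn.execute(single_statement).fetchall()
print(avg_amount, incremental_rows, direct_rows)
```

Once the fragment has run, the main query is planned against an actual value rather than an estimate, which is the motivation the slide gives for incremental planning.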
17. QueryGrid Architecture Advantages
• Designed for analytics across the enterprise
– Generalized architecture with tuned connections
– Extends Integrated Data Warehouse
- Not trying to be general purpose top tier engine
• Combines core curated data warehouse and other data repositories (e.g., data lake)
• Minimizes layers and data movement
– Combines processing engines without extra control or virtualization layer
– Pushes processing to data
– Moves only data required for processing with data on or analytics of other system
18.
DATAMART (1990's): Just Give Me Some Data and Fast!
EDW/IDW (2000's): Give Me Good Data But Do It Efficiently!
LOGICAL DATA WAREHOUSE (2010's): Give Me All Data Fast, Simple & Effectively!
19. Teradata QueryGrid: A Customer Example
Warranty Data Analysis
• Complex Multi-Structured Data set
• Huge Volumes
• Retention of Data Sets a problem
• Difficulty in correlating the data sets
20. Pareto Rule
The largest data set may not be the most complex
(Chart, values paired with labels in the order shown: Complexity 20% / 80%; Volume 80% / 20%; Queries 80% / 20%)
21. The Pareto Effect in Factory Test Data
Factory Test Data: 100%
Repair_Order: 0.76%
Test_header: 0.25%
Repair_details: 0.03%
Flash_file: 0.02%
Result: 95.75%
PSN: 0.13%
Repair_Fail: 0.95%
Log: 0.83%