As presented at OGh SQL Celebration Day, June 2016, NL. Covers new features in Big Data SQL including storage indexes, storage handlers, and the ability to install and license on commodity hardware.
2. info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Many customers and organisations are now running initiatives around "big data"
•Some are IT-led and are looking for cost-savings around data warehouse storage + ETL
•Others are "skunkworks" projects in the marketing department that are now scaling-up
•Projects now emerging from pilot exercises
•And design patterns starting to emerge
Many Organisations are Running Big Data Initiatives
•Gives us an ability to store more data, at more detail, for longer
•Provides a cost-effective way to analyse vast amounts of data
•Hadoop & NoSQL technologies can give us "schema-on-read" capabilities
•There's vast amounts of innovation in this area we can harness
•And it's very complementary to Oracle BI & DW
Why is Hadoop of Interest to Us?
•Mark Rittman, Co-Founder of Rittman Mead
‣Oracle ACE Director, specialising in Oracle BI&DW
‣14 years' experience with Oracle technology
‣Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
‣Oracle Business Intelligence Developers Guide
‣Oracle Exalytics Revealed
‣Writer for Rittman Mead Blog: http://www.rittmanmead.com/blog
•Email : mark.rittman@rittmanmead.com
•Twitter : @markrittman
About the Speaker
Flexible Cheap Storage for Logs, Feeds + Social Data
[Diagram: a $50k Hadoop node provides flexible, cheap storage for raw data - call center logs, chat logs, voice + chat transcripts, iBeacon logs, website logs, CRM data, transactions, social feeds and demographics - arriving via real-time feeds, batch and API; SQL-on-Hadoop then serves Customer 360 apps, predictive models and business analytics]
•Oracle Big Data Appliance - Engineered System for running Hadoop alongside Exadata
•Oracle Big Data Connectors - utility from Oracle for feeding Hadoop data into Oracle
•Oracle Data Integrator EE Big Data Option - add Spark, Pig data transforms to Oracle ODI
•Oracle BI Enterprise Edition - can connect to Hive, Impala for federated queries
•Oracle Big Data Discovery - data wrangling + visualization tool for Hadoop data reservoirs
•Oracle Big Data SQL - extend Oracle SQL language + processing to Hadoop
Oracle Software Initiatives around Big Data
Where Can SQL Processing Be Useful with Hadoop?
•Hadoop is not a cheap substitute for enterprise DW platforms - don't use it like this
•But adding SQL processing and abstraction can help in many scenarios:
‣Query access to data stored in Hadoop as an archive
‣Aggregating, sorting, filtering and transforming data
‣Set-based transformation capabilities for other frameworks (e.g. Spark)
‣Ad-hoc analysis and data discovery in real time
‣Providing tabular abstractions over complex datatypes
SQL! Though SQL isn't actually relational - according to Chris Date, SQL is just mappings; Ted Codd used predicate calculus, and there's never been a mainstream relational DBMS - but it is the standard language for RDBMSs, and it's great for set-based transforms & queries. So: Yes SQL!
•Originally developed at Facebook, now foundational within the Hadoop project
•Allows users to query Hadoop data using a SQL-like language
•Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON)
•Generates MapReduce code to return required data
•Extensible through SerDes and Storage Handlers
•JDBC and ODBC drivers for most platforms/tools
•Perfect for set-based access + batch ETL work
Apache Hive : SQL Metadata + Engine over Hadoop
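As a sketch of what this looks like in practice (the table name and HDFS path here are hypothetical, not from the deck), an external Hive table can be declared over delimited log files and then queried with familiar SQL:

CREATE EXTERNAL TABLE apache_log (
  host    STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/logs/apache/';

SELECT status, COUNT(*)
FROM   apache_log
GROUP  BY status;

The DDL only registers metadata in the metastore; the files themselves stay where they are in HDFS, which is the essence of schema-on-read.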
•Hive uses an RDBMS metastore to hold table and column definitions in schemas
•Hive tables then map onto HDFS-stored files
‣Managed tables
‣External tables
•Oracle-like query optimizer, compiler, executor
•JDBC and ODBC drivers, plus CLI etc
How Does Hive Translate SQL into MapReduce?
[Diagram: JDBC/ODBC clients, the CLI and Hue connect to the Hive Thrift Server; the Parser, Planner and Execution Engine use the Metastore and submit MapReduce jobs against HDFS]
hive> select count(*) from src_customer;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201303171815_0003, Tracking URL =
http://localhost.localdomain:50030/jobdetails.jsp…
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0003
OK
25
Time taken: 22.21 seconds
(HiveQL query → MapReduce job submitted → results returned)
•Cloudera's answer to Hive query response time issues
•MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access
•Mostly in-memory, but spills to disk if required
•Uses Hive metastore to access Hive table metadata
•Similar SQL dialect to Hive - not as rich though, and no support for Hive SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
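A minimal sketch of what this looks like from the client side (the host name and table are hypothetical): because Impala shares the Hive metastore, the same table can be queried from impala-shell with no MapReduce job being launched, typically returning in a fraction of the Hive elapsed time:

$ impala-shell -i bdanode1:21000
[bdanode1:21000] > SELECT status, COUNT(*)
                 > FROM apache_log
                 > GROUP BY status;

Data is read directly by the Impala daemons on each node, which is where the interactive response times come from.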
•Apache Drill is another SQL-on-Hadoop project, one that focuses on schema-free data discovery
•Inspired by Google Dremel; the innovation is querying raw data with schema optional
•Automatically infers and detects schema from semi-structured datasets and NoSQL DBs
•Join across different silos of data, e.g. JSON records, Hive tables and HBase databases
•Aimed at different use-cases than Hive - low-latency queries, discovery (think Endeca vs OBIEE)
Apache Drill - SQL for Schema-Free Data Discovery
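As an illustration of the schema-free approach (the file path and field names are hypothetical), Drill can query a raw JSON file directly - no DDL, no metastore registration - and navigate nested structures inline:

SELECT t.name, t.address.city
FROM   dfs.`/data/customers.json` t
WHERE  t.address.country = 'Brazil';

The schema is inferred per-query from the data itself, which is exactly the discovery-first use-case the slide contrasts with Hive.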
•A replacement for Hive, but uses Hive concepts and data dictionary (metastore)
•MPP (Massively Parallel Processing) query engine that runs within Hadoop
‣Uses same file formats, security, resource management as Hadoop
•Processes queries in-memory
•Accesses standard HDFS file data
•Option to use Apache Avro, RCFile, LZO or Parquet (column-store)
•Designed for interactive, real-time SQL-like access to Hadoop
How Impala Works
[Diagram: the OBIEE BI Server and Presentation Server connect via the Cloudera Impala ODBC Driver to Impala daemons running on every node of a multi-node Hadoop cluster, each accessing HDFS locally]
•Originally part of Oracle Big Data 4.0 (BDA-only)
‣Also required Oracle Database 12c, Oracle Exadata Database Machine
•Extends Oracle Data Dictionary to cover Hive
•Extends Oracle SQL and SmartScan to Hadoop
•Extends Oracle Security Model over Hadoop
‣Fine-grained access control
‣Data redaction, data masking
‣Uses fast C-based readers where possible (vs. Hive MapReduce generation)
‣Maps Hadoop parallelism to Oracle PQ
‣Big Data SQL engine works on top of YARN
‣Like Spark, Tez, MR2
Oracle Big Data SQL
[Diagram: SQL queries run on the Exadata Database Server, with SmartScan applied both on the Exadata Storage Servers and, via Oracle Big Data SQL, on the Hadoop cluster]
•As with other next-gen SQL access layers, uses common Hive metastore table metadata
•Leverages Hadoop-standard APIs for HDFS file access, metadata integration etc
Leverages Hive Metastore and Hadoop file access APIs
•Brings query-offloading features of Exadata to Oracle Big Data Appliance
•Query across both Oracle and Hadoop sources
•Intelligent query optimisation applies SmartScan close to ALL data
•Use same SQL dialect across both sources
•Apply same security rules, policies, user access rights across both sources
Extending SmartScan, and Oracle SQL, Across All Data
•Read data from HDFS Data Node
‣Direct-path reads
‣C-based readers when possible
‣Use native Hadoop classes otherwise
•Translate bytes to Oracle
•Apply SmartScan to Oracle bytes
‣Apply filters
‣Project columns
‣Parse JSON/XML
‣Score models
How Big Data SQL Accesses Hadoop (HDFS) Data
[Diagram: on each Data Node, the Big Data SQL Server's External Table Services (RecordReader, SerDe) read raw bytes from disk (1), translate them to Oracle format (2), and Smart Scan then filters and projects the result (3)]
•"Query Franchising" - dispatch of query processing to self-similar compute agents on disparate systems without loss of operational fidelity
•Contrast with OBIEE, which provides a query federation capability over Hadoop
•Sends sub-queries to each data source
•Relies on each data source's native query engine, and resource management
•Query franchising using Big Data SQL ensures consistent resource management
•And contrast with SQL translation tools (i.e. Oracle SQL to Impala)
•Either limits Oracle SQL to the subset that Hive, Impala supports
•Or translation engine has to transform each Oracle feature into Hive, Impala SQL
Query Franchising vs. SQL Translation / Federation
•Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata
‣Linked by Exadata configuration steps to one or more BDA clusters
•DBA_HIVE_TABLES and USER_HIVE_TABLES expose Hive metadata
•Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore
View Hive Table Metadata in the Oracle Data Dictionary
SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
2 from dba_hive_tables;
DATABASE_NAME TABLE_NAME
------------------------------ ------------------------------
default access_per_post
default access_per_post_categories
default access_per_post_full
default apachelog
default categories
default countries
default cust
default hive_raw_apache_access_log
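Column-level metadata is exposed the same way; as a sketch (assuming the cluster is linked as above - exact output depends on your metastore), DBA_HIVE_COLUMNS can be queried to see each Hive column and its type:

SQL> select table_name, column_name, hive_column_type
  2  from dba_hive_columns
  3  where table_name = 'apachelog';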
•Big Data SQL accesses Hive tables through the external table mechanism
‣ORACLE_HIVE external table type imports Hive metastore metadata
‣ORACLE_HDFS requires metadata to be specified
•Access parameters cluster and tablename specify the BDA cluster and Hive table source
Hive Access through Oracle External Tables + Hive Driver
CREATE TABLE access_per_post_categories(
hostname varchar2(100),
request_date varchar2(100),
post_id varchar2(10),
title varchar2(200),
author varchar2(100),
category varchar2(100),
ip_integer number)
organization external
(type oracle_hive
default directory default_dir
access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
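For contrast, an ORACLE_HDFS external table points directly at HDFS files and must carry its own column metadata, since there is no Hive definition to import (the path and single-column layout here are hypothetical):

CREATE TABLE movie_log (click VARCHAR2(4000))
ORGANIZATION EXTERNAL
(TYPE ORACLE_HDFS
 DEFAULT DIRECTORY default_dir
 LOCATION ('/user/oracle/moviework/applog_json/'))
REJECT LIMIT UNLIMITED;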
•Run normal Oracle SQL from the Oracle Database server
•Big Data SQL query franchising then uses agents on Hadoop nodes to query and return data independent of YARN scheduling; Oracle Database combines and returns full results
Running Oracle SQL on Hadoop Data Nodes
SELECT w.sess_id, w.cust_id, c.name
FROM   web_logs w, customers c
WHERE  w.source_country = 'Brazil'
AND    c.customer_id = w.cust_id;
•OBIEE can access Hadoop data via Hive, but it's slow
•(Impala only has a subset of Oracle SQL capabilities)
•Big Data SQL presents all data to OBIEE as Oracle data, with full advanced analytic capabilities across both platforms
Example : Combining Hadoop + Oracle Data for BI
[Report example: a Hive weblog activity table joined to Oracle dimension lookup tables, with the combined output shown in report form]
•Not all functions can be offloaded to the Hadoop tier
•Even for non-offloadable operations, Big Data SQL will perform column pruning and datatype conversion (which saves a lot of resources)
•Other (non-offloadable) operations will be done on the database side
•Requires Oracle Database 12.1.0.2 + patchset, and per-disk licensing for Big Data SQL
•You need an Oracle Big Data Appliance, and Oracle Exadata, to use Big Data SQL

SELECT name FROM v$sqlfn_metadata WHERE offloadable = 'YES';
•From Big Data SQL 3.0, commodity hardware can be used instead of BDA and Exadata
•Oracle Database 12.1.0.2 on x86_64 with Jan/Apr Proactive Bundle Patches
•Cloudera CDH 5.5 or Hortonworks HDP 2.3 on RHEL/OEL6
•See MOS Doc ID 2119369.1 - note cannot mix Engineered/Non-Engineered platforms
Running Big Data SQL on Commodity Hardware
•No functional differences when running Big Data SQL on commodity hardware
•External table capability lives with the database, and the performance functionality with the BDS cell software
•All BDS features (SmartScan, offloading, storage indexes etc) still available
•But hardware can be a factor now, as we're pushing processing down and data up the wire
•1Gb Ethernet can be too slow, 10Gb is a minimum (i.e. no InfiniBand)
•If you run on an undersized system you may see bottlenecks on the DB side
Big Data SQL on Commodity Hardware Considerations
•Subsequent releases of Big Data SQL have extended its Hadoop capabilities
‣Support for Hive storage handlers (HBase, MongoDB etc)
‣Hive partition elimination
•Better, more efficient access to Hadoop data
‣Storage Indexes
‣Predicate push-down for Parquet, ORC, HBase, Oracle NoSQL
‣Bloom filters
•Coming with Oracle Database 12.2
‣Big Data-aware optimizer
‣Dense Bloom filters
‣Oracle-managed Big Data partitions
Going beyond Fast Unified Query Access to HDFS Data
•Hive storage handlers give Hive the ability to access data from non-HDFS sources
‣MongoDB
‣HBase
‣Oracle NoSQL Database
•Run HiveQL queries against NoSQL DBs
•From BDS 1.1, Hive storage handlers can be used with Big Data SQL
‣Only MongoDB, HBase and NoSQL currently "supported"
‣Others should work but not tested
Big Data SQL and Hive Storage Handlers
•Create Hive table over HBase database as normal
‣Typically done to add INSERT and DELETE capabilities to Hive, for DW dimension ETL
•Create Oracle external table as normal, using the ORACLE_HIVE driver
Use of Hive Storage Handlers Transparent to BDS
CREATE EXTERNAL TABLE tablename (colname coltype[, colname coltype,...])
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
  'hbase.columns.mapping' = ':key,value:key,value:...');
CREATE TABLE tablename(colname colType[, colname colType...])
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(access parameters)
)
REJECT LIMIT UNLIMITED;
•From Big Data SQL 2.0, storage indexes are automatically created in Big Data SQL agents
•Check index before reading blocks - skip unnecessary I/Os
•An average of 65% faster than BDS 1.x
•Up to 100x faster for highly selective queries
•Columns in SQL are mapped to fields in the HDFS file via external table definitions
•Min/max value is recorded for each HDFS block in a storage index
Big Data SQL Storage Indexes
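As a sketch of how a storage index pays off (the external table and column are hypothetical, and the session statistic name is as given in the BDS documentation - check your release): a highly selective predicate lets the agents skip any HDFS block whose recorded min/max range excludes the value, and the bytes saved show up in session statistics:

SELECT COUNT(*)
FROM   movielog_ext
WHERE  cust_id = 1044942;

SELECT n.name, s.value
FROM   v$statname n, v$mystat s
WHERE  n.statistic# = s.statistic#
AND    n.name = 'cell XT granule IO bytes saved by storage index';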
•Hadoop supports predicate push-down through several mechanisms (filetypes, Hive partition pruning etc)
•Original BDS 1.0 supported Hive predicate push-down as part of SmartScan
•BDS 3.0 extends this by pushing SARGable (Search ARGument ABLE) predicates
‣Into Parquet and ORCFile, to reduce I/O when reading files from disk
‣Into HBase and Oracle NoSQL Database, to drive subscans of data from the remote DB
•Oracle Database 12.2 will add more optimisations
‣Columnar caching
‣Big Data-aware query optimizer
‣Managed Hadoop partitions
‣Dense Bloom filters
Extending Predicate Push-Down Beyond Hive
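As an illustration (against a hypothetical Parquet-backed external table), a SARGable comparison like the first query below can be evaluated inside the Parquet reader itself, so row groups whose column statistics exclude the value are never read; wrapping the column in a function, as in the second, defeats the push-down and forces the full data to be scanned:

-- SARGable: pushed down into the Parquet scan
SELECT * FROM sales_ext WHERE sale_date = DATE '2016-06-01';

-- Not SARGable: function applied to the column prevents push-down
SELECT * FROM sales_ext WHERE TO_CHAR(sale_date, 'YYYY') = '2016';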
•Typically a one-way street - queries run in Hadoop, but results are delivered through Oracle
•What if you want to load data into Hadoop, update data, do Hadoop-to-Hadoop transforms?
•Still requires formal Hive metadata, whereas the direction is towards Drill & schema-free queries
•What if you have other RDBMSs as well as Oracle RDBMS?
•Trend is towards moving all high-end analytic workloads into Hadoop - BDS is Oracle-only
•Requires Oracle 12c database, no 11g support
•And cost … BDS is $3k/Hadoop disk drive
•Can cost more than an Oracle BDA
•High-end, high-cost Oracle-centric solution
•of course!
⌠So Whatâs the Catch?