Find out how Hortonworks and IBM help you address these challenges and optimize your existing EDW environment.
https://hortonworks.com/webinar/modernize-existing-edw-ibm-big-sql-hortonworks-data-platform/
4. IBM Confidential
Hortonworks Data Platform & IBM Big SQL:
the Bridge for Customers
• #1 Pure Open Source Hadoop
Distribution
• 1000+ customers and 2100+
ecosystem partners
• Compatibility
• Capabilities & Overlap between Hive
(HWX) and Big SQL (IBM)
• Leader in SQL technology for Hadoop
• Leader in on-premises and hybrid cloud
data and analytics solutions
• Federation
• High Performance
• Improved concurrency
• Complex queries
• Enterprise security features
7. IBM Cloud
Advanced Analytics
Speed
Scalability
Cost
Productivity
• Build a centralized data repository (virtual/physical) to support business
decisions.
• Bring together siloed data available either locally or by federation to gain a
variety of benefits
• Modern data warehouse architecture enables deeper analytics and
advanced reporting from diverse sets of data.
Benefits
• Cost efficiency
• Deeper analytics (predictive, proactive
and historical analytics)
• Faster, more efficient analytics
• Enrich or augment warehouse data
What is EDW Modernization?
8. IBM Cloud
• Want to modernize your EDW without long and costly migration efforts
• Offloading historical data from Oracle, Db2, or Netezza because it is reaching capacity
• Operationalize machine learning
• Need to query, optimize, and integrate multiple data sources from one single endpoint
• Slow query performance for SQL workloads
• Require the skill set to migrate data from RDBMS to Hadoop/Hive
Do you have any of these challenges?
9. IBM Cloud
Compatible with Oracle, Db2 & Netezza SQL syntax
Modernizing EDW workloads on Hadoop has never been easier
Application portability (e.g., Cognos, Tableau, MicroStrategy, …)
Federates all your data behind a single SQL engine
Query Hive, Spark and HBase data from a single endpoint
Federate your Hadoop data using connectors to Teradata, Oracle, Db2 & more
Query data sources that have Spark connectors
Addresses a skillset gap needed to migrate technologies
Delivers high performance & concurrency for BI workloads
Unlock Hadoop data with analytics tools of choice
Provides greater security while accessing data
Robust SQL-based row filtering and column masking with role-based access control and
Ranger integration
Operationalize machine learning through integration with Spark
Bi-directional integration with Spark exploits Spark’s connectors as well as ML
capabilities
Here’s How Big SQL Addresses These Challenges
10. IBM Cloud
Big SQL queries heterogeneous systems in a single query: it is the only SQL-on-Hadoop engine that
virtualizes more than 10 different data sources, spanning RDBMS, NoSQL, HDFS, and Object Store
[Diagram: Big SQL with Fluid Query (federation) connecting Oracle, Microsoft SQL Server, Teradata, DB2, Netezza (PDA), Informix, Hive, HBase, HDFS, Object Store (S3), and WebHDFS]
Big SQL allows query federation by virtualizing data sources and processing where data resides
Hortonworks Data Platform (HDP)
Data Virtualization
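The federation flow above can be sketched with DB2-style federation DDL, which Big SQL inherits. This is a hedged sketch: the server, user mapping, and nickname names are hypothetical, and the exact wrapper and server options vary by data source type and release.

```sql
-- Hypothetical setup: expose a remote Oracle table to Big SQL via federation.
-- Exact wrapper/server options differ per source; consult the federation docs.
CREATE WRAPPER net8;                                   -- Oracle wrapper

CREATE SERVER ora_sales TYPE oracle VERSION 12 WRAPPER net8
    OPTIONS (NODE 'ORA_TNS_ENTRY');                    -- tnsnames.ora entry

CREATE USER MAPPING FOR USER SERVER ora_sales
    OPTIONS (REMOTE_AUTHID 'sales_reader', REMOTE_PASSWORD '********');

-- The nickname makes the remote table queryable as if it were local.
CREATE NICKNAME sales_orders FOR ora_sales.SALES.ORDERS;

-- A single query can now join remote Oracle data with a local Hadoop table;
-- the optimizer pushes processing down to where the data resides.
SELECT h.customer_id, h.clicks, o.order_total
FROM   web_clicks h
JOIN   sales_orders o ON o.customer_id = h.customer_id;
```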
11. IBM Cloud
Automatic memory management and workload management handle
complex query execution without failures
Executes all 99 TPC-DS queries in single stream and multi-stream environment
Easy porting of enterprise applications
Ability to work seamlessly with Business Intelligence tools like Cognos,
Tableau, etc. to gain insights
[Diagram: Oracle SQL, DB2 SQL, and Netezza SQL dialects feeding into Big SQL (SQL syntax tolerance, ANSI SQL compliant), with Cognos Analytics and InfoSphere Metadata Asset Manager as consuming tools]
Big SQL is a synergistic SQL engine that offers SQL compatibility, portability, and the collaborative
ability to perform composite analysis on data
Data Offloading and Analytics
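Because Big SQL tolerates Oracle, DB2, and Netezza dialects, ported queries often run unchanged. A small hedged illustration (the table and column names are invented):

```sql
-- Oracle-style NVL and DECODE alongside DB2-style date arithmetic,
-- all accepted by Big SQL's compatible SQL layer.
SELECT customer_id,
       NVL(region, 'UNKNOWN') AS region,
       DECODE(status, 'A', 'Active',
                      'I', 'Inactive',
                           'Other') AS status_text
FROM   customers
WHERE  signup_date > CURRENT DATE - 30 DAYS;
```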
12. IBM Cloud
[Diagram: users in roles BRANCH_A, BRANCH_B, and FINANCE (security admin) querying the same table; role-based access control enables separation of duties/audit, and row-level security limits each branch to its own rows]
Row and Column Level Security
Big SQL offers row- and column-level access control and role-based access control (RBAC), among other security settings
Data Security
13. IBM Cloud
[Diagram: users connect to the Big SQL Head node; the YARN Resource Manager & Scheduler and Slider client launch a Big SQL AM and multiple Big SQL workers per NodeManager (NM) host, all over HDFS]
Launch Multiple Workers per Host
More Granular Elasticity
For both 1 TB and 10 TB TPC-DS datasets:
• 2 workers/node: 1.6x speedup
• 4 workers/node: 2.2x speedup
In each scenario, the same total CPU/memory is used.
Performance: Big SQL Elastic Boost
14. IBM Cloud
Enhancements to the ORC file format (in v5.0.1, compared to v4.2) have improved performance over
previous releases, and ORC is at par with the Parquet file format
Performance: Parquet vs ORC File Format
15. IBM Cloud
[Diagram: a Big SQL worker and a Spark executor sharing data in memory]
Spark 2.1 is a powerful analytic co-processor that complements
the rich SQL functionality of Big SQL
Tight integration with Spark enables Big SQL
worker and Spark Executor to communicate in
memory without writing to disk
Bi-directional integration allows Spark
jobs to be executed from Big SQL
Big SQL is a self-tuning memory management SQL engine that integrates with Spark 2.1
Spark Integration
16. IBM Cloud
Data Ingestion
Data Transformation/
Data Science/
Machine Learning
Data Visualization
Leverage Big SQL throughout your journey
Virtualize disparate data sources
like Hadoop, RDBMS, and Object
Stores (S3) to join data in a single
query
Manipulate data and
operationalize data science
models written in various
languages
Perform data discovery, analyze,
and visualize business results in
notebooks or other BI tools
Democratize Data Science and Machine Learning
17. IBM Cloud
Big SQL
Archive Data away from EDW
• Move cold or rarely used data to Hadoop
as active archive
• Store more data for longer
Offload costly ETL process
• Free your EDW to perform high-value functions like analytics
& operations, not ETL
• Use Hadoop for advanced ETL
Optimize the value of your EDW
• Use Hadoop to refine new data sources, such as web and
machine data for new analytical context
• Access old data in traditional RDBMS using Big SQL’s
federation
• Combine data in different silos without duplication
Exploit Spark to take advantage of the latest technologies
• Spark integration brings data from NoSQL databases and
other non-traditional sources
• Make Spark’s use cases secure with granular security
defined in Big SQL
Reduce the migration effort & skillset gap
• Use existing investment in Oracle, Db2 and Netezza
technology skills
• Big SQL allows you to migrate applications without major
code rewrites or additional SQL development
[Diagram: EDW optimization architecture on HDP 2.6: systems of record (RDBMS, ERP, CRM, other) and new sources (clickstream, web & social, geolocation, sensor & machine, server logs, unstructured) feed the cluster via ELT; hot data stays in the Enterprise Data Warehouse, while cold data, deeper archive, and new sources land in Hadoop and feed data marts, business analytics, and visualization & dashboards]
Realize Cost Savings Faster with Big SQL
Top Use case – EDW Optimization
On June 13, 2017, we were very excited to announce an expansion to our partnership with Hortonworks to bring Data Science to the world on an Open platform.
This partnership will bring together IBM's Data Science Experience, IBM Machine Learning and IBM Big SQL on top of Hortonworks Data Platform, to form new integrated solutions. These solutions will help extend IBM's leadership in data science and machine learning to everyone from data scientists to business leaders to analyze and manage their data.
We specifically announced:
Hortonworks adopts IBM Data Science Experience (DSX) as its Data Science Platform.
Hortonworks adopts IBM Big SQL as its SQL engine for complex analytics on Hadoop.
IBM will leverage Hortonworks HDP as the core Hadoop distribution for Big SQL and DSX.
Hello, my name is Jessica Lee and I am the Big SQL Offering Manager. You can see my email on this slide.
Today I am going to walk through the pricing and packaging for IBM Big SQL 5.0, and HDP and HDF for IBM (Hortonworks resell products)
The last section of this deck will also provide information on BigInsights Entitlement migration
Some of these challenges might sound familiar
Modernizing your existing EDW without long and costly migration efforts
Querying across multiple types of data management systems
Lack the ability to query across traditional data warehouse and cloud-based data systems
Lack of skill set to migrate data from RDBMS to Hadoop/Hive
Optimizing and integrating external data sources with existing data sources
Offloading and porting workloads from Oracle, Db2 & Netezza
Difficult to offload workloads from EDW platforms
Performance
Slows down once too many users access the system
Interactive query performance is unacceptable
Operational efficiency around older data warehouse environments
Requires tools to help with ease of use and automation to manage workloads and schedule jobs
4 things to remember
Compatible with Oracle, Db2, Netezza
Modernizing EDW Workloads on Hadoop
App portability
Federates data behind a single SQL engine
Uses connectors to Teradata, Oracle, Db2, Netezza
Addresses a skill gap needed to migrate technologies
People hate migrations and rewriting code – it interrupts business processes and takes resources away from other work
There is a skills gap for doing this. Customers want an engine that is 100% compatible
With Big SQL, you can go into your accounts with Netezza, Oracle, and DB2 – they can move part of their RDBMS data warehouse to Hadoop without having to change any code
Ask your customer – do you want to optimize your oracle, netezza, db2 workloads by moving them to Hive
Sick of Oracle, Netezza? We got a solution to help you get them out.
Finally – it delivers that high performance for complex SQL workloads
Federation consists of the following: a federated server, a federated database, data sources, and clients (users and applications) that access both the local database and the data sources.
Federation is known for its strengths:
- Transparent: Correlate data from local tables and remote data sources, as if all the data is stored locally in the federated database
- Extensible: Update data in relational data sources, as if the data is stored in the federated database
- Autonomous: Move data to and from relational data sources without interruptions
- High Function: Take advantage of the data source processing strengths, by sending requests to the data sources for processing
- High Performance: Compensate for SQL limitations at the data source by processing parts of a distributed request at the federated server
Below is a list of data sources supported for Big SQL. For the latest information, visit
https://www-304.ibm.com/support/entdocview.wss?uid=swg27044495
Data Source           Supported Versions       Notes
DB2 for z/OS®         8.x, 9.x, 10.x
DB2® for LUW          9.7, 9.8, 10.1, 10.5
Oracle                11g, 11gR1, 11gR2, 12c
Teradata              12, 13, 14, 15           Not supported on POWER systems.
Netezza               4.6, 5.0, 6.0, 7.2       Not supported on POWER systems.
Informix              11.5
Microsoft SQL Server  2012, 2014
ODBC                  3.0 or later
Big SQL now comes pre-bundled with DataDirect drivers from Progress, which eliminates the need to download drivers and enables easy setup
Spark connectors enhance Big SQL’s connectivity to other data sources and also help operationalize machine learning models
Big SQL is ANSI compliant, therefore it can run Oracle SQL, DB2 SQL and Netezza SQL.
Big SQL tables can be created for data residing in HDFS, Hive, HBase, Object Store, and WebHDFS.
Big SQL is aware of remote indexes and table statistics to optimize federated queries.
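As a hedged sketch of that table DDL (names, paths, and column mappings are illustrative; exact clauses vary by release):

```sql
-- A Big SQL table over files in HDFS, stored as Parquet:
CREATE HADOOP TABLE web_clicks (
    customer_id BIGINT,
    url         VARCHAR(500),
    ts          TIMESTAMP
)
STORED AS PARQUET;

-- A Big SQL table over an existing HBase table, mapping the row key
-- and a column-family:qualifier pair to SQL columns:
CREATE HBASE TABLE customer_profile (
    customer_id BIGINT,
    name        VARCHAR(100)
)
COLUMN MAPPING (
    KEY     MAPPED BY (customer_id),
    cf:name MAPPED BY (name)
);
```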
Big SQL provides:
- a unified view of all your tables, with federated query support to external databases
- tables stored optimally as Hive or HBase, or read with Spark, optimized for the expected workload
- a single security model across all tables (including row/column security)
- the ability to join across all datasets, and all types of tables, using standard ANSI SQL
- Oracle, Netezza, and DB2 SQL extensions if you prefer
- a single database connection and driver
No other SQL engine for Hadoop even comes close.
For Albert, from Branch A, only rows from his branch are listed because he has access to records with same branch name as his.
Bonnie, from Branch B, can only view records from her branch and does not have access to view the salary column.
But Cindy, a security administrator, can view records from all branches and also view all columns.
When customers take the time to review the security capabilities of other SQL Engines on Hadoop, they realize a lot of things they are used to are missing. Big SQL supports all of the important features.
Separation of Duties – an important feature if you really want to operationalize the environment. Most enterprise customers don’t like having a database “super user” who can manage the environment and also see all data (like root access for the SQL engine). Big SQL can define Database Administrator roles that allow administration of the environment without permitting access to the data, for example.
Big SQL also provides row-level and column-level security using row permissions (CREATE PERMISSION) or column masking (CREATE MASK) on tables.
Note that both Big SQL ROLES (same concept as DB2 or Oracle roles) and OS Groups (Local and LDAP) are supported.
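A hedged sketch of the row-permission and column-mask DDL mentioned above, matching the branch scenario (the table, role, column, and lookup-table names are invented for illustration):

```sql
-- Users see only their own branch's rows unless they hold the FINANCE role.
-- user_branches is a hypothetical lookup table mapping users to branches.
CREATE PERMISSION branch_rows ON staff
    FOR ROWS WHERE VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
              OR branch = (SELECT home_branch FROM user_branches
                           WHERE user_id = SESSION_USER)
    ENFORCED FOR ALL ACCESS
    ENABLE;

-- Only FINANCE sees the salary column; everyone else gets NULL.
CREATE MASK salary_mask ON staff FOR COLUMN salary
    RETURN CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER, 'FINANCE') = 1
                THEN salary
                ELSE NULL
           END
    ENABLE;

-- The controls take effect once activated on the table.
ALTER TABLE staff ACTIVATE ROW ACCESS CONTROL;
ALTER TABLE staff ACTIVATE COLUMN ACCESS CONTROL;
```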
Instead of running one (potentially very large memory) worker per host, elastic boost enables the ability to run multiple smaller Big SQL workers per host.
We can start and stop workers fully independently of each other. Therefore, while the same amount of maximum memory/CPU is consumed when all workers are on, we get much more granularity of resource usage in terms of scaling capacity up and down with more workers.
We have found that for INSERT… SELECT FROM workloads (common for EL-T type workloads), elastic boost can improve performance by 1.6X for 2 workers/node and 2.2x for 4 workers per node.
Note that for each of these bars, physical CPU, memory, disk, and network are fixed. We are only increasing the number of workers.
#comment: this chart may be confusing as it looks like we’re comparing ORC vs. Parquet. This is not the case as we’re actually showing two different scale factors (1TB, 10TB) and just trying to get test coverage with the least amount of work. Both formats showed similar improvements in performance.
https://developer.ibm.com/hadoop/2017/03/15/logical-big-sql-workers-boost-big-sql-performance/
Big SQL Head node can actually start its own driver program and spawn Spark executors. From there, Spark executors can “pair up” with Big SQL workers so that they can communicate with each other, in memory, without first writing to disk.
Further, RDD partitions can be fed directly to Big SQL workers without passing through the driver or performing disk I/O.
In Big SQL 5.0, Big SQL is tightly integrated with Spark. This integration enables the development of hybrid applications where Spark jobs can be executed from Big SQL SELECT statements and the results efficiently streamed in parallel from Spark to Big SQL.
Big SQL applications can treat Spark as a powerful analytic co-processor that complements the rich SQL functionality that is available in Big SQL. Big SQL users, who already enjoy the best SQL performance, richest SQL language, and extensive enterprise capabilities, can also leverage Spark for its non-relational distributed processing capabilities and rich analytic libraries. The capabilities of both engines are seamlessly blended in a single SQL statement, with large volumes of data flowing efficiently between them.
A built-in table function can make Spark functionality directly available at the SQL language level, and a Big SQL SELECT statement can invoke Spark jobs directly and process the results of those jobs as though they were tables. For example, you can use the EXECSPARK table function to invoke Spark jobs from Big SQL.
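As a hedged illustration of that pattern (the HDFS path and the `country` field are hypothetical; the parameter names follow the documented JSON-reading example):

```sql
-- Invoke a Spark job from SQL: EXECSPARK runs the built-in DataSource
-- class to read JSON, and the result is consumed as a table.
-- 'country' is a field assumed to exist in the sample JSON.
SELECT doc.*
FROM TABLE(SYSHADOOP.EXECSPARK(
         class  => 'DataSource',
         format => 'json',
         load   => 'hdfs:///user/bigsql/demo.json'
     )) AS doc
WHERE doc.country IS NOT NULL;
```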
This is the most important use case for using Big SQL – for EDW optimization
End result – your customer will realize cost savings faster with Big SQL
One of the most unique values that Big SQL brings:
it helps reduce the migration effort and skill set gap.
What does this mean? Customers can use their existing investment in Oracle, Db2 and Netezza technology skills
Big SQL allows customers to migrate their existing applications without major code rewrites and additional SQL development. This is huge!
What distinguishes Big SQL from other SQL-on-Hadoop offerings? Well, I’ve tried to summarize the key characteristics here, categorizing them into 4 broad areas. We’ll dig into many of these features in greater detail later in this presentation.
Note to IBMers: Separate deep-dive presentations and other assets are available on ZACS for federation, performance, and other Big SQL features.