1. © 2014 IBM Corporation
Big SQL 3.0
Fast and easy SQL on Hadoop
Wilfried Hoge
IT Architect Big Data hoge@de.ibm.com @wilfriedhoge
z/OS and LUW
2. Hadoop Observations
Technology
• Rapid innovation
• Two sources of innovation
– Open source community
– Integration of existing technologies
• Tools and application vendors selecting partners and integrating
Customers
• High degree of interest
• Many experimental workstreams
• ROI establishment varies by use case
• Many customers want to offload data from EDW
Vendors
• Multiple business models
• OSS support vendors have mindshare lead
• OSS support vendors business model viability unclear
• SW Portfolio vendors integrating/adding
© 2014 International Business Machines Corporation
3. InfoSphere BigInsights
provides Enterprise Grade Hadoop analytics
• Manages a wide variety and huge volume
of data
• Augments open source Hadoop with
enterprise capabilities
– Visualization & Exploration
– Development tools
– Advanced Engines
– Connectors
– Workload Optimization
– Enterprise integration
– Analytic Accelerators
– Application and industry accelerators
– Administration & Security
BIG DATA PLATFORM
Application Discovery
Development
Accelerators
Data
Warehouse
Stream
Computing
Systems
Management
Hadoop
System
Information Integration & Governance
Data Media Content Machine Social
4. Key Differentiators for BigInsights
Enterprise Performance & Integration
• Workload / performance optimization
• GPFS
• Security
• Key integrations & Connectors with Enterprise Ecosystem
Analytics
• Text analytics
• Social Data Analytics Accelerators
• Machine Data Analytics Accelerators
• Execute R in an integrated application
Usability & Productivity
• Big SQL
• BigSheets
• Development Tools
• Web Console
5. Integrated Web Console
• Manage BigInsights
– Inspect /monitor system health
– Add / drop nodes
– Start / stop services
– Run / monitor jobs (applications)
– Explore / modify file system
– Create custom dashboards
• Launch applications
– Spreadsheet-like analysis tool
– Pre-built applications (IBM supplied or
user developed)
• Publish applications
• Monitor cluster, applications, data
– Create / view event alerts.
6. Distributed Filesystem
Applications
High level languages
(SQL, JAQL, PIG, …)
Map/Reduce API
Hadoop DFS API
GPFS HDFS
Distributed filesystem GPFS FPO
gives additional flexibility, security
and high availability
• Optional file system alternative to HDFS
• More than 10 years of experience with HPC
• Key features
– No single point of failure
– Built-in High Availability
– POSIX compliance
• Standard applications cannot use HDFS
but they can use GPFS-FPO
– Enhanced Security
– Higher performance
• Allows concurrent read and
write by multiple programs
– Recovery capabilities
• Journaling filesystem
– Support for Storage Pools
– SnapShot capability
7. BigInsights has a simple but
effective security system based
on a gateway to Hadoop
Users Sources
• All Hadoop servers are connected over a
private network
• Unrestricted communication between cluster
servers on the private network
• BigInsights Web Console acts as a
gateway into the cluster
• Authentication through PAM or LDAP
• Role based authorization
• Authorization will be enforced at 3 levels:
– UI level
– Data level
– Map-Reduce level
• Authorization also respected by services (e.g. SQL)
• Kerberos support
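A minimal sketch of role-based authorization checked at several levels, as the bullets above describe. The role names, permission strings, and function names are hypothetical and are not taken from BigInsights; the slide only states that authorization is role based and enforced at the UI, data, and MapReduce levels.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping (illustrative names only).
ROLE_PERMISSIONS = {
    "admin": {"ui:console", "data:read", "data:write", "job:submit"},
    "analyst": {"ui:console", "data:read", "job:submit"},
    "viewer": {"ui:console", "data:read"},
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)


def is_authorized(user: User, permission: str) -> bool:
    """Return True if any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


def gateway_check(user: User, ui_action: str, data_action: str, job_action: str) -> dict:
    """Check one request at the UI, data, and job levels in turn."""
    return {
        "ui": is_authorized(user, ui_action),
        "data": is_authorized(user, data_action),
        "job": is_authorized(user, job_action),
    }


viewer = User("alice", {"viewer"})
print(gateway_check(viewer, "ui:console", "data:read", "job:submit"))
```

In this sketch a viewer can open the console and read data but cannot submit jobs, which is the kind of per-level decision a gateway would make before forwarding a request into the cluster.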
Authentication
Authority
External
Gateway / Web Console
Services Data
Nodes
Infrastr.
Nodes
Distributed Filesystem
8. BigSheets to analyze and visualize
• Model “big data” collected
from various sources in
spreadsheet-like structures
• Filter and enrich content with
built-in functions
• Combine data in different
workbooks
• Visualize results through
spreadsheets, charts
• Export data into common
formats (if desired)
No programming knowledge needed!
9. Centralized dashboard & data flows
A centralized dashboard to
visualize analytic results:
• BigSheets collections
• Analytic application results
• Monitoring metrics
• Ability to view BigSheets data flows between
and across data sets to quickly navigate and
relate analysis and charts
• Visualize inner and outer joins, enhanced filters
for BigSheets columns, column data-type
mapping for collections, and application of
analytics to BigSheets columns, etc.
10. Tools for Developers
Editors
• A workflow editor that greatly simplifies the
creation of complex Oozie workflows with a
consumable interface
• A Pig/Jaql Editor with content assist and syntax
highlighting that enables users to create and
execute new applications using Pig or Jaql in
local or cluster mode from the Eclipse IDE
Application development & deployment
• Enablement of BigSheets macro
and BigSheets reader development
• Text Analytics development,
including support for modular
rule sets
• Publish new application: BigSheets
Macro, BigSheets Reader, AQL
module, Jaql module
1. Sample your data
2. Develop your application using BigInsights tools
3. Test your application
4. Package and publish your application
5. Deploy your application on the cluster
11. Running Applications on Big Data
• Browse available applications
• Deploy published applications
(administrators only)
• Launch (or schedule for launch) a
deployed application
• Monitor job (application) execution
status
• Predefined applications
• Import & Export Data
• Database & Files
• Web and Social
• Analyze and Query
• Predictive Analytics
• Text Analytics
• SQL/Hive, Jaql, Pig, Hbase
• Accelerators
12. Application linking and interfaces to build new apps
• Compose new
applications from
existing applications
and BigSheets
• Invoke analytics
applications from the
web console, including
integration within
BigSheets
• REST data source App
that enables users to
load data from any data source supporting REST APIs into BigInsights,
including popular social media services
• Sampling App that enables users to sample data for analysis
• Subsetting App that enables users to subset data for data analysis
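The slide does not say how the Sampling and Subsetting Apps work internally. As a rough illustration of the two ideas, the sketch below uses reservoir sampling (a standard technique for sampling a stream of unknown length, swapped in here as an assumption) and a simple predicate filter. The function names are hypothetical.

```python
import random
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of k records in a single pass."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


def subset(records: Iterable[T], predicate: Callable[[T], bool]) -> List[T]:
    """Keep only the records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


print(len(reservoir_sample(range(1000), 10)))
print(subset(range(10), lambda x: x % 2 == 0))
```

Sampling keeps a representative fraction of all records, while subsetting keeps every record that matches a condition; both reduce the data volume before analysis.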
13. Collaborative Big Data for many roles
• Business Users can get their hands on big
data and use big data applications and
BigSheets to get insights into their data
• Data scientists can perform deeper
analysis and get richer insights
• Administrators are empowered to be
more agile through better controls and
views into key performance indicators
• Developers can leverage unified tooling in a Big Data
Application Development Lifecycle and are able to
create and deploy new types of applications, with
enhancements that simplify even complex workflows
14. Big SQL 3.0 – Architected for Performance
• Leverage IBM's rich SQL heritage, expertise, and technology
– Modern SQL:2011 capabilities
– DB2 compatible SQL PL support
• SQL bodied functions and stored procedures
• Application logic/security encapsulation
• Architected from the ground up for performance
– low latency and high throughput
• MapReduce replaced with a modern MPP
architecture
– Compiler and runtime are native code (not java)
– Big SQL worker daemons live directly on cluster
– Continuously running (no startup latency)
– Processing happens locally at the data
• Operations occur in memory with the ability
to spill to disk
– Supports aggregations and sorts larger than available RAM
• Integration with BigSheets (source & target)
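To illustrate "processing happens locally at the data" in an MPP design, here is a small sketch in which each worker aggregates only its own partition and a coordinator merges the partial results. It is a conceptual model, not Big SQL's runtime (which the slide says is native code); the data layout, names, and the use of threads to stand in for worker daemons are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: each inner list stands for the rows stored on one node.
NODE_LOCAL_ROWS = [
    [("DE", 10), ("US", 5), ("DE", 7)],
    [("US", 3), ("FR", 4)],
    [("FR", 1), ("DE", 2), ("US", 8)],
]


def worker_partial_sum(rows):
    """Runs 'on the node': aggregate only the locally stored rows."""
    partial = Counter()
    for key, amount in rows:
        partial[key] += amount
    return partial


def coordinator_sum(partitions):
    """Fan the work out to all workers, then merge their partial sums."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for partial in pool.map(worker_partial_sum, partitions):
            total.update(partial)  # Counter.update adds counts per key
    return dict(total)


print(coordinator_sum(NODE_LOCAL_ROWS))
```

This corresponds roughly to a `SELECT country, SUM(amount) ... GROUP BY country` query: only the small per-node partial sums travel over the network, not the raw rows.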
SQL-based
Application
IBM Data Server Client
Big SQL
SQL MPP Runtime
Data Sources
Parquet CSV Seq RC
Avro ORC JSON Custom
InfoSphere BigInsights
15. Big SQL 3.0 – Architecture cont.
• Head (coordinator / management) node
– Listens to the JDBC/ODBC connections and compiles / optimizes the query
– Coordinates the execution of the query
– Optionally store user data in traditional RDBMS table (single node only)
• Big SQL worker processes reside on compute nodes (some or all)
• Worker nodes stream data between each other as needed
• Workers can spill large data sets to local disk if needed
– Allows Big SQL to work with data sets
larger than available memory
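Spilling to disk can be pictured with a classic external merge sort: sort chunks that fit in memory, write each sorted run to local disk, then merge the runs. This is a generic sketch of that technique, not Big SQL's implementation; `max_in_memory` and the file layout are assumptions made for illustration.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _spill_run(sorted_values: List[int], spill_dir: str) -> str:
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run")
    with os.fdopen(fd, "w") as f:
        for value in sorted_values:
            f.write(f"{value}\n")
    return path


def _read_run(path: str) -> Iterator[int]:
    """Stream a spilled run back from disk one value at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values: Iterable[int], max_in_memory: int) -> List[int]:
    """Sort values while holding at most max_in_memory of them in the buffer."""
    with tempfile.TemporaryDirectory() as spill_dir:
        run_paths: List[str] = []
        buffer: List[int] = []
        for value in values:
            buffer.append(value)
            if len(buffer) >= max_in_memory:
                # Buffer is full: sort it and spill it to local disk.
                run_paths.append(_spill_run(sorted(buffer), spill_dir))
                buffer = []
        runs = [_read_run(p) for p in run_paths]
        runs.append(iter(sorted(buffer)))  # the last, partially filled buffer
        # heapq.merge reads the runs lazily; the result is collected into a
        # list here only to keep the example small. A real engine would
        # stream the merged output instead.
        return list(heapq.merge(*runs))


print(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3))
```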
[Diagram: management nodes run the Big SQL head node, the Hive Metastore, the Name Node, and the Job Tracker; each compute node runs a Task Tracker, a Data Node, and (on some or all nodes) a Big SQL worker; all nodes sit on GPFS/HDFS]
16. Big SQL 3.0 – Features
Application Portability & Integration
Data shared with Hadoop ecosystem
Comprehensive file format support
Superior enablement of IBM software
Enhanced by Third Party software
Performance
Modern MPP runtime
Powerful SQL query rewriter
Cost based optimizer
Optimized for concurrent user throughput
Results not constrained by memory
Rich SQL
Comprehensive SQL Support
IBM SQL PL compatibility
Federation
Distributed requests to multiple data
sources within a single SQL statement
Main data sources supported:
DB2 LUW, DB2/z, Teradata, Oracle, Netezza
Enterprise Features
Advanced security/auditing
Resource and workload management
Self tuning memory management
Comprehensive monitoring
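The federation idea, one SQL statement that spans more than one data source, can be tried out with an analogy: SQLite's `ATTACH DATABASE` lets a single query join tables from two separate databases. This is only a stand-in; the slides do not show Big SQL's federation syntax, and the table names and rows below are made up for the example.

```python
import sqlite3


def federated_totals():
    """Join a 'Hadoop-side' table with a 'warehouse-side' table in one query."""
    conn = sqlite3.connect(":memory:")
    try:
        # A second, independent in-memory database plays the remote source.
        conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
        conn.execute("CREATE TABLE main.clicks (customer_id INTEGER, clicks INTEGER)")
        conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, name TEXT)")
        conn.executemany(
            "INSERT INTO main.clicks VALUES (?, ?)", [(1, 40), (2, 15), (1, 2)]
        )
        conn.executemany(
            "INSERT INTO warehouse.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
        )
        query = """
            SELECT c.name, SUM(k.clicks) AS total_clicks
            FROM main.clicks AS k
            JOIN warehouse.customers AS c ON c.customer_id = k.customer_id
            GROUP BY c.name
            ORDER BY c.name
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


print(federated_totals())
```

The application sees one result set and does not need to know that the two tables live in different stores, which is the point of the "single SQL statement" claim on the slide.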
17. BigSQL Demo
18. Comparing Big SQL 3.0 and Hive 0.12 for Ad-Hoc Queries
[Bar chart: elapsed time (sec, axis from 0 to 3500) by query number (1 to 22) for Big SQL 3.0 on Parquet vs. Hive 0.12 on ORC, 1TB Classic BI Workload]
Big SQL is up to 41x faster than Hive 0.12
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload"
in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard,
running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and
TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers,
each with 64GB RAM and 9x2TB HDDs running Redhat Linux 6.3. Results may not be typical and will vary based on actual workload,
configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
19. IBM BigInsights brings efficient integration of R with Big R
• R as a big data query language
– Outside-in execution
• R as a statistical language for
deep computing
– Inside-out execution
– Partitioning of large data (“divide”)
– Parallel cluster execution of pushed
down R code (“conquer”)
– Almost any R package can run in
this environment
• R as the gateway to scalable
machine learning
– A scalable ML engine that provides
canned algorithms, and an ability to
author new ones, all via R
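The "divide" and "conquer" steps above can be sketched as follows: split the data by a key, run the same user function on every partition in parallel, and collect one result per partition. The sketch is written in Python rather than R, does not use the Big R API, and runs threads on one machine instead of a cluster; the sample rows and function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def partition_by(rows, key):
    """'Divide': group rows into partitions by the value of one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def mean_delay(rows):
    """Stand-in for user code pushed down to the data: a per-partition mean."""
    return mean(r["delay"] for r in rows)


def divide_and_conquer(rows, key, func):
    """'Conquer': apply func to every partition in parallel, keyed by partition."""
    partitions = partition_by(rows, key)
    with ThreadPoolExecutor() as pool:
        results = pool.map(func, partitions.values())
        return dict(zip(partitions.keys(), results))


# Hypothetical sample data.
flights = [
    {"carrier": "AA", "delay": 10},
    {"carrier": "AA", "delay": 20},
    {"carrier": "LH", "delay": 5},
]
print(divide_and_conquer(flights, "carrier", mean_delay))
```

Because each partition is processed independently, the per-partition function can be arbitrary code, which is why the slide can claim that almost any R package runs in this mode.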
R Clients
Pull data
(summaries) to
R client
Scalable
ML
Engine
Data Sources
R Packages
R Packages
Embedded R Execution
Or, push R
functions right
on the data
20. Text Analytics in BigInsights
Distill structured information from
unstructured data
– Rich annotator library supports multiple
languages
– Declarative Information Extraction (IE) system
based on an algebraic framework
– Richer, cleaner rule semantics
– Better performance through optimization
How it works
• Parses text and detects meaning with annotators
• Understands the context in which the text is
analyzed
• Hundreds of pre-built annotators for names,
addresses, phone numbers, among others
Accuracy
• Highly accurate in deriving meaning from
complex text
Performance
• AQL language optimized for MapReduce
Unstructured text (document, email, etc)
Football World Cup 2010, one team
distinguished themselves well, losing to
the eventual champions 1-0 in the Final.
Early in the second half, Netherlands’
striker, Arjen Robben, had a breakaway,
but the keeper for Spain, Iker Casillas
made the save. Winger Andres Iniesta
scored for Spain for the win.
Classification and Insight
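To make the idea of annotators concrete, the sketch below pulls a few structured facts out of the sample text above using plain regular expressions and a small dictionary. It is not AQL and does not reflect how the BigInsights extractors are written; the patterns, the country list, and the output shape are assumptions chosen to fit this one passage.

```python
import re

TEXT = (
    "Football World Cup 2010, one team distinguished themselves well, "
    "losing to the eventual champions 1-0 in the Final. Early in the second half, "
    "Netherlands' striker, Arjen Robben, had a breakaway, but the keeper for Spain, "
    "Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win."
)

# Hypothetical "annotators": one dictionary and two patterns.
COUNTRIES = ("Netherlands", "Spain")
SCORE = re.compile(r"\b(\d+)-(\d+)\b")
PLAYER = re.compile(
    r"\b(striker|keeper|[Ww]inger)\b(?:,| for [A-Z][a-z]+,)? ([A-Z][a-z]+ [A-Z][a-z]+)"
)


def annotate(text):
    """Return a list of annotations found in the text."""
    annotations = []
    for m in SCORE.finditer(text):
        annotations.append({"type": "Score", "text": m.group(0)})
    for country in COUNTRIES:
        if re.search(rf"\b{re.escape(country)}\b", text):
            annotations.append({"type": "Country", "text": country})
    for m in PLAYER.finditer(text):
        annotations.append(
            {"type": "Player", "text": m.group(2), "role": m.group(1).lower()}
        )
    return annotations


for annotation in annotate(TEXT):
    print(annotation)
```

Hand-written patterns like these break easily on new text, which is the motivation the slide gives for a declarative rule language with a pre-built annotator library and an optimizer.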
21. BigInsights offers value beyond Open Source
Enterprise Capabilities
Visualization & Exploration
Development Tools
Advanced Engines
Connectors
Workload Optimization
Administration & Security
Key differentiators
• Built-in analytics
• Enterprise software integration
• Spreadsheet-style analysis
• Integrated installation of supported open
source and other components
• Web Console for admin and application
access
• Platform enrichment: additional security,
performance features, . . .
• World-class support
• Full open source compatibility
[Diagram labels: Open source components; IBM-certified Apache Hadoop]
Business benefits
• Quicker time-to-value due to IBM
technology and support
• Reduced operational risk
• Enhanced business knowledge with flexible
analytical platform
• Leverages and complements existing
software
22. InfoSphere BigInsights for Hadoop includes the latest Open
Source components, enhanced by enterprise components
IBM InfoSphere BigInsights for Hadoop
Visualization & Ad
Hoc Analytics
BigSheets
Charting Dashboard
Advanced Analytics
R Big R Analytics
Data
Access
Runtime
Data Store
File System
Security
Resource Management &
Oozie
Administration
YARN*
Applications & Development
Governance
Text
Jaql
Eclipse Tooling:
MapReduce, Hive, Jaql,
Pig, Big SQL, AQL
Flume
Sqoop
HCatalog
Hive Pig
MapReduce
HBase
HDFS
BigSheets Reader
and Macro
Text Analytics
Extractors
Stream Computing
Streams
Adaptive MapReduce
Solr/
Lucene
Enterprise
Search
ETL
Big SQL
Open Source IBM
Kerberos
ZooKeeper
Console Monitoring
Audit & History
GPFS FPO
LDAP Data Security for Hadoop
Data Masking Data Matching Data Privacy for Hadoop
Search
Flexible
Scheduler
* In Beta
23. From Getting Started to Enterprise Deployment:
Different BigInsights Editions For Varying Needs
Standard Edition
- Spreadsheet-style tool
- Dashboards
- Pre-built applications
- Eclipse tooling
- RDBMS connectivity
- Monitoring and alerts
- Platform enhancements
- Web console
- Big SQL
- . . .
Enterprise Edition
- Accelerators
- GPFS-FPO
- Adaptive MapReduce
- Text analytics
- Enterprise Integration
- Big R
- InfoSphere Streams*
- Watson Explorer*
- Cognos BI*
- Data Click*
- . . .
* Limited use license
Quick Start
Free. Non-production
Same features as
Standard Edition plus text
analytics and Big R
Apache Hadoop
[Diagram axes: Breadth of capabilities; Enterprise class]
24. IBM big data
THINK