© 2014 IBM Corporation 
Big SQL 3.0 
Fast and easy SQL on Hadoop 
Wilfried Hoge 
IT Architect Big Data hoge@de.ibm.com @wilfriedhoge 
z/OS and LUW
Hadoop Observations 
Technology
• Rapid innovation
• Two sources of innovation
– Open source community
– Integration of existing technologies
• Tools and application vendors selecting partners and integrating
Customers
• High degree of interest
• Many experimental workstreams
• ROI establishment varies by use case
• Many customers want to offload data from EDW
Vendors
• Multiple business models
• OSS support vendors have mindshare lead
• OSS support vendors business model viability unclear
• SW Portfolio vendors integrating/adding
© 2014 International Business Machines Corporation 2
InfoSphere BigInsights 
provides Enterprise Grade Hadoop analytics 
• Manages a wide variety and huge volume 
of data 
• Augments open source Hadoop with 
enterprise capabilities 
– Visualization & Exploration 
– Development tools 
– Advanced Engines 
– Connectors 
– Workload Optimization 
– Enterprise integration 
– Analytic Accelerators 
– Application and industry accelerators 
– Administration & Security 
BIG DATA PLATFORM 
Application Discovery 
Development 
Accelerators 
Data 
Warehouse 
Stream 
Computing 
Systems 
Management 
Hadoop 
System 
Information Integration & Governance 
Data Media Content Machine Social 
Key Differentiators for BigInsights 
Enterprise Performance & Integration
• Workload / performance optimization
• GPFS
• Security
• Key integrations & Connectors with Enterprise Ecosystem
Analytics
• Text analytics
• Social Data Analytics Accelerators
• Machine Data Analytics Accelerators
• Execute R in an integrated application
Usability & Productivity
• Big SQL
• BigSheets
• Development Tools
• Web Console
Integrated Web Console 
• Manage BigInsights 
– Inspect / monitor system health 
– Add / drop nodes 
– Start / stop services 
– Run / monitor jobs (applications) 
– Explore / modify file system 
– Create custom dashboards 
• Launch applications 
– Spreadsheet-like analysis tool 
– Pre-built applications (IBM supplied or 
user developed) 
• Publish applications 
• Monitor cluster, applications, data 
– Create / view event alerts. 
Distributed Filesystem 
Applications 
High level languages 
(SQL, JAQL, PIG, …) 
Map/Reduce API 
Hadoop DFS API 
GPFS HDFS 
Distributed filesystem GPFS FPO 
gives additional flexibility, security 
and high availability 
• Optional file system alternative to HDFS 
• More than 10 years of experience with HPC 
• Key features 
– No single point of failure 
– Built-in High Availability 
– POSIX compliance 
• Standard applications cannot use HDFS 
but they can use GPFS-FPO 
– Enhanced Security 
– Higher performance 
• Allows concurrent read and 
write by multiple programs 
– Recovery capabilities 
• Journaling filesystem 
– Support for Storage Pools 
– SnapShot capability
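The POSIX compliance point can be made concrete with a small sketch. Because a POSIX file system is reachable through ordinary file calls, a standard program needs no Hadoop client library to read or write cluster data. The sketch below is only an illustration under assumptions: the function names are hypothetical, and a temporary directory stands in for a GPFS-FPO mount point, which the slides do not specify.

```python
import os
import tempfile
from typing import List


def append_record(path: str, record: str) -> None:
    """Append one line using ordinary POSIX-style file I/O."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(record + "\n")


def read_records(path: str) -> List[str]:
    """Read every line back with the same standard calls."""
    with open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Assumption: on a cluster this would be a directory on the mounted
    # file system; a temporary directory stands in for it here.
    base_dir = tempfile.mkdtemp()
    log_path = os.path.join(base_dir, "events.log")
    append_record(log_path, "sensor-1,ok")
    append_record(log_path, "sensor-2,ok")
    print(read_records(log_path))
```

The same two functions would work unchanged against any mounted POSIX path, which is the practical meaning of "standard applications can use GPFS-FPO".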
BigInsights has a simple but 
effective security system based 
on a gateway to Hadoop 
Users Sources 
• All Hadoop servers are connected over a 
private network 
• Unrestricted communication between cluster 
servers on the private network 
• BigInsights Web Console acts as a 
gateway into the cluster 
• Authentication through PAM or LDAP 
• Role based authorization 
• Authorization will be enforced at 3 levels: 
– UI level 
– Data level 
– Map-Reduce level 
• Authorization also respected by services (e.g. SQL) 
• Kerberos support 
Authentication 
Authority 
External 
Gateway / Web Console 
Services Data 
Nodes 
Infrastr. 
Nodes 
Distributed Filesystem 
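The role-based authorization described on this slide, enforced at the UI, data, and Map-Reduce levels, can be sketched as a simple lookup. This is a minimal illustration of the idea, not the BigInsights implementation: the role names and the grants assigned to them are hypothetical.

```python
from enum import Enum
from typing import Iterable


class Level(Enum):
    """The three enforcement levels named on the slide."""
    UI = "ui"
    DATA = "data"
    MAPREDUCE = "mapreduce"


# Hypothetical roles and grants, for illustration only.
ROLE_GRANTS = {
    "admin": {Level.UI, Level.DATA, Level.MAPREDUCE},
    "analyst": {Level.UI, Level.DATA},
    "viewer": {Level.UI},
}


def is_authorized(roles: Iterable[str], level: Level) -> bool:
    """Return True if any of the user's roles grants access at this level."""
    return any(level in ROLE_GRANTS.get(role, set()) for role in roles)


if __name__ == "__main__":
    print(is_authorized(["analyst"], Level.DATA))
    print(is_authorized(["viewer"], Level.MAPREDUCE))
```

Unknown roles fall through to an empty grant set, so access is denied by default.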
BigSheets to analyze and visualize 
• Model “big data” collected 
from various sources in 
spreadsheet-like structures 
• Filter and enrich content with 
built-in functions 
• Combine data in different 
workbooks 
• Visualize results through 
spreadsheets, charts 
• Export data into common 
formats (if desired) 
No programming knowledge needed! 
Centralized dashboard & data flows 
A centralized dashboard to 
visualize analytic results: 
• BigSheets collections 
• Analytic application results 
• Monitoring metrics 
• Ability to view BigSheets data flows between 
and across data sets to quickly navigate and 
relate analysis and charts 
• Visualize inner and outer joins, enhanced filters 
for BigSheets columns, column data-type 
mapping for collections, and application of 
analytics to BigSheets columns, etc.
Tools for Developers 
Editors 
• A workflow editor that greatly simplifies the creation of complex Oozie workflows with a consumable interface 
• A Pig/Jaql editor with content assist and syntax highlighting that enables users to create and execute new applications using Pig or Jaql in local or cluster mode from the Eclipse IDE 
Application development & deployment 
• Enablement of BigSheets macro and BigSheets reader development 
• Text Analytics development, including support for modular rule sets 
• Publish new application: BigSheets Macro, BigSheets Reader, AQL module, Jaql module 
1. Sample your data 
2. Develop your application using BigInsights tools 
3. Test your application 
4. Package and publish your application 
5. Deploy your application on the cluster
Running Applications on Big Data 
• Browse available applications 
• Deploy published applications 
(administrators only) 
• Launch (or schedule for launch) a 
deployed application 
• Monitor job (application) execution 
status 
• Predefined applications 
• Import & Export Data 
• Database & Files 
• Web and Social 
• Analyze and Query 
• Predictive Analytics 
• Text Analytics 
• SQL/Hive, Jaql, Pig, HBase 
• Accelerators 
Application linking and interfaces to build new apps 
• Compose new 
applications from 
existing applications 
and BigSheets 
• Invoke analytics 
applications from the 
web console, including 
integration within 
BigSheets 
• REST data source App 
that enables users to 
load data from any data source supporting REST APIs into BigInsights, 
including popular social media services 
• Sampling App that enables users to sample data for analysis 
• Subsetting App that enables users to subset data for data analysis 
Collaborative Big Data for many roles 
• Business Users can get their hands on big 
data and use big data applications and 
BigSheets to get insights into their data 
• Data scientists can perform deeper 
analysis and get richer insights 
• Administrators are empowered to be 
more agile through better controls and 
views into key performance indicators 
• Developers can leverage unified tooling in a Big Data 
Application Development Lifecycle and are able to 
create and deploy new types of applications, with 
enhancements that simplify even complex workflows 
Big SQL 3.0 – Architected for Performance 
• Leverage IBM's rich SQL heritage, expertise, and technology 
– Modern SQL:2011 capabilities 
– DB2 compatible SQL PL support 
• SQL bodied functions and stored procedures 
• Application logic/security encapsulation 
• Architected from the ground up for performance 
– low latency and high throughput 
• MapReduce replaced with a modern MPP 
architecture 
– Compiler and runtime are native code (not Java) 
– Big SQL worker daemons live directly on cluster 
– Continuously running (no startup latency) 
– Processing happens locally at the data 
• Operations occur in memory with the ability 
to spill to disk 
– Supports aggregations and sorts larger than available RAM 
• Integration with BigSheets (source & target) 
[Diagram: a SQL-based application connects through the IBM Data Server Client to Big SQL and its SQL MPP runtime inside InfoSphere BigInsights. Data sources: Parquet, CSV, Seq, RC, Avro, ORC, JSON, Custom.]
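The MPP idea on this slide, where processing happens locally at the data and a coordinator combines the results, can be illustrated with a two-phase aggregation. This is a sketch of the general technique only, not Big SQL's runtime: the function names are hypothetical, threads stand in for worker daemons, and in-memory lists stand in for data partitions. It computes the equivalent of `SELECT key, SUM(amount) ... GROUP BY key`.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[str, int]


def local_aggregate(partition: List[Row]) -> Counter:
    """Worker step: sum amounts per key using only the local rows."""
    totals: Counter = Counter()
    for key, amount in partition:
        totals[key] += amount
    return totals


def coordinate(partitions: List[List[Row]]) -> Dict[str, int]:
    """Coordinator step: run the workers in parallel, then merge partial sums."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_aggregate, partitions))
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts key by key
    return dict(merged)


if __name__ == "__main__":
    demo_partitions = [[("a", 1), ("b", 2)], [("a", 3)], []]
    print(coordinate(demo_partitions))
```

Only the small per-key partial sums travel to the coordinator, which is why pushing work to the data reduces network traffic.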
Big SQL 3.0 – Architecture cont. 
• Head (coordinator / management) node 
– Listens for JDBC/ODBC connections and compiles / optimizes the query 
– Coordinates the execution of the query 
– Optionally stores user data in a traditional RDBMS table (single node only) 
• Big SQL worker processes reside on compute nodes (some or all) 
• Worker nodes stream data between each other as needed 
• Workers can spill large data sets to local disk if needed 
– Allows Big SQL to work with data sets 
larger than available memory 
[Diagram: management nodes host Big SQL, the Hive Metastore, the Name Node, and the Job Tracker. Compute nodes host a Task Tracker, a Data Node, and a Big SQL worker. All nodes sit on GPFS/HDFS.]
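Spilling to local disk is what allows a sort to exceed available memory. A common way to do this is an external merge sort: sort fixed-size chunks in memory, write each sorted run to disk, then stream a merge over the runs. The sketch below shows that general technique under assumptions; it is not Big SQL's code, and the function names and the `max_in_memory` parameter are hypothetical.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _spill(sorted_chunk: List[int]) -> str:
    """Write one sorted run to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.writelines(f"{value}\n" for value in sorted_chunk)
    return path


def _read_run(path: str) -> Iterator[int]:
    """Stream one sorted run back from disk, one value at a time."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            yield int(line)


def external_sort(values: Iterable[int], max_in_memory: int) -> Iterator[int]:
    """Yield values in sorted order, buffering at most max_in_memory
    values at a time while the sorted runs are being built."""
    if max_in_memory < 1:
        raise ValueError("max_in_memory must be at least 1")
    run_paths: List[str] = []
    chunk: List[int] = []
    for value in values:
        chunk.append(value)
        if len(chunk) >= max_in_memory:
            run_paths.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_paths.append(_spill(sorted(chunk)))
    try:
        # heapq.merge keeps only one pending value per run in memory.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for path in run_paths:
            os.remove(path)


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2], max_in_memory=2)))
```

With `max_in_memory=2`, the six input values are written as three sorted runs and merged from disk.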
Big SQL 3.0 – Features 
Application Portability & Integration 
Data shared with Hadoop ecosystem 
Comprehensive file format support 
Superior enablement of IBM software 
Enhanced by Third Party software 
Performance 
Modern MPP runtime 
Powerful SQL query rewriter 
Cost based optimizer 
Optimized for concurrent user throughput 
Results not constrained by memory 
Rich SQL 
Comprehensive SQL Support 
IBM SQL PL compatibility 
Federation 
Distributed requests to multiple data 
sources within a single SQL statement 
Main data sources supported: 
DB2 LUW, DB2/z, Teradata, Oracle, Netezza 
Enterprise Features 
Advanced security/auditing 
Resource and workload management 
Self tuning memory management 
Comprehensive monitoring 
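Federation means a single SQL statement can reach more than one data source. As a loose analogy only, the sketch below uses SQLite's `ATTACH DATABASE` from the Python standard library to join tables that live in two separate databases in one statement. It does not use Big SQL or any of the sources named on the slide, and the table names and sample rows are invented for the demonstration.

```python
import sqlite3
from typing import List, Tuple

# One statement that spans two databases: "main" and the attached "warehouse".
FEDERATED_QUERY = """
SELECT c.name, SUM(k.clicks) AS total_clicks
FROM main.clicks AS k
JOIN warehouse.customers AS c ON c.customer_id = k.customer_id
GROUP BY c.name
ORDER BY c.name
"""


def build_demo_connection() -> sqlite3.Connection:
    """Create two separate in-memory databases with made-up demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
    conn.execute("CREATE TABLE main.clicks (customer_id INTEGER, clicks INTEGER)")
    conn.execute("CREATE TABLE warehouse.customers (customer_id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO main.clicks VALUES (?, ?)",
                     [(1, 10), (2, 5), (1, 7)])
    conn.executemany("INSERT INTO warehouse.customers VALUES (?, ?)",
                     [(1, "Ada"), (2, "Grace")])
    return conn


def run_federated_query(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Run the cross-database join and return (name, total_clicks) rows."""
    return conn.execute(FEDERATED_QUERY).fetchall()


if __name__ == "__main__":
    print(run_federated_query(build_demo_connection()))
```

The application sees one connection and one statement; where each table physically lives is hidden behind the schema qualifier.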
BigSQL Demo 
Comparing Big SQL 3.0 and Hive 0.12 for Ad-Hoc Queries 
[Chart: elapsed time (sec) by query number, queries 1 to 22, for Big SQL 3.0 on Parquet vs. Hive 0.12 on ORC, 1TB Classic BI Workload.]
Big SQL is up to 41x faster than Hive 0.12* 
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" 
in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, 
running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and 
TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, 
each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, 
configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014. 
IBM BigInsights brings efficient integration of R with Big R 
• R as a big data query language 
– Outside-in execution 
• R as a statistical language for 
deep computing 
– Inside-out execution 
– Partitioning of large data (“divide”) 
– Parallel cluster execution of pushed 
down R code (“conquer”) 
– Almost any R package can run in 
this environment 
• R as the gateway to scalable 
machine learning 
– A scalable ML engine that provides 
canned algorithms, and an ability to 
author new ones, all via R 
R Clients 
Pull data 
(summaries) to 
R client 
Scalable 
ML 
Engine 
Data Sources 
R Packages 
R Packages 
Embedded R Execution 
Or, push R 
functions right 
on the data 
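The "divide" and "conquer" steps of inside-out execution can be sketched in a few lines. The example is written in Python rather than R and is not the Big R API: the function names are hypothetical, threads stand in for cluster nodes, and the demo rows are invented. It shows only the pattern of partitioning data by a key and applying one user function to each partition in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Any, Callable, Dict, List, Sequence


def partition_by(rows: Sequence[tuple], key_index: int) -> Dict[Any, List[tuple]]:
    """Divide: group rows into partitions by the value at key_index."""
    groups: Dict[Any, List[tuple]] = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def apply_per_partition(groups: Dict[Any, List[tuple]],
                        func: Callable[[List[tuple]], Any]) -> Dict[Any, Any]:
    """Conquer: run func on every partition in parallel, keyed by partition."""
    keys = list(groups)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(func, (groups[key] for key in keys)))
    return dict(zip(keys, results))


def average_delay(partition: List[tuple]) -> float:
    """Example pushed-down function: mean of the second column."""
    return mean(row[1] for row in partition)


if __name__ == "__main__":
    # Made-up rows of (carrier, delay_minutes).
    demo_rows = [("LH", 10), ("BA", 4), ("LH", 20)]
    print(apply_per_partition(partition_by(demo_rows, 0), average_delay))
```

Each partition is processed independently, so the user function never needs to see the whole data set at once.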
Text Analytics in BigInsights 
Distill structured information from 
unstructured data 
– Rich annotator library supports multiple 
languages 
– Declarative Information Extraction (IE) system 
based on an algebraic framework 
– Richer, cleaner rule semantics 
– Better performance through optimization 
How it works 
• Parses text and detects meaning with annotators 
• Understands the context in which the text is 
analyzed 
• Hundreds of pre-built annotators for names, 
addresses, phone numbers, among others 
Accuracy 
• Highly accurate in deriving meaning from 
complex text 
Performance 
• AQL language optimized for MapReduce 
Unstructured text (document, email, etc) 
Football World Cup 2010, one team 
distinguished themselves well, losing to 
the eventual champions 1-0 in the Final. 
Early in the second half, Netherlands’ 
striker, Arjen Robben, had a breakaway, 
but the keeper for Spain, Iker Casillas 
made the save. Winger Andres Iniesta 
scored for Spain for the win. 
Classification and Insight 
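To show what distilling structured information from the sample text above might look like, here is a toy extractor. It is written with Python regular expressions, not AQL, and it is far simpler than a real annotator library: the role cues, the team dictionary, and the function name are assumptions chosen to fit this one passage.

```python
import re
from typing import Dict, List, Optional, Tuple

SAMPLE_TEXT = (
    "Football World Cup 2010, one team distinguished themselves well, losing to "
    "the eventual champions 1-0 in the Final. Early in the second half, "
    "Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for "
    "Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain "
    "for the win."
)

# Hypothetical dictionary and patterns, tuned to this passage only.
TEAM_DICTIONARY = ("Netherlands", "Spain")
SCORE_PATTERN = re.compile(r"\b(\d+)-(\d+)\b")
# A role cue, then the nearest "Firstname Lastname" within the same sentence.
PLAYER_PATTERN = re.compile(
    r"(?i:\b(striker|keeper|winger)\b)[^.]*?\b([A-Z][a-z]+ [A-Z][a-z]+)\b"
)


def annotate(text: str) -> Dict[str, object]:
    """Return the score, teams, and (role, player) pairs found in text."""
    score_match = SCORE_PATTERN.search(text)
    score: Optional[str] = score_match.group(0) if score_match else None
    teams: List[str] = [team for team in TEAM_DICTIONARY if team in text]
    players: List[Tuple[str, str]] = [
        (match.group(1).lower(), match.group(2))
        for match in PLAYER_PATTERN.finditer(text)
    ]
    return {"score": score, "teams": teams, "players": players}


if __name__ == "__main__":
    print(annotate(SAMPLE_TEXT))
```

A production extractor would handle accented names, unknown teams, and other sentence structures; this sketch only makes the idea of turning free text into fields concrete.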
BigInsights offers value beyond Open Source 
Enterprise Capabilities 
Visualization & Exploration 
Development Tools 
Advanced Engines 
Connectors 
Workload Optimization 
Administration & Security 
Key differentiators 
• Built-in analytics 
• Enterprise software integration 
• Spreadsheet-style analysis 
• Integrated installation of supported open 
source and other components 
[Diagram: open source components, IBM-certified Apache Hadoop]
• Web Console for admin and application 
access 
• Platform enrichment: additional security, 
performance features, . . . 
• World-class support 
• Full open source compatibility 
Business benefits 
• Quicker time-to-value due to IBM 
technology and support 
• Reduced operational risk 
• Enhanced business knowledge with flexible 
analytical platform 
• Leverages and complements existing 
software 
InfoSphere BigInsights for Hadoop includes the latest Open 
Source components, enhanced by enterprise components 
IBM InfoSphere BigInsights for Hadoop 
Visualization & Ad 
Hoc Analytics 
BigSheets 
Charting Dashboard 
Advanced Analytics 
R Big R Analytics 
Data 
Access 
Runtime 
Data Store 
File System 
Security 
Resource Management & 
Oozie 
Administration 
YARN* 
Applications & Development 
Governance 
Text 
Jaql 
Eclipse Tooling: 
MapReduce, Hive, Jaql, 
Pig, Big SQL, AQL 
Flume 
Sqoop 
HCatalog 
Hive Pig 
MapReduce 
HBase 
HDFS 
BigSheets Reader 
and Macro 
Text Analytics 
Extractors 
Stream Computing 
Streams 
Adaptive MapReduce 
Solr/ 
Lucene 
Enterprise 
Search 
ETL 
Big SQL 
Open Source IBM 
Kerberos 
ZooKeeper 
Console Monitoring 
Audit & History 
GPFS FPO 
LDAP Data Security for Hadoop 
Data Masking Data Matching Data Privacy for Hadoop 
Search 
Flexible 
Scheduler 
* In Beta 
From Getting Started to Enterprise Deployment: 
Different BigInsights Editions For Varying Needs 
Quick Start 
Free. Non-production 
Same features as Standard Edition plus text analytics and Big R 
Standard Edition 
- Spreadsheet-style tool 
- Dashboards 
- Pre-built applications 
- Eclipse tooling 
- RDBMS connectivity 
- Monitoring and alerts 
- Platform enhancements 
- Web console 
- Big SQL 
- . . . 
Enterprise Edition 
- Accelerators 
- GPFS FPO 
- Adaptive MapReduce 
- Text analytics 
- Enterprise Integration 
- Big R 
- InfoSphere Streams* 
- Watson Explorer* 
- Cognos BI* 
- Data Click* 
- . . . 
* Limited use license 
Apache Hadoop 
[Diagram axes: breadth of capabilities, enterprise class]

More Related Content

What's hot

Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
DataWorks Summit
 

What's hot (20)

Microsoft cloud big data strategy
Microsoft cloud big data strategyMicrosoft cloud big data strategy
Microsoft cloud big data strategy
 
Hadoop and Hive in Enterprises
Hadoop and Hive in EnterprisesHadoop and Hive in Enterprises
Hadoop and Hive in Enterprises
 
A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0 A Reference Architecture for ETL 2.0
A Reference Architecture for ETL 2.0
 
Schema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-WriteSchema-on-Read vs Schema-on-Write
Schema-on-Read vs Schema-on-Write
 
Introduction to Azure Data Lake
Introduction to Azure Data LakeIntroduction to Azure Data Lake
Introduction to Azure Data Lake
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0
 
Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study lab
 
Choosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloudChoosing technologies for a big data solution in the cloud
Choosing technologies for a big data solution in the cloud
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Hadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data WarehouseHadoop and Enterprise Data Warehouse
Hadoop and Enterprise Data Warehouse
 
Data Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop ImplementationData Lake for the Cloud: Extending your Hadoop Implementation
Data Lake for the Cloud: Extending your Hadoop Implementation
 
Breakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data StoreBreakout: Hadoop and the Operational Data Store
Breakout: Hadoop and the Operational Data Store
 
Data Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best PracticesData Lake - Multitenancy Best Practices
Data Lake - Multitenancy Best Practices
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OS
 
Hadoop data-lake-white-paper
Hadoop data-lake-white-paperHadoop data-lake-white-paper
Hadoop data-lake-white-paper
 
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
Big Data:  Big SQL web tooling (Data Server Manager) self-study labBig Data:  Big SQL web tooling (Data Server Manager) self-study lab
Big Data: Big SQL web tooling (Data Server Manager) self-study lab
 
What's new in SQL Server 2016
What's new in SQL Server 2016What's new in SQL Server 2016
What's new in SQL Server 2016
 
The Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the SameThe Future of Data Warehousing: ETL Will Never be the Same
The Future of Data Warehousing: ETL Will Never be the Same
 
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UKSUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
SUSE, Hadoop and Big Data Update. Stephen Mogg, SUSE UK
 

Viewers also liked

[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka
[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka
[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka
Insight Technology, Inc.
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
Romeo Kienzler
 
Value proposition for big data isv partners 0714
Value proposition for big data isv partners 0714Value proposition for big data isv partners 0714
Value proposition for big data isv partners 0714
Niu Bai
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka Streams
Konrad Malawski
 

Viewers also liked (20)

SQL on Hadoop 比較検証 【2014月11日における検証レポート】
SQL on Hadoop 比較検証 【2014月11日における検証レポート】SQL on Hadoop 比較検証 【2014月11日における検証レポート】
SQL on Hadoop 比較検証 【2014月11日における検証レポート】
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Next Generation Spend Analytics & Data Visualization
Next Generation Spend Analytics & Data VisualizationNext Generation Spend Analytics & Data Visualization
Next Generation Spend Analytics & Data Visualization
 
[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka
[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka
[D35] 今ミッション・クリティカル環境で求められるデータベース・クラスタリング技術とは? by Kousuke Osaka
 
Puredataの基礎
Puredataの基礎Puredataの基礎
Puredataの基礎
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experience
 
InfoSphere BigInsights
InfoSphere BigInsightsInfoSphere BigInsights
InfoSphere BigInsights
 
2014.07.11 biginsights data2014
2014.07.11 biginsights data20142014.07.11 biginsights data2014
2014.07.11 biginsights data2014
 
理論から学ぶデータベース実践入門Night(mvccでちょっとハマった話)
理論から学ぶデータベース実践入門Night(mvccでちょっとハマった話)理論から学ぶデータベース実践入門Night(mvccでちょっとハマった話)
理論から学ぶデータベース実践入門Night(mvccでちょっとハマった話)
 
Programmatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & DruidProgrammatic Bidding Data Streams & Druid
Programmatic Bidding Data Streams & Druid
 
The HR 3.0 Framework
The HR 3.0 FrameworkThe HR 3.0 Framework
The HR 3.0 Framework
 
Analytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big dataAnalytics 3.0 Measurable business impact from analytics & big data
Analytics 3.0 Measurable business impact from analytics & big data
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
「360°スゴイ」を創るVOYAGE GROUPエンジニア成長施策
「360°スゴイ」を創るVOYAGE GROUPエンジニア成長施策「360°スゴイ」を創るVOYAGE GROUPエンジニア成長施策
「360°スゴイ」を創るVOYAGE GROUPエンジニア成長施策
 
Value proposition for big data isv partners 0714
Value proposition for big data isv partners 0714Value proposition for big data isv partners 0714
Value proposition for big data isv partners 0714
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Reactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka StreamsReactive Stream Processing with Akka Streams
Reactive Stream Processing with Akka Streams
 
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 

Similar to Big SQL 3.0 - Fast and easy SQL on Hadoop

Preparing for BI in the Cloud with Windows Azure
Preparing for BI in the Cloud with Windows AzurePreparing for BI in the Cloud with Windows Azure
Preparing for BI in the Cloud with Windows Azure
Perficient, Inc.
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Pentaho
 
Bigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExpBigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExp
bigdata sunil
 

Similar to Big SQL 3.0 - Fast and easy SQL on Hadoop (20)

Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
QuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing WebinarQuerySurge Slide Deck for Big Data Testing Webinar
QuerySurge Slide Deck for Big Data Testing Webinar
 
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
Hadoop Security in Big-Data-as-a-Service Deployments - Presented at Hadoop Su...
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal Modernization
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
End-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service DeploymentEnd-to-End Security and Auditing in a Big Data as a Service Deployment
End-to-End Security and Auditing in a Big Data as a Service Deployment
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
 
Accelerating Data Warehouse Modernization
Accelerating Data Warehouse ModernizationAccelerating Data Warehouse Modernization
Accelerating Data Warehouse Modernization
 
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
Hadoop in 2015: Keys to Achieving Operational Excellence for the Real-Time En...
 
IBM - Introduction to Cloudant
IBM - Introduction to CloudantIBM - Introduction to Cloudant
IBM - Introduction to Cloudant
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
 
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT InfrastructuresOPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
OPEN'17_4_Postgres: The Centerpiece for Modernising IT Infrastructures
 
Preparing for BI in the Cloud with Windows Azure
Preparing for BI in the Cloud with Windows AzurePreparing for BI in the Cloud with Windows Azure
Preparing for BI in the Cloud with Windows Azure
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
Bigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExpBigdata.sunil_6+yearsExp
Bigdata.sunil_6+yearsExp
 
Resume
ResumeResume
Resume
 

More from Wilfried Hoge

2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2
Wilfried Hoge
 

More from Wilfried Hoge (8)

Cloud Data Services - from prototyping to scalable analytics on cloud
Cloud Data Services - from prototyping to scalable analytics on cloudCloud Data Services - from prototyping to scalable analytics on cloud
Cloud Data Services - from prototyping to scalable analytics on cloud
 
Is it harder to find a taxi when it is raining?
Is it harder to find a taxi when it is raining? Is it harder to find a taxi when it is raining?
Is it harder to find a taxi when it is raining?
 
innovations born in the cloud - cloud data services from IBM to prototype you...
innovations born in the cloud - cloud data services from IBM to prototype you...innovations born in the cloud - cloud data services from IBM to prototype you...
innovations born in the cloud - cloud data services from IBM to prototype you...
 
2015.05.07 watson rp15
2015.05.07 watson rp152015.05.07 watson rp15
2015.05.07 watson rp15
 
Twitter analytics in Bluemix
Twitter analytics in BluemixTwitter analytics in Bluemix
Twitter analytics in Bluemix
 
2013.12.12 big data heise webcast
2013.12.12 big data heise webcast2013.12.12 big data heise webcast
2013.12.12 big data heise webcast
 
2012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum22012.04.26 big insights streams im forum2
2012.04.26 big insights streams im forum2
 
IBM - Big Value from Big Data
IBM - Big Value from Big DataIBM - Big Value from Big Data
IBM - Big Value from Big Data
 

Big SQL 3.0 - Fast and easy SQL on Hadoop

  • 1. Big SQL 3.0: Fast and easy SQL on Hadoop
       Wilfried Hoge, IT Architect Big Data, hoge@de.ibm.com, @wilfriedhoge
       z/OS und LUW
       © 2014 IBM Corporation
  • 2. Hadoop Observations
       Technology
       • Rapid innovation
       • Two sources of innovation
         – Open source community
         – Integration of existing technologies
       • Tools and application vendors selecting partners and integrating
       Customers
       • High degree of interest
       • Many experimental workstreams
       • ROI establishment varies by use case
       • Many customers want to offload data from EDW
       Vendors
       • Multiple business models
       • OSS support vendors have mindshare lead
       • OSS support vendors' business model viability unclear
       • SW portfolio vendors integrating/adding
       © 2014 International Business Machines Corporation
  • 3. InfoSphere BigInsights provides Enterprise Grade Hadoop analytics
       • Manages a wide variety and huge volume of data
       • Augments open source Hadoop with enterprise capabilities
         – Visualization & Exploration
         – Development tools
         – Advanced Engines
         – Connectors
         – Workload Optimization
         – Enterprise integration
         – Analytic Accelerators
         – Application and industry accelerators
         – Administration & Security
       Diagram (Big Data Platform): Application Development, Discovery, Accelerators, Systems Management; Hadoop System, Stream Computing, Data Warehouse; Information Integration & Governance; data types: Data, Media, Content, Machine, Social
  • 4. Key Differentiators for BigInsights
       Enterprise Performance & Integration
       • Workload / performance optimization
       • GPFS
       • Security
       • Key integrations & connectors with the enterprise ecosystem
       Analytics
       • Text analytics
       • Social Data Analytics Accelerators
       • Machine Data Analytics Accelerators
       • Execute R in an integrated application
       Usability & Productivity
       • Big SQL
       • BigSheets
       • Development Tools
       • Web Console
  • 5. Integrated Web Console
       • Manage BigInsights
         – Inspect / monitor system health
         – Add / drop nodes
         – Start / stop services
         – Run / monitor jobs (applications)
         – Explore / modify file system
         – Create custom dashboards
       • Launch applications
         – Spreadsheet-like analysis tool
         – Pre-built applications (IBM supplied or user developed)
       • Publish applications
       • Monitor cluster, applications, data
         – Create / view event alerts
  • 6. Distributed Filesystem
       GPFS FPO gives additional flexibility, security and high availability
       • Optional file system alternative to HDFS
       • More than 10 years of experience with HPC
       • Key features
         – No single point of failure
         – Built-in High Availability
         – POSIX compliance
           • Standard applications cannot use HDFS but they can use GPFS-FPO
         – Enhanced Security
         – Higher performance
           • Allows concurrent read and write by multiple programs
         – Recovery capabilities
           • Journaling filesystem
         – Support for Storage Pools
         – Snapshot capability
       Diagram (stack, top to bottom): Applications; High level languages (SQL, JAQL, PIG, …); Map/Reduce API; Hadoop DFS API; distributed filesystem (GPFS or HDFS)
  • 7. BigInsights has a simple but effective security system based on a gateway to Hadoop
       • All Hadoop servers are connected over a private network
       • Unrestricted communication between cluster servers on the private network
       • BigInsights Web Console acts as a gateway into the cluster
       • Authentication through PAM or LDAP
       • Role based authorization
       • Authorization will be enforced at 3 levels:
         – UI level
         – Data level
         – Map-Reduce level
       • Authorization also respected by services (e.g. SQL)
       • Kerberos support
       Diagram: Users and Sources (external); Authentication Authority; Gateway / Web Console; Services; Data Nodes; Infrastr. Nodes; Distributed Filesystem
  • 8. BigSheets to analyze and visualize
       • Model “big data” collected from various sources in spreadsheet-like structures
       • Filter and enrich content with built-in functions
       • Combine data in different workbooks
       • Visualize results through spreadsheets, charts
       • Export data into common formats (if desired)
       No programming knowledge needed!
  • 9. Centralized dashboard & data flows
       A centralized dashboard to visualize analytic results:
       • BigSheets collections
       • Analytic application results
       • Monitoring metrics
       • Ability to view BigSheets data flows between and across data sets to quickly navigate and relate analysis and charts
       • Visualize inner and outer joins, enhanced filters for BigSheets columns, column data-type mapping for collections, application of analytics to BigSheets columns, etc.
  • 10. Tools for Developers
       1. Sample your data
       2. Develop your application using BigInsights tools
       3. Test your application
       4. Package and publish your application
       5. Deploy your application on the cluster
       Editors
       • A workflow editor that greatly simplifies the creation of complex Oozie workflows with a consumable interface
       • A Pig/Jaql editor with content assist and syntax highlighting that enables users to create and execute new applications using Pig or Jaql in local or cluster mode from the Eclipse IDE
       Application development & deployment
       • Enablement of BigSheets macro and BigSheets reader development
       • Text Analytics development, including support for modular rule sets
       • Publish new application: BigSheets Macro, BigSheets Reader, AQL module, Jaql module
  • 11. Running Applications on Big Data
       • Browse available applications
       • Deploy published applications (administrators only)
       • Launch (or schedule for launch) a deployed application
       • Monitor job (application) execution status
       • Predefined applications
         – Import & Export Data
           • Database & Files
           • Web and Social
         – Analyze and Query
           • Predictive Analytics
           • Text Analytics
           • SQL/Hive, Jaql, Pig, HBase
         – Accelerators
  • 12. Application linking and interfaces to build new apps
       • Compose new applications from existing applications and BigSheets
       • Invoke analytics applications from the web console, including integration within BigSheets
       • REST data source App that enables users to load data from any data source supporting REST APIs into BigInsights, including popular social media services
       • Sampling App that enables users to sample data for analysis
       • Subsetting App that enables users to subset data for data analysis
  • 13. Collaborative Big Data for many roles
       • Business users can get their hands on big data and use big data applications and BigSheets to get insights into their data
       • Data scientists can perform deeper analysis and get richer insights
       • Administrators are empowered to be more agile through better controls and views into key performance indicators
       • Developers can leverage unified tooling in a Big Data Application Development Lifecycle and are able to create and deploy new types of applications, with enhancements that simplify even complex workflows
  • 14. Big SQL 3.0 – Architected for Performance
       • Leverage IBM's rich SQL heritage, expertise, and technology
         – Modern SQL:2011 capabilities
         – DB2 compatible SQL PL support
           • SQL bodied functions and stored procedures
           • Application logic/security encapsulation
       • Architected from the ground up for performance: low latency and high throughput
       • MapReduce replaced with a modern MPP architecture
         – Compiler and runtime are native code (not Java)
         – Big SQL worker daemons live directly on cluster
         – Continuously running (no startup latency)
         – Processing happens locally at the data
       • Operations occur in memory with the ability to spill to disk
         – Supports aggregations and sorts larger than available RAM
       • Integration with BigSheets (source & target)
       Diagram: SQL-based Application; IBM Data Server Client; Big SQL (SQL MPP Runtime); Data Sources (Parquet, CSV, Seq, RC, Avro, ORC, JSON, Custom); all within InfoSphere BigInsights
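
Slide 14 states that operations run in memory and can spill to disk, so sorts may exceed available RAM. The deck does not describe how Big SQL implements this. The following is only a minimal Python sketch of the general technique (an external merge sort), with the function name `external_sort` and the `max_in_memory` parameter invented for illustration:

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory=4):
    """Yield ints in sorted order, buffering at most max_in_memory items before spilling."""
    run_paths = []
    buffer = []

    def spill(buf):
        # Write one sorted run to a temporary file on local disk.
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            for v in sorted(buf):
                f.write(f"{v}\n")
        run_paths.append(path)

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            spill(buffer)
            buffer = []
    if buffer:
        spill(buffer)

    def read_run(path):
        # Stream a spilled run back one value at a time.
        with open(path) as f:
            for line in f:
                yield int(line)

    try:
        # heapq.merge keeps only one pending item per run in memory.
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2, 8], max_in_memory=3)))
```

The memory bound here is the buffer size plus one item per spilled run. A production engine would also limit the number of runs merged at once, which this sketch does not do.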
  • 15. Big SQL 3.0 – Architecture cont.
       • Head (coordinator / management) node
         – Listens to the JDBC/ODBC connections and compiles / optimizes the query
         – Coordinates the execution of the query
         – Optionally stores user data in a traditional RDBMS table (single node only)
       • Big SQL worker processes reside on compute nodes (some or all)
       • Worker nodes stream data between each other as needed
       • Workers can spill large data sets to local disk if needed
         – Allows Big SQL to work with data sets larger than available memory
       Diagram: management nodes host Big SQL, Hive Metastore, Name Node and Job Tracker; each compute node hosts Task Tracker, Data Node and Big SQL; all on GPFS/HDFS
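
The head/worker split on slide 15 can be illustrated with a small Python sketch of a distributed average: each worker aggregates only its local rows, and the coordinator merges the partial results. This shows the general MPP pattern, not Big SQL's actual runtime. The function names and the sample rows are hypothetical.

```python
from collections import defaultdict


def worker_partial_avg(rows):
    """Run on a worker: reduce local (key, value) rows to per-key [sum, count]."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)


def coordinator_merge(partials):
    """Run on the head node: combine worker partials into final per-key averages."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, c) in partial.items():
            total[key][0] += s
            total[key][1] += c
    return {key: s / c for key, (s, c) in total.items()}


if __name__ == "__main__":
    # Hypothetical local data held by two compute nodes.
    node_a = [("DE", 10.0), ("US", 4.0)]
    node_b = [("DE", 20.0)]
    partials = [worker_partial_avg(node_a), worker_partial_avg(node_b)]
    print(coordinator_merge(partials))
```

Shipping sums and counts rather than raw rows keeps the data movement between nodes small, which matches the slide's point that processing happens locally at the data.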
  • 16. Big SQL 3.0 – Features
       Application Portability & Integration
       • Data shared with Hadoop ecosystem
       • Comprehensive file format support
       • Superior enablement of IBM software
       • Enhanced by third party software
       Performance
       • Modern MPP runtime
       • Powerful SQL query rewriter
       • Cost based optimizer
       • Optimized for concurrent user throughput
       • Results not constrained by memory
       Rich SQL
       • Comprehensive SQL support
       • IBM SQL PL compatibility
       Federation
       • Distributed requests to multiple data sources within a single SQL statement
       • Main data sources supported: DB2 LUW, DB2/z, Teradata, Oracle, Netezza
       Enterprise Features
       • Advanced security/auditing
       • Resource and workload management
       • Self tuning memory management
       • Comprehensive monitoring
  • 17. Big SQL Demo
  • 18. Comparing Big SQL 3.0 and Hive 0.12 for Ad-Hoc Queries
       Big SQL is up to 41x faster than Hive 0.12
       [Chart: Big SQL 3.0 Parquet vs Hive 0.12 ORC, 1TB Classic BI Workload. X axis: query number (1 to 22). Y axis: elapsed time (sec), 0 to 3500. Series: Hive 0.12, Big SQL 3.0.]
       *Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014
  • 19. IBM BigInsights brings efficient integration of R with Big R
       • R as a big data query language
         – Outside-in execution
       • R as a statistical language for deep computing
         – Inside-out execution
         – Partitioning of large data (“divide”)
         – Parallel cluster execution of pushed down R code (“conquer”)
         – Almost any R package can run in this environment
       • R as the gateway to scalable machine learning
         – A scalable ML engine that provides canned algorithms, and an ability to author new ones, all via R
       Diagram: R Clients with R Packages pull data (summaries) to the R client or, alternatively, push R functions right on the data; on the cluster: Scalable ML Engine, Embedded R Execution with R Packages, Data Sources
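
The "divide" and "conquer" steps on slide 19 follow a partition-apply-combine pattern. Big R exposes this through R. The sketch below uses Python threads purely to illustrate the pattern. The helper names `partition_by` and `group_apply` and the sample rows are assumptions and are not part of Big R.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def partition_by(rows, key_index):
    """Divide: group rows into partitions by the value in column key_index."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key_index], []).append(row)
    return groups


def group_apply(rows, key_index, func, workers=2):
    """Conquer: run func on every partition in parallel and collect the results by key."""
    groups = partition_by(rows, key_index)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(func, groups.values())
        return dict(zip(groups.keys(), results))


if __name__ == "__main__":
    # Hypothetical (carrier, delay_minutes) rows.
    rows = [("LH", 12), ("LH", 8), ("BA", 30)]
    print(group_apply(rows, 0, lambda part: mean(r[1] for r in part)))
```

On a real cluster the pushed-down function would run next to each data partition instead of in local threads.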
  • 20. Text Analytics in BigInsights
       Distill structured information from unstructured data
         – Rich annotator library supports multiple languages
         – Declarative Information Extraction (IE) system based on an algebraic framework
         – Richer, cleaner rule semantics
         – Better performance through optimization
       How it works
       • Parses text and detects meaning with annotators
       • Understands the context in which the text is analyzed
       • Hundreds of pre-built annotators for names, addresses, phone numbers, among others
       Accuracy
       • Highly accurate in deriving meaning from complex text
       Performance
       • AQL language optimized for MapReduce
       Example of unstructured text (document, email, etc.) used for classification and insight:
       Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.
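
To make the idea of rule-based annotators on slide 20 concrete, here is a small Python sketch that pulls player names and roles out of the slide's sample text. It uses plain regular expressions as a stand-in. It is not AQL and does not reflect how the BigInsights annotator library works. The rules and the `extract_players` helper are hypothetical.

```python
import re

# Excerpt of the sample text from the slide (straight apostrophe used here).
TEXT = (
    "Early in the second half, Netherlands' striker, Arjen Robben, had a "
    "breakaway, but the keeper for Spain, Iker Casillas made the save. "
    "Winger Andres Iniesta scored for Spain for the win."
)

# Naive person-name pattern: two capitalized words.
NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"

# Each rule pairs a role label with a pattern that captures the player's name.
RULES = [
    ("striker", re.compile(rf"striker, ({NAME})")),
    ("keeper", re.compile(rf"keeper for [A-Z][a-z]+, ({NAME})")),
    ("winger", re.compile(rf"Winger ({NAME})")),
]


def extract_players(text):
    """Apply every rule and return one record per match, in rule order."""
    found = []
    for role, pattern in RULES:
        for match in pattern.finditer(text):
            found.append({"role": role, "name": match.group(1)})
    return found


if __name__ == "__main__":
    for record in extract_players(TEXT):
        print(record)
```

Hand-written patterns like these are brittle. The slide's point is that a declarative rule language with an optimizer and pre-built annotators avoids maintaining such rules by hand.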
  • 21. BigInsights offers value beyond Open Source
       Key differentiators
       • Built-in analytics
       • Enterprise software integration
       • Spreadsheet-style analysis
       • Integrated installation of supported open source and other components
       • Web Console for admin and application access
       • Platform enrichment: additional security, performance features, . . .
       • World-class support
       • Full open source compatibility
       Business benefits
       • Quicker time-to-value due to IBM technology and support
       • Reduced operational risk
       • Enhanced business knowledge with flexible analytical platform
       • Leverages and complements existing software
       Diagram: Enterprise Capabilities (Visualization & Exploration, Development Tools, Advanced Engines, Connectors, Workload Optimization, Administration & Security) on top of Open source components (IBM-certified Apache Hadoop)
  • 22. InfoSphere BigInsights for Hadoop includes the latest Open Source components, enhanced by enterprise components
       Diagram: IBM InfoSphere BigInsights for Hadoop (legend: Open Source, IBM; * In Beta)
       Layers named in the diagram: Visualization & Ad Hoc Analytics; Applications & Development; Advanced Analytics; Data Access; Runtime; Data Store; File System; Security; Resource Management & Administration; Governance; Search; Stream Computing
       Components named in the diagram: BigSheets, Charting, Dashboard, R, Big R, Text Analytics, Jaql, Eclipse Tooling (MapReduce, Hive, Jaql, Pig, Big SQL, AQL), BigSheets Reader and Macro, Text Analytics Extractors, Flume, Sqoop, HCatalog, Hive, Pig, Big SQL, ETL, MapReduce, Adaptive MapReduce, Flexible Scheduler, YARN*, Oozie, ZooKeeper, HBase, HDFS, GPFS FPO, Solr/Lucene, Enterprise Search, Streams, Kerberos, LDAP, Console, Monitoring, Audit & History, Data Security for Hadoop, Data Privacy for Hadoop, Data Masking, Data Matching
  • 23. From Getting Started to Enterprise Deployment: Different BigInsights Editions For Varying Needs
       Diagram axes: Breadth of capabilities; Enterprise class. Base: Apache Hadoop
       Quick Start
       • Free, non-production
       • Same features as Standard Edition plus text analytics and Big R
       Standard Edition
       – Spreadsheet-style tool
       – Web console
       – Dashboards
       – Pre-built applications
       – Eclipse tooling
       – RDBMS connectivity
       – Big SQL
       – Monitoring and alerts
       – Platform enhancements
       – . . .
       Enterprise Edition
       – Accelerators
       – GPFS – FPO
       – Adaptive MapReduce
       – Text analytics
       – Enterprise Integration
       – Big R
       – InfoSphere Streams*
       – Watson Explorer*
       – Cognos BI*
       – Data Click*
       – . . .
       * Limited use license
  • 24. IBM big data THINK