© 2014 IBM Corporation
Taming Big Data with Big SQL
Session 3477
Berni Schiefer (schiefer@ca.ibm.com)
Bert Van der Linden (robbert@us.ibm.com)
Please Note
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be incorporated
into any contract. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
Agenda
Internet of Things & Big Data
BigInsights
Big SQL 3.0
• Architecture
• Performance
• Best practices
Systems of Insight from Data
Emanating from the Internet of Things (IoT)
How Big is the Internet of Things?
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day

Where is Big Data coming from?
A major gas and electric utility has 10 million meters.
They read the meters once a month.
Now, they're installing smart meters.
The meters are read once an hour.
10 million smart meters read every 15 minutes =
350 billion transactions a year.
The Big Data Conundrum
• Data AVAILABLE to an organization vs. data an organization can PROCESS
• The percentage of available data an enterprise can analyze is decreasing
• This means enterprises are getting “more naive” over time
Big Data is All Data and All Paradigms
• Transactional & Application Data – Volume, Structured, Throughput
• Machine Data – Velocity, Structured, Ingestion
• Social Data – Variety, Unstructured, Veracity
• Enterprise Content – Variety, Unstructured, Volume
Big Data “Landing zone” eco-system
[Diagram: the Big Data landing zone ecosystem, featuring Big SQL]
IBM Big Data Platform
[Diagram: analytic applications (BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics) run on the IBM Big Data Platform, which provides Systems Management, Application Development, Visualization & Discovery, Accelerators, Stream Computing, Data Warehouse, and Information Integration & Governance]
InfoSphere BigInsights builds on open source Hadoop
capabilities for enterprise-class deployments

InfoSphere BigInsights combines open source based components (the Hadoop system) with enterprise capabilities:
• Administration & Security
• Workload Optimization
• Connectors
• Advanced Engines
• Visualization & Exploration
• Development Tools
• Big SQL

Business benefits:
– Quicker time-to-value due to IBM technology and support
– Reduced operational risk
– Enhanced business knowledge with flexible analytical platform
– Leverages and complements existing software
Announcing BigInsights v3.0

Standard Edition (breadth of capabilities):
- Spreadsheet-style tool
- Web console
- Dashboards
- Pre-built applications
- Eclipse tooling
- RDBMS connectivity
- Big SQL 3.0
- Jaql
- Platform enhancements
- . . .

Enterprise Edition (enterprise class) adds:
- Accelerators
- GPFS – FPO
- Adaptive MapReduce
- Text analytics
- Enterprise Integration
- Monitoring and alerts
- Big R
- InfoSphere Streams*
- Watson Explorer*
- Cognos BI*
- . . .

Both editions build on Apache Hadoop.
* Limited use license included
Available for Linux on POWER (Red Hat) and Intel x64 Linux (Red Hat/SUSE)
Common Hadoop core in all Hadoop Distributions (current as of April 27, 2014)

Component   BigInsights 3.0  HortonWorks HDP 2.0  MapR 3.1  Pivotal HD 1.1  Cloudera CDH5
Hadoop      2.2              2.2                  1.0.3     2.0.5 *         2.3
HBase       0.96.0           0.96.0               0.94.13   0.94.8          0.96.1
Hive        0.12.0           0.12                 0.11      0.11.0          0.12.0
Pig         0.12.0           0.12                 0.11.0    0.10.1          0.12.0
Zookeeper   3.4.5            3.4.5                3.4.5     3.4.5           3.4.5
Oozie       4.0.0            4.0.0                3.3.2     3.3.2           4.0.0
Avro        1.7.5            X                    X         X               1.7.5
Flume       1.4.0            1.4.0                1.4.0     1.3.1           1.4.0
Sqoop       1.4.4            1.4.4                1.4.4     1.4.2           1.4.4
What is Big SQL 3.0?
Comprehensive SQL functionality
• IBM SQL/PL support, including…
• Stored procedures (SQL bodied and external)
• Functions (SQL bodied and external)
• IBM Data Server JDBC and ODBC drivers (see the connection sketch below)
Leverages advanced IBM SQL compiler/runtime
• High performance native (C++) runtime
Replaces Map/Reduce
• Advanced message passing runtime
• Data flows between nodes without requiring
intermediate results to be persisted
• Continuous running daemons
• Advanced workload management allows
resources to remain constrained
• Low latency, high throughput…
[Diagram: SQL-based applications connect through the IBM data server client to the Big SQL engine inside InfoSphere BigInsights; the SQL MPP run-time reads data sources such as CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats]
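Because Big SQL ships the IBM Data Server JDBC driver, a standard JDBC client can connect to the head node like it would to any other IBM data server. A minimal sketch, assuming an illustrative host name, the commonly cited default port 51000, and a database named bigsql (none of these are confirmed by this deck):

// Minimal JDBC sketch; host, port, database name, and credentials
// are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlConnect {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:db2://headnode.example.com:51000/bigsql";
        try (Connection con = DriverManager.getConnection(url, "bigsql", "password");
             Statement stmt = con.createStatement();
             // Query the catalog as a simple smoke test
             ResultSet rs = stmt.executeQuery(
                     "SELECT TABNAME FROM SYSCAT.TABLES FETCH FIRST 5 ROWS ONLY")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}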
Big SQL 3.0 – Architecture
Head (coordinator) node
• Listens to the JDBC/ODBC connections
• Compiles and optimizes the query
• Coordinates the execution of the query
Big SQL worker processes reside on compute nodes (some or all)
Worker nodes stream data between each other as needed
Workers can spill large data sets to local disk if needed
• Allows Big SQL to work with data sets larger than available memory
[Diagram: management nodes host the Big SQL head node, the Hive metastore, the NameNode, and the JobTracker; each compute node runs a Big SQL worker alongside a Task Tracker and Data Node, all on GPFS/HDFS]
Big SQL 3.0 works with Hadoop
All data is Hadoop data
• In files in HDFS
• SEQ, RC, delimited, Parquet …
Never need to copy data to a proprietary representation
All data is cataloged in the Hive metastore
• It is the Hadoop catalog
• It is flexible and extensible
All Hadoop data is in a Hadoop filesystem
• HDFS or GPFS-FPO
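Because the Hive metastore is the catalog, exposing files that already sit in HDFS as a Big SQL table is a single DDL statement. A hedged sketch using Hive-compatible syntax; the columns, delimiter, and location are illustrative:

-- Sketch: expose existing delimited HDFS files as a Big SQL table.
-- Column names, delimiter, and path are illustrative.
CREATE HADOOP TABLE sales (
  sale_id   INT,
  store_id  INT,
  amount    DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/bigsql/sales';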
Big SQL 3.0 – Architecture (cont.)
Big SQL's runtime execution engine is all native code
For common table formats a native I/O engine is utilized
• e.g. delimited, RC, SEQ, Parquet, …
For all others, a java I/O engine is used
• Maximizes compatibility with existing tables
• Allows for custom file formats and SerDe's
All Big SQL built-in functions are native code
• Customer built UDx's can be developed in C++ or Java (see the registration sketch below)
• Existing Big SQL UDF's can be used with a slight
change in how they are registered
[Diagram: a Big SQL worker combines the native runtime and native I/O engine with a Java I/O engine (SerDes, input formats) and supports both native and Java UDFs]
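As a sketch of that registration step, a Java scalar UDF is declared with IBM's SQL conventions; the function, jar, class, and method names here are illustrative placeholders:

-- Illustrative: register a Java scalar UDF (IBM SQL dialect).
-- The class and method names are placeholders.
CREATE FUNCTION cleanse(VARCHAR(200))
  RETURNS VARCHAR(200)
  LANGUAGE JAVA
  PARAMETER STYLE JAVA
  NO SQL
  DETERMINISTIC
  EXTERNAL NAME 'com.example.udf.Cleanse.apply';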
Big SQL 3.0 – Enterprise security
Users may be authenticated via
• Operating system
• Lightweight directory access protocol (LDAP)
• Kerberos
User authorization mechanisms include
• Full GRANT/REVOKE based security
• Group and role based hierarchical security
• Object level, column level, or row level (fine-grained) access controls
Auditing
• You may define audit policies and track user activity
Transport layer security (TLS)
• Protect integrity and confidentiality of data between the client and Big SQL
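As a sketch of that surface, privileges are managed with standard GRANT/REVOKE statements and auditing with named policies. The role, user, table, and policy names below are illustrative, and the audit syntax follows the DB2-style convention:

-- Illustrative GRANT/REVOKE and audit policy statements.
-- All object names are placeholders.
CREATE ROLE analysts;
GRANT ROLE analysts TO USER newton;
GRANT SELECT ON sales TO ROLE analysts;
REVOKE SELECT ON sales FROM USER pike;

-- Define and attach an audit policy to track activity on a table.
CREATE AUDIT POLICY sales_audit
  CATEGORIES EXECUTE STATUS BOTH
  ERROR TYPE AUDIT;
AUDIT TABLE sales USING POLICY sales_audit;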
Row Based Access Control - 4 easy steps

1) Create and grant access and roles *
CREATE ROLE BRANCH_A_ROLE
GRANT ROLE BRANCH_A_ROLE TO USER newton
GRANT SELECT ON BRANCH_TBL TO USER newton

2) Create permissions *
CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL
FOR ROWS WHERE(VERIFY_ROLE_FOR_USER(SESSION_USER,'BRANCH_A_ROLE') = 1
AND
BRANCH_TBL.BRANCH_NAME = 'Branch_A')
ENFORCED FOR ALL ACCESS
ENABLE

3) Enable access control *
ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL

4) Select as Branch_A user
CONNECT TO TESTDB USER newton
SELECT * FROM BRANCH_TBL

EMP_NO  FIRST_NAME  BRANCH_NAME
------  ----------  -----------
2       Chris       Branch_A
3       Paula       Branch_A
5       Pete        Branch_A
8       Chrissie    Branch_A

4 record(s) selected.

Data (full table contents):

EMP_NO  FIRST_NAME  BRANCH_NAME
------  ----------  -----------
1       Steve       Branch_B
2       Chris       Branch_A
3       Paula       Branch_A
4       Craig       Branch_B
5       Pete        Branch_A
6       Stephanie   Branch_B
7       Julie       Branch_B
8       Chrissie    Branch_A

* Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
Column Based Access Control

1) Create and grant access and roles *
CREATE ROLE MANAGER
CREATE ROLE EMPLOYEE
GRANT SELECT ON SAL_TBL TO USER socrates
GRANT SELECT ON SAL_TBL TO USER newton
GRANT ROLE MANAGER TO USER socrates
GRANT ROLE EMPLOYEE TO USER newton

2) Create permissions *
CREATE MASK SALARY_MASK ON SAL_TBL FOR
COLUMN SALARY RETURN
CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER,'MANAGER') = 1
THEN SALARY
ELSE 0.00
END
ENABLE

3) Enable access control *
ALTER TABLE SAL_TBL ACTIVATE COLUMN ACCESS CONTROL

4a) Select as an EMPLOYEE
CONNECT TO TESTDB USER newton
SELECT * FROM SAL_TBL

EMP_NO  FIRST_NAME  SALARY
------  ----------  -------
1       Steve       0
2       Chris       0
3       Paula       0

3 record(s) selected.

4b) Select as a MANAGER
CONNECT TO TESTDB USER socrates
SELECT * FROM SAL_TBL

EMP_NO  FIRST_NAME  SALARY
------  ----------  -------
1       Steve       250000
2       Chris       200000
3       Paula       1000000

3 record(s) selected.

Data (unmasked table contents):

EMP_NO  FIRST_NAME  SALARY
------  ----------  -------
1       Steve       250000
2       Chris       200000
3       Paula       1000000

* Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
Big SQL 3.0 – Other enterprise features
Federation
• Join between your Hadoop data and other external relational platforms
• Optimizer determines most efficient execution path
Open integration across Business Analytic Tools
• IBM Optim Data Studio performance tool portfolio
• Superior enablement for IBM Software – e.g. Cognos
• Enhanced support by 3rd party software – e.g. Microstrategy
Mixed workload cluster management
• Capacity sharing with the rest of the cluster
– Specify %cpu and %memory to dedicate to Big SQL 3.0
• SQL based workload management
• Integration with Platform Symphony to manage mixed cluster workloads
Support for standard development tools
Workload Management

(1) Create service classes
create service class BIGDATAWORK
create service class HIGHPRIWORK under BIGDATAWORK
create service class LOWPRIWORK under BIGDATAWORK

(2) Identify workloads and associate them to service classes
create workload SALES_WL CURRENT
client_appname('SalesSys') service class HIGHPRIWORK
create workload ITEMCOUNT_WL CURRENT
client_appname('InventorySys') service class LOWPRIWORK

(3a) Avoid thrashing by queueing low priority work
create threshold LOW_CONCURRENT for service class
LOWPRIWORK under BIGDATAWORK activities enforcement
database enable when concurrentdbcoordactivities > 5 and
queued activities unbounded continue

(3b) Stop high priority jobs if the SLA cannot be met
create threshold HIGH_CONCURRENT for service class
HIGHPRIWORK under BIGDATAWORK activities
enforcement database enable when concurrentdbcoordactivities
> 30 and queued activities > 0 stop execution

(4a) Stop very long running jobs
create threshold LOWPRI_WL_TIMEOUT for service
class LOWPRIWORK under BIGDATAWORK activities
enforcement database enable when activitytotaltime >
30 minutes stop execution

(4b) Stop jobs that return too many rows
create threshold TOO_MANY_ROWS_RETURNED for
service class HIGHPRIWORK under BIGDATAWORK
enforcement database when sqlrowsreturned > 30 stop
execution

(5) Collect data for long running jobs
create threshold LONGRUNINVENTORYACTIVITIES
for service class LOWPRIWORK activities enforcement
database when activitytotaltime > 15 minutes collect
activity data with details continue

(6) Reporting system activity
create event monitor BIGDATAMONACT for
activities write to table
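Once the event monitor from step (6) is activated, the collected activity data can be examined with ordinary SQL. A hedged sketch: DB2-style activity event monitors write to tables whose names derive from the monitor name, and the exact table and column names below are illustrative and may differ:

-- Illustrative: activate the monitor, then inspect recent activities.
-- Table and column names are assumptions based on DB2 conventions.
SET EVENT MONITOR BIGDATAMONACT STATE 1;

SELECT appl_id, activity_id, time_started, time_completed
FROM ACTIVITY_BIGDATAMONACT
ORDER BY time_completed DESC
FETCH FIRST 10 ROWS ONLY;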
Using existing standard SQL tools: Eclipse
• Use existing SQL tooling against Big Data
• Same setup as for existing SQL sources
• Support for "standard" authentication
Using existing standard SQL tools: SQuirreL SQL
• Use existing SQL tooling against Big Data
• Support for authentication (not supported for Hive,
BUT supported by Big SQL!)
Using BigSheets in BigInsights: data discovery
• Discovery and analytics in a spreadsheet-like environment
Big SQL 3.0 – Performance
Query rewrites
• Exhaustive query rewrite capabilities
• Leverages additional metadata such as constraints and nullability
Optimization
• Statistics and heuristic driven query optimization
• Query optimizer based upon decades of IBM RDBMS experience
Tools and metrics
• Highly detailed explain plans and query diagnostic tools
• Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD),
       AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE
  PERIOD.PERKEY = DAILY_SALES.PERKEY AND
  PRODUCT.PRODKEY = DAILY_SALES.PRODKEY AND
  STORE.STOREKEY = DAILY_SALES.STOREKEY AND
  CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND
  STORE_NUMBER = '03' AND
  CATEGORY = 72
GROUP BY ITEM_DESC
Query transformation: ~150 query transformations
Access plan generation: hundreds or thousands of access plan options
[Diagram: the optimizer weighs alternative join orders and join methods (NLJOIN, HSJOIN, ZZJOIN over Period, Daily Sales, Product, Store) and emits a multi-threaded access section, e.g. threads scanning DS_IX7, PER_IX2, and ST_IX1 feeding partial and complete aggregation]
Statistics are key to performance
Table statistics:
• Cardinality (count)
• Number of Files
• Total File Size
Column statistics (this applies to column group stats also):
• Minimum value
• Maximum value
• Cardinality (non-nulls)
• Distribution (Number of Distinct Values)
• Number of null values
• Average Length of the column value (for string columns)
• Histogram
• Most Frequent Values (MFV)
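In Big SQL these statistics are gathered with the ANALYZE TABLE command; a short sketch (the schema, table, and column names are illustrative):

-- Collect table and column statistics for the cost-based optimizer.
-- Names are illustrative.
ANALYZE TABLE myschema.daily_sales
  COMPUTE STATISTICS FOR COLUMNS perkey, prodkey, storekey;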
Performance, Benchmarking, Benchmarketing
Performance matters to customers
Benchmarking appeals to Engineers to drive product innovation
Benchmarketing used to convey performance in a memorable
and appealing way
SQL over Hadoop is in the “Wild West” of Benchmarketing
• 100x claims! Compared to what? Conforming to what rules?
The TPC (Transaction Processing Performance Council) is the
grand-daddy of all multi-vendor SQL-oriented organizations
• Formed in August, 1988
• TPC-H and TPC-DS are the most relevant to SQL over Hadoop
– R/W nature of workload not suitable for HDFS
Big Data Benchmarking Community (BDBC) formed
Power and Performance of Standard SQL
Everyone loves performance numbers, but that's not the whole story
• How much work do you have to do to achieve those numbers?
A portion of our internal performance numbers are based upon read-only
versions of TPC benchmarks
Big SQL is capable of executing
• All 22 TPC-H queries without modification
• All 99 TPC-DS queries without modification
Original Query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT *
    FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT *
    FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name

Re-written for Hive:

SELECT s_name, count(1) AS numwait
FROM
 (SELECT s_name FROM
   (SELECT s_name, t2.l_orderkey, l_suppkey,
           count_suppkey, max_suppkey
    FROM
      (SELECT l_orderkey,
              count(distinct l_suppkey) as count_suppkey,
              max(l_suppkey) as max_suppkey
       FROM lineitem
       WHERE l_receiptdate > l_commitdate
       GROUP BY l_orderkey) t2
    RIGHT OUTER JOIN
      (SELECT s_name, l_orderkey, l_suppkey
       FROM
         (SELECT s_name, t1.l_orderkey, l_suppkey,
                 count_suppkey, max_suppkey
          FROM
            (SELECT l_orderkey,
                    count(distinct l_suppkey) as count_suppkey,
                    max(l_suppkey) as max_suppkey
             FROM lineitem
             GROUP BY l_orderkey) t1
          JOIN
            (SELECT s_name, l_orderkey, l_suppkey
             FROM orders o
             JOIN
               (SELECT s_name, l_orderkey, l_suppkey
                FROM nation n
                JOIN supplier s
                  ON s.s_nationkey = n.n_nationkey
                 AND n.n_name = 'INDONESIA'
                JOIN lineitem l
                  ON s.s_suppkey = l.l_suppkey
                WHERE l.l_receiptdate > l.l_commitdate) l1
             ON o.o_orderkey = l1.l_orderkey
            AND o.o_orderstatus = 'F') l2
          ON l2.l_orderkey = t1.l_orderkey) a
       WHERE (count_suppkey > 1) or ((count_suppkey=1)
         AND (l_suppkey <> max_suppkey))) l3
    ON l3.l_orderkey = t2.l_orderkey) b
  WHERE (count_suppkey is null)
    OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
Comparing Big SQL and Hive 0.12 for Ad-Hoc Queries
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic
BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H
Benchmark Standard, running at 1TB scale factor. It is materially equivalent with the exception that no update functions are
performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3.
Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in
a production environment. Results as of April 22, 2014.

Big SQL is up to 41x faster than Hive 0.12.
Comparing Big SQL and Hive 0.12
for Decision Support Queries
* Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI
Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark
Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of
99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly
available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results
may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production
environment. Results as of April 22, 2014.

Big SQL is 10x faster than Hive 0.12 (total elapsed time).
How many times faster is Big SQL than Hive 0.12?

* Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI
Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark
Standard, running at 1TB scale factor. It is materially equivalent with the exception that no updates are performed, and only 43 out of
99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly
available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC).
Configuration: Cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results
may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production
environment. Results as of April 22, 2014.

Queries sorted by speedup ratio (worst to best): maximum speedup of 74x, average speedup of 20x.
Big SQL 3.0 Best Practices
Ensure you have a homogeneous and balanced cluster
• Utilize IBM reference architecture
Choose an optimized file format (if possible)
• ORC or Parquet
Choose appropriate data types
• Use the smallest and most precise datatype available
Define informational constraints
• Primary key, foreign key, check constraints
Ensure you have good statistics
• Current and comprehensive
Use the full power of SQL available to you
• Don’t constrain yourself to Hive syntax/capability
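Most of these recommendations can be expressed directly in the table DDL. A hedged sketch combining Parquet storage, precise types, and an informational primary key (the table, columns, and types are illustrative):

-- Illustrative DDL applying the best practices above:
-- optimized file format, precise types, informational constraint.
CREATE HADOOP TABLE daily_sales (
  perkey    INT NOT NULL,
  prodkey   INT NOT NULL,
  storekey  INT NOT NULL,
  quantity  SMALLINT,
  price     DECIMAL(9,2),
  PRIMARY KEY (perkey, prodkey, storekey) NOT ENFORCED
)
STORED AS PARQUETFILE;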
BigInsights Big SQL 3.0: Summary
Big SQL provides rich, robust, standards-based SQL support for data
stored in BigInsights
• Uses IBM common client ODBC/JDBC drivers
Big SQL fully integrates with SQL applications and tools
• Existing queries run with no or few modifications*
• Existing JDBC and ODBC compliant tools can be leveraged
Big SQL provides faster and more reliable performance
• Big SQL uses more efficient access paths to the data
• Queries processed by Big SQL no longer need to use MapReduce
• Big SQL is optimized to more efficiently move data over the network
Big SQL provides enterprise-grade data management
• Security, auditing, workload management …
Questions?
We Value Your Feedback
Don’t forget to submit your Impact session and speaker
feedback! Your feedback is very important to us – we use it to
continually improve the conference.
Use the Conference Mobile App or the online Agenda Builder to
quickly submit your survey
• Navigate to “Surveys” to see a view of surveys for sessions
you’ve attended
Thank You
Legal Disclaimer
• © IBM Corporation 2014. All Rights Reserved.
• The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained
in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are
subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing
contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and
conditions of the applicable license agreement governing the use of IBM software.
• References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or
capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to
future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by
you will result in any specific sales, revenue growth or other results.
• If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete:
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage
configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
• If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete:
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs
and performance characteristics may vary by customer.
• Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM
Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server).
Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your
presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in
your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International
Business Machines Corporation in the United States, other countries, or both. Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both.
• If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete:
Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries.
• If you reference Java™ in the text, please mark the first use and include the following; otherwise delete:
Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
• If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete:
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.
• If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete:
Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and
other countries.
• If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete:
UNIX is a registered trademark of The Open Group in the United States and other countries.
• If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete:
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of
others.
• If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta
Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration
purposes only.
Background:
What is Hadoop?
What is Hadoop?
Hadoop is not a single piece of software; you can't install "hadoop"
It is an ecosystem of software components that work together
• Hadoop Core (APIs)
• HDFS (File system)
• MapReduce (Data processing framework)
• Hive (SQL access)
• HBase (NoSQL database)
• Sqoop (Data movement)
• Oozie (Job workflow)
• … There is a LOT of "Hadoop" software
However, there is one common component they all build on: HDFS…
HDFS configuration (shared-nothing cluster)
[Diagram: one NameNode (NN) and many DataNodes (DN), each DataNode with its own local disks, forming a shared-nothing cluster]
NN = NameNode, which manages all the metadata
DN = DataNode, which reads/writes the file data
HDFS
Driving principles
• Files are stored across the entire cluster
• Programs are brought to the data, not the data to the program
Distributed file system (DFS) stores blocks across the whole cluster
• Blocks of a single file are distributed across the cluster
• A given block is typically replicated for resiliency
• Just like a regular file system, the contents of a file are up to the application
[Diagram: a logical file divided into blocks 1-4; the blocks are spread across the cluster, with each block replicated on several nodes]
Hadoop I/O
Hadoop (HDFS) doesn't dictate file content/structure
• It is just a filesystem!
• It provides standard API's to list directories, open files, delete files, etc.
• In particular it allows your task to ask "where does each block live?"
Hadoop provides a framework for creating "splittable" data sources
• A data source is typically file(s), but not necessarily
• A large input is "split" into pieces, each piece to be processed in parallel
• Each split indicates the host(s) on which that split can be found
• For files, a split typically refers to an HDFS block, but not necessarily
[Diagram: a logical file carved into splits 1-3; each split is processed in parallel by an application instance running on the cluster, and the results are combined]
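That "where does each block live?" question maps directly onto the standard FileSystem API. A minimal sketch; the file path is illustrative:

// List the hosts holding each block of an HDFS file.
// The path is illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/data.txt"));
        // One BlockLocation per block, with the DataNode hosts holding replicas
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + loc.getOffset()
                    + " length " + loc.getLength()
                    + " hosts " + String.join(",", loc.getHosts()));
        }
    }
}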
InputFormat
This splitting process is encapsulated in the InputFormat interface
• Hadoop has a large library of InputFormat's for various purposes
• You can create and provide your own as well
An InputFormat does the following
• Configured with a set of name/value pair properties
• When configured you can ask it for a list of InputSplit's
– Each input split has…
– A list of hosts on which the data for the split is recommended to be processed (optional)
– A size in bytes (optional)
• Given an InputSplit, an InputFormat can produce a RecordReader
A RecordReader does the following
• Acts as an input stream to read the contents of the split
• Produces a stream of records
• There is no fixed definition of a record – it depends upon the input type
Let's look at an example of an InputFormat…
InputFormat example - TextInputFormat
Purpose
• Reads input file(s) line by line, each read produces one line of text
Configuration
• Configured with the names of one or more (HDFS) files to process
Splits
• Each split it produces represents a single HDFS block of a file
RecordReader
• When opened, finds the first newline of the block it is to read
• Each read produces the next available line of text in the block
• May read into the next block of text to ensure the last line is fully read
– Even if the block is physically located on another host!!
[Diagram: a logical text file, its splits, and one reader per split producing records (lines of text)]
Hadoop MapReduce
MapReduce is a way of writing parallel processing programs
Built around InputFormat's (and OutputFormat's)
Programs are written in two pieces: Map and Reduce
Programs are submitted to the MapReduce job scheduler: JobTracker
• The JobTracker asks for the InputFormat splits
• For each split, tries to schedule the processing on a host on which the split lives
• Hosts are chosen based upon available processing resources
Program is shipped to a host and given a split to process
Output of the program is written back to HDFS
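To make the two pieces concrete, here is the canonical word-count job: TextInputFormat hands each mapper lines of text, the mapper emits <word, 1> pairs, and the reducer sums the counts for each word. The input and output paths are illustrative:

// Canonical word-count MapReduce job; paths are illustrative.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: key = byte offset in the split, value = one line (TextInputFormat)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit <word, 1>
            }
        }
    }

    // Reduce: all counts for one word arrive together after the shuffle
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(TextInputFormat.class);  // one split per HDFS block
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}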
MapReduce - Mappers
Mappers
• Small program (typically), distributed across the cluster, local to data
• Handed a portion of the input data (called a split)
• Each mapper parses, filters, and/or transforms its input
• Produces grouped <key,value> pairs
[Diagram: Map phase - each mapper reads one split of the logical input file and sorts its grouped <key,value> output locally; reducers then copy and merge that output and write logical output files to the DFS]
MapReduce – The Shuffle
The shuffle is transparently orchestrated by MapReduce
The output of each mapper is locally grouped together by key
One node is chosen to process data for each unique key
[Diagram: Shuffle - the locally sorted output of each mapper is copied to the reducers, one node per unique key, and merged before the reduce phase]
MapReduce – Reduce Phase
Reducers
• Small programs (typically) that aggregate all of the values for the key
that they are responsible for
• Each reducer writes output to its own file
[Diagram: Reduce phase - each reducer merges the values for its keys and writes its own output file back to the DFS]
Joins in MapReduce
Hadoop is used to group data together at the same reducer based upon the join
key
• Mappers read blocks from each “table” in the join
• The <key> is the value of the join key, the <value> is the record to be joined
• Reducer receives a mix of records from each table with the same join key
• Reducers produce the results of the join
[Diagram: mappers read blocks of the employees and depts tables and emit records keyed by dept_id; each reducer receives all records for its dept_id (dept 1, dept 2, dept 3) and produces the joined rows]

select e.fname, e.lname, d.dept_name
from employees e, depts d
where e.salary > 30000
and d.dept_id = e.dept_id
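A hedged sketch of how such a reduce-side join can be hand-coded: each mapper tags its records with their source table before emitting them keyed by dept_id, and the reducer pairs them back up. The comma-delimited record layouts, tag values, and paths are illustrative assumptions:

// Sketch of a reduce-side join on dept_id; record layouts are illustrative.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {
    // employees file: emp_id,fname,lname,salary,dept_id (comma-delimited)
    public static class EmployeeMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            if (Double.parseDouble(f[3]) > 30000) {        // apply e.salary > 30000 early
                ctx.write(new Text(f[4]), new Text("E," + f[1] + "," + f[2]));
            }
        }
    }

    // depts file: dept_id,dept_name
    public static class DeptMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split(",");
            ctx.write(new Text(f[0]), new Text("D," + f[1]));
        }
    }

    // All records for one dept_id arrive at the same reducer; join them here.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String deptName = null;
            List<String> emps = new ArrayList<>();       // buffered; order is not guaranteed
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("D,")) deptName = s.substring(2);
                else emps.add(s.substring(2));           // "fname,lname"
            }
            if (deptName != null) {
                for (String e : emps) ctx.write(new Text(e), new Text(deptName));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        // Bind each mapper to its own input path
        MultipleInputs.addInputPath(job, new Path("/user/demo/employees"),
                TextInputFormat.class, EmployeeMapper.class);
        MultipleInputs.addInputPath(job, new Path("/user/demo/depts"),
                TextInputFormat.class, DeptMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/join-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}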
Joins in MapReduce (cont.)
For N-way joins involving different join keys, multiple jobs are used
select e.fname, e.lname, d.dept_name, p.phone_type, p.phone_number
from employees e, depts d, emp_phones p
where e.salary > 30000
and d.dept_id = e.dept_id
and p.emp_id = e.emp_id

[Diagram: job 1 joins employees and depts on dept_id and writes temp files; job 2 joins that intermediate result with emp_phones on emp_id to produce the final results]
60 minutes in the cloud: Predictive analytics made easyNicolas Morales
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineNicolas Morales
 
SQL-on-Hadoop without compromise: Big SQL 3.0
SQL-on-Hadoop without compromise: Big SQL 3.0SQL-on-Hadoop without compromise: Big SQL 3.0
SQL-on-Hadoop without compromise: Big SQL 3.0Nicolas Morales
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesNicolas Morales
 
Security and Audit for Big Data
Security and Audit for Big DataSecurity and Audit for Big Data
Security and Audit for Big DataNicolas Morales
 

Mehr von Nicolas Morales (10)

Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?
Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?
Benchmarking SQL-on-Hadoop Systems: TPC or not TPC?
 
InfoSphere BigInsights for Hadoop @ IBM Insight 2014
InfoSphere BigInsights for Hadoop @ IBM Insight 2014InfoSphere BigInsights for Hadoop @ IBM Insight 2014
InfoSphere BigInsights for Hadoop @ IBM Insight 2014
 
IBM Big SQL @ Insight 2014
IBM Big SQL @ Insight 2014IBM Big SQL @ Insight 2014
IBM Big SQL @ Insight 2014
 
60 minutes in the cloud: Predictive analytics made easy
60 minutes in the cloud: Predictive analytics made easy60 minutes in the cloud: Predictive analytics made easy
60 minutes in the cloud: Predictive analytics made easy
 
Challenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop EngineChallenges of Building a First Class SQL-on-Hadoop Engine
Challenges of Building a First Class SQL-on-Hadoop Engine
 
SQL-on-Hadoop without compromise: Big SQL 3.0
SQL-on-Hadoop without compromise: Big SQL 3.0SQL-on-Hadoop without compromise: Big SQL 3.0
SQL-on-Hadoop without compromise: Big SQL 3.0
 
Text Analytics
Text Analytics Text Analytics
Text Analytics
 
Social Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data TechnologiesSocial Data Analytics using IBM Big Data Technologies
Social Data Analytics using IBM Big Data Technologies
 
Security and Audit for Big Data
Security and Audit for Big DataSecurity and Audit for Big Data
Security and Audit for Big Data
 
Machine Data Analytics
Machine Data AnalyticsMachine Data Analytics
Machine Data Analytics
 

Kürzlich hochgeladen

DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfCionsystems
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 

Kürzlich hochgeladen (20)

DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Active Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdfActive Directory Penetration Testing, cionsystems.com.pdf
Active Directory Penetration Testing, cionsystems.com.pdf
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 

Taming Big Data with Big SQL 3.0

  • 13. © 2013 IBM Corporation. Announcing BigInsights v3.0 (editions positioned by breadth of capabilities and enterprise class).
    Standard Edition: spreadsheet-style tool, web console, dashboards, pre-built applications, Eclipse tooling, RDBMS connectivity, Big SQL 3.0, Jaql, platform enhancements, and more.
    Enterprise Edition adds: Accelerators, GPFS-FPO, Adaptive MapReduce, text analytics, enterprise integration, monitoring and alerts, Big R, InfoSphere Streams*, Watson Explorer*, Cognos BI*, and more.
    Built on Apache Hadoop. * Limited-use license included.
    Available for Linux on POWER (Red Hat) and Intel x64 Linux (Red Hat/SUSE).
  • 14. Common Hadoop core in all Hadoop distributions (current as of April 27, 2014):
    Component   BigInsights 3.0   HortonWorks HDP 2.0   MapR 3.1   Pivotal HD 1.1   Cloudera CDH5
    Hadoop      2.2               2.2                   1.0.3      2.0.5 *          2.3
    HBase       0.96.0            0.96.0                0.94.13    0.94.8           0.96.1
    Hive        0.12.0            0.12                  0.11       0.11.0           0.12.0
    Pig         0.12.0            0.12                  0.11.0     0.10.1           0.12.0
    Zookeeper   3.4.5             3.4.5                 3.4.5      3.4.5            3.4.5
    Oozie       4.0.0             4.0.0                 3.3.2      3.3.2            4.0.0
    Avro        1.7.5             X                     X          X                1.7.5
    Flume       1.4.0             1.4.0                 1.4.0      1.3.1            1.4.0
    Sqoop       1.4.4             1.4.4                 1.4.4      1.4.2            1.4.4
  • 15. What is Big SQL 3.0?
    Comprehensive SQL functionality: IBM SQL/PL support, including stored procedures (SQL-bodied and external) and functions (SQL-bodied and external); IBM Data Server JDBC and ODBC drivers.
    Leverages the advanced IBM SQL compiler/runtime, with a high-performance native (C++) runtime.
    Replaces MapReduce with an advanced message-passing runtime: data flows between nodes without requiring persisting intermediate results; continuously running daemons; advanced workload management keeps resources constrained; low latency, high throughput.
    [Diagram: SQL-based application connects through the IBM data server client to the Big SQL engine (SQL MPP runtime) in InfoSphere BigInsights, reading data sources such as CSV, Seq, Parquet, RC, ORC, Avro, JSON, and custom formats.]
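    As an illustration of the common-client access just described, a minimal JDBC sketch follows. The host name, port, database name, and credentials are placeholders (Big SQL 3.0 installations typically expose a DB2-protocol listener, often on port 51000, but verify your configuration), and db2jcc4.jar must be on the classpath:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class BigSqlConnect {
          public static void main(String[] args) throws Exception {
              // Placeholder host/port/database/credentials; the IBM Data Server
              // (DB2) JDBC driver is auto-loaded from db2jcc4.jar under JDBC 4.
              String url = "jdbc:db2://head.example.com:51000/bigsql";
              try (Connection con = DriverManager.getConnection(url, "bigsql", "password");
                   Statement stmt = con.createStatement();
                   // SYSCAT catalog views are available because Big SQL is DB2-based.
                   ResultSet rs = stmt.executeQuery(
                       "SELECT TABSCHEMA, TABNAME FROM SYSCAT.TABLES FETCH FIRST 5 ROWS ONLY")) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1) + "." + rs.getString(2));
                  }
              }
          }
      }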
  • 16. Big SQL 3.0 – Architecture
    Head (coordinator) node: listens for the JDBC/ODBC connections, compiles and optimizes the query, and coordinates its execution.
    Big SQL worker processes reside on some or all compute nodes, alongside the Task Tracker and Data Node.
    Worker nodes stream data between each other as needed, and can spill large data sets to local disk, allowing Big SQL to work with data sets larger than available memory.
    [Diagram: management nodes host Big SQL (head), the Hive metastore, the Name Node, and the Job Tracker; compute nodes each run a Big SQL worker, Task Tracker, and Data Node over GPFS/HDFS.]
  • 17. Big SQL 3.0 works with Hadoop
    All data is Hadoop data, in files in HDFS (SEQ, RC, delimited, Parquet, ...); you never need to copy data into a proprietary representation.
    All data is cataloged in the Hive metastore: it is the Hadoop catalog, and it is flexible and extensible.
    All Hadoop data lives in a Hadoop filesystem: HDFS or GPFS-FPO.
  • 18. Big SQL 3.0 – Architecture (cont.)
    Big SQL's runtime execution engine is all native code.
    For common table formats (delimited, RC, SEQ, Parquet, ...) a native I/O engine is used; for all others, a Java I/O engine maximizes compatibility with existing tables and allows custom file formats and SerDes.
    All Big SQL built-in functions are native code.
    Customer-built UDxs can be developed in C++ or Java; existing Big SQL UDFs can be used with a slight change in how they are registered.
    [Diagram: each Big SQL worker combines the runtime with a native I/O engine, a Java I/O engine (SerDe, input format), and both native and Java UDFs.]
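    As a sketch of the Java UDF path mentioned above: a scalar Java UDF is an ordinary public static method. The class, method, and function names here are invented for illustration, and the CREATE FUNCTION text in the comment follows general DB2 external-function conventions; it is an assumption, not the product's verbatim registration syntax, so check the Big SQL documentation:

      // A scalar UDF: a public static method; SQL types map to Java types.
      public class StringUdfs {
          // Returns the domain portion of an e-mail address, or null for null input.
          public static String emailDomain(String email) {
              if (email == null) return null;
              int at = email.indexOf('@');
              return (at < 0) ? null : email.substring(at + 1);
          }
      }

      /* Hedged registration sketch, following DB2 external-function conventions:

         CREATE FUNCTION EMAIL_DOMAIN(VARCHAR(255))
           RETURNS VARCHAR(255)
           LANGUAGE JAVA
           PARAMETER STYLE JAVA
           NO SQL DETERMINISTIC
           EXTERNAL NAME 'StringUdfs.emailDomain'
      */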
  • 19. Big SQL 3.0 – Enterprise security
    Users may be authenticated via the operating system, Lightweight Directory Access Protocol (LDAP), or Kerberos.
    User authorization mechanisms include full GRANT/REVOKE-based security, group- and role-based hierarchical security, and object-level, column-level, or row-level (fine-grained) access controls.
    Auditing: you may define audit policies and track user activity.
    Transport layer security (TLS) protects the integrity and confidentiality of data between the client and Big SQL.
  • 20. Row Based Access Control – 4 easy steps
    Data (all rows, before row access control is activated):
      SELECT * FROM BRANCH_TBL
      EMP_NO  FIRST_NAME  BRANCH_NAME
      1  Steve      Branch_B
      2  Chris      Branch_A
      3  Paula      Branch_A
      4  Craig      Branch_B
      5  Pete       Branch_A
      6  Stephanie  Branch_B
      7  Julie      Branch_B
      8  Chrissie   Branch_A
    1) Create and grant access and roles *
      CREATE ROLE BRANCH_A_ROLE
      GRANT ROLE BRANCH_A_ROLE TO USER newton
      GRANT SELECT ON BRANCH_TBL TO USER newton
    2) Create permissions *
      CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL
        FOR ROWS WHERE (VERIFY_ROLE_FOR_USER(SESSION_USER,'BRANCH_A_ROLE') = 1
                        AND BRANCH_TBL.BRANCH_NAME = 'Branch_A')
        ENFORCED FOR ALL ACCESS
        ENABLE
    3) Enable access control *
      ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL
    4) Select as the Branch_A user:
      CONNECT TO TESTDB USER newton
      SELECT * FROM BRANCH_TBL
      EMP_NO  FIRST_NAME  BRANCH_NAME
      2  Chris     Branch_A
      3  Paula     Branch_A
      5  Pete      Branch_A
      8  Chrissie  Branch_A
      4 record(s) selected.
    * Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
  • 21. Column Based Access Control
    Data (no masking):
      SELECT * FROM SAL_TBL
      EMP_NO  FIRST_NAME  SALARY
      1  Steve    250000
      2  Chris    200000
      3  Paula   1000000
    1) Create and grant access and roles *
      CREATE ROLE MANAGER
      CREATE ROLE EMPLOYEE
      GRANT SELECT ON SAL_TBL TO USER socrates
      GRANT SELECT ON SAL_TBL TO USER newton
      GRANT ROLE MANAGER TO USER socrates
      GRANT ROLE EMPLOYEE TO USER newton
    2) Create permissions *
      CREATE MASK SALARY_MASK ON SAL_TBL FOR COLUMN SALARY RETURN
        CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER,'MANAGER') = 1
             THEN SALARY
             ELSE 0.00
        END
        ENABLE
    3) Enable access control *
      ALTER TABLE SAL_TBL ACTIVATE COLUMN ACCESS CONTROL
    4a) Select as an EMPLOYEE:
      CONNECT TO TESTDB USER newton
      SELECT * FROM SAL_TBL
      EMP_NO  FIRST_NAME  SALARY
      1  Steve  0
      2  Chris  0
      3  Paula  0
      3 record(s) selected.
    4b) Select as a MANAGER:
      CONNECT TO TESTDB USER socrates
      SELECT * FROM SAL_TBL
      EMP_NO  FIRST_NAME  SALARY
      1  Steve    250000
      2  Chris    200000
      3  Paula   1000000
      3 record(s) selected.
    * Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
  • 22. Big SQL 3.0 – Other enterprise features
    Federation: join your Hadoop data with other external relational platforms; the optimizer determines the most efficient execution path.
    Open integration across business analytic tools: IBM Optim Data Studio performance tool portfolio; superior enablement for IBM software (e.g. Cognos); enhanced support by 3rd-party software (e.g. MicroStrategy).
    Mixed-workload cluster management: capacity sharing with the rest of the cluster (specify the %cpu and %memory to dedicate to Big SQL 3.0); SQL-based workload management; integration with Platform Symphony to manage mixed cluster workloads.
    Support for standard development tools.
  • 23. Workload Management
    (1) Create service classes:
      create service class BIGDATAWORK
      create service class HIGHPRIWORK under BIGDATAWORK
      create service class LOWPRIWORK under BIGDATAWORK
    (2) Identify workloads and associate them to service classes:
      create workload SALES_WL current client_appname('SalesSys') service class HIGHPRIWORK
      create workload ITEMCOUNT_WL current client_appname('InventorySys') service class LOWPRIWORK
    (3a) Avoid thrashing by queueing low-priority work:
      create threshold LOW_CONCURRENT for service class LOWPRIWORK under BIGDATAWORK
        activities enforcement database enable
        when concurrentdbcoordactivities > 5 and queuedactivities unbounded continue
    (3b) Stop a high-priority job if the SLA cannot be met:
      create threshold HIGH_CONCURRENT for service class HIGHPRIWORK under BIGDATAWORK
        activities enforcement database enable
        when concurrentdbcoordactivities > 30 and queuedactivities > 0 stop execution
    (4a) Stop very long running jobs:
      create threshold LOWPRI_WL_TIMEOUT for service class LOWPRIWORK under BIGDATAWORK
        activities enforcement database enable
        when activitytotaltime > 30 minutes stop execution
    (4b) Stop jobs that return too many rows:
      create threshold TOO_MANY_ROWS_RETURNED for service class HIGHPRIWORK under BIGDATAWORK
        enforcement database
        when sqlrowsreturned > 30 stop execution
    (5) Collect data for long running jobs:
      create threshold LONGRUNINVENTORYACTIVITIES for service class LOWPRIWORK
        activities enforcement database
        when activitytotaltime > 15 minutes collect activity data with details continue
    (6) Report on system activity:
      create event monitor BIGDATAMONACT for activities write to table
  • 24. Using existing standard SQL tools: Eclipse. Use existing SQL tooling against big data with the same setup as for existing SQL sources, including support for "standard" authentication!
  • 25. Using existing standard SQL tools: SQuirreL SQL. Use existing SQL tooling against big data, with support for authentication (not supported for Hive, BUT supported by Big SQL!).
  • 26. Using BigSheets in BigInsights: data discovery. Discovery and analytics in a spreadsheet-like environment.
  • 27. Big SQL 3.0 – Performance
    Query rewrites: exhaustive query rewrite capabilities (~150 query transformations); leverages additional metadata such as constraints and nullability.
    Optimization: statistics- and heuristic-driven query optimization, with a query optimizer based upon decades of IBM RDBMS experience evaluating hundreds or thousands of access plan options.
    Tools and metrics: highly detailed explain plans and query diagnostic tools; an extensive number of available performance metrics.
    Example query:
      SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
      FROM PERIOD, DAILY_SALES, PRODUCT, STORE
      WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
        AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
        AND STORE.STOREKEY = DAILY_SALES.STOREKEY
        AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
        AND STORE_NUMBER = '03'
        AND CATEGORY = 72
      GROUP BY ITEM_DESC
    [Diagram: query transformation feeding access plan generation (candidate join orders over PERIOD, DAILY_SALES, PRODUCT, STORE), ending in a parallel access section executed across threads.]
  • 28. Statistics are key to performance
    Table statistics: cardinality (count), number of files, total file size.
    Column statistics (this applies to column-group statistics also): minimum value, maximum value, cardinality (non-nulls), distribution (number of distinct values), number of null values, average length of the column value (for string columns), histogram, most frequent values (MFV).
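    A hedged sketch of collecting these statistics from an application over JDBC. The ANALYZE statement shape below is an assumption based on Big SQL's statistics collection support and may differ by release (consult the BigInsights documentation); the connection details, schema, table, and column names are placeholders:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class CollectStats {
          public static void main(String[] args) throws Exception {
              // Placeholder URL/credentials; see the earlier connection sketch.
              try (Connection con = DriverManager.getConnection(
                       "jdbc:db2://head.example.com:51000/bigsql", "bigsql", "password");
                   Statement stmt = con.createStatement()) {
                  // Gather table- and column-level statistics for the optimizer.
                  stmt.execute("ANALYZE TABLE myschema.daily_sales "
                             + "COMPUTE STATISTICS FOR COLUMNS perkey, prodkey, storekey");
              }
          }
      }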
  • 29. Performance, Benchmarking, Benchmarketing
    Performance matters to customers. Benchmarking appeals to engineers and drives product innovation, while benchmarketing is used to convey performance in a memorable and appealing way.
    SQL over Hadoop is in the "Wild West" of benchmarketing: 100x claims! Compared to what? Conforming to what rules?
    The TPC (Transaction Processing Performance Council) is the grand-daddy of all multi-vendor SQL-oriented organizations, formed in August 1988. TPC-H and TPC-DS are the most relevant to SQL over Hadoop, although the read/write nature of those workloads is not suitable for HDFS.
    The Big Data Benchmarking Community (BDBC) has been formed.
  • 30. Power and Performance of Standard SQL
    Everyone loves performance numbers, but that's not the whole story: how much work do you have to do to achieve those numbers?
    A portion of our internal performance numbers are based upon read-only versions of TPC benchmarks. Big SQL is capable of executing all 22 TPC-H queries and all 99 TPC-DS queries without modification.
    Original query (TPC-H Q21):
      SELECT s_name, count(*) AS numwait
      FROM supplier, lineitem l1, orders, nation
      WHERE s_suppkey = l1.l_suppkey
        AND o_orderkey = l1.l_orderkey
        AND o_orderstatus = 'F'
        AND l1.l_receiptdate > l1.l_commitdate
        AND EXISTS (SELECT * FROM lineitem l2
                    WHERE l2.l_orderkey = l1.l_orderkey
                      AND l2.l_suppkey <> l1.l_suppkey)
        AND NOT EXISTS (SELECT * FROM lineitem l3
                        WHERE l3.l_orderkey = l1.l_orderkey
                          AND l3.l_suppkey <> l1.l_suppkey
                          AND l3.l_receiptdate > l3.l_commitdate)
        AND s_nationkey = n_nationkey
        AND n_name = ':1'
      GROUP BY s_name
      ORDER BY numwait desc, s_name
    Re-written for Hive:
      SELECT s_name, count(1) AS numwait
      FROM (SELECT s_name
            FROM (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey
                  FROM (SELECT l_orderkey,
                               count(distinct l_suppkey) as count_suppkey,
                               max(l_suppkey) as max_suppkey
                        FROM lineitem
                        WHERE l_receiptdate > l_commitdate
                        GROUP BY l_orderkey) t2
                  RIGHT OUTER JOIN
                       (SELECT s_name, l_orderkey, l_suppkey
                        FROM (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey
                              FROM (SELECT l_orderkey,
                                           count(distinct l_suppkey) as count_suppkey,
                                           max(l_suppkey) as max_suppkey
                                    FROM lineitem
                                    GROUP BY l_orderkey) t1
                              JOIN (SELECT s_name, l_orderkey, l_suppkey
                                    FROM orders o
                                    JOIN (SELECT s_name, l_orderkey, l_suppkey
                                          FROM nation n
                                          JOIN supplier s
                                            ON s.s_nationkey = n.n_nationkey
                                           AND n.n_name = 'INDONESIA'
                                          JOIN lineitem l
                                            ON s.s_suppkey = l.l_suppkey
                                          WHERE l.l_receiptdate > l.l_commitdate) l1
                                      ON o.o_orderkey = l1.l_orderkey
                                     AND o.o_orderstatus = 'F') l2
                                ON l2.l_orderkey = t1.l_orderkey) a
                        WHERE (count_suppkey > 1)
                           OR ((count_suppkey = 1) AND (l_suppkey <> max_suppkey))) l3
                    ON l3.l_orderkey = t2.l_orderkey) b
            WHERE (count_suppkey is null)
               OR ((count_suppkey = 1) AND (l_suppkey = max_suppkey))) c
      GROUP BY s_name
      ORDER BY numwait DESC, s_name
  • 31. Comparing Big SQL and Hive 0.12 for ad-hoc queries: Big SQL is up to 41x faster than Hive 0.12.*
    * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload" in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard, running at 1TB scale factor; it is materially equivalent with the exception that no update functions are performed. TPC Benchmark and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
  • 32. Comparing Big SQL and Hive 0.12 for decision support queries: Big SQL is 10x faster than Hive 0.12 (total elapsed time).*
    * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor; it is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
  • 33. How many times faster is Big SQL than Hive 0.12? Across the 43 queries, sorted by speed-up ratio (worst to best): average speedup of 20x, maximum speedup of 74x.*
    * Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Modern BI Workload" in a controlled laboratory environment. The 1TB Modern BI Workload is a workload derived from the TPC-DS Benchmark Standard, running at 1TB scale factor; it is materially equivalent with the exception that no updates are performed, and only 43 out of 99 queries are executed. The test measured sequential query execution of all 43 queries for which Hive syntax was publicly available. TPC Benchmark and TPC-DS are trademarks of the Transaction Processing Performance Council (TPC). Configuration: cluster of 9 System x3650HD servers, each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload, configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
  • 34. Big SQL 3.0 Best Practices
    Ensure you have a homogeneous and balanced cluster; utilize the IBM reference architecture.
    Choose an optimized file format (if possible): ORC or Parquet.
    Choose appropriate data types: use the smallest and most precise datatype available.
    Define informational constraints: primary key, foreign key, check constraints.
    Ensure you have good statistics: current and comprehensive.
    Use the full power of SQL available to you; don't constrain yourself to Hive syntax/capability.
  • 35. BigInsights Big SQL 3.0: Summary
    Big SQL provides rich, robust, standards-based SQL support for data stored in BigInsights, using the IBM common client ODBC/JDBC drivers.
    Big SQL fully integrates with SQL applications and tools: existing queries run with no or few modifications*, and existing JDBC- and ODBC-compliant tools can be leveraged.
    Big SQL provides faster and more reliable performance: it uses more efficient access paths to the data, queries no longer need to use MapReduce, and it is optimized to move data over the network more efficiently.
    Big SQL provides enterprise-grade data management: security, auditing, workload management, and more.
  • 37. We Value Your Feedback Don’t forget to submit your Impact session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference. Use the Conference Mobile App or the online Agenda Builder to quickly submit your survey • Navigate to “Surveys” to see a view of surveys for sessions you’ve attended 36
  • 39. Legal Disclaimer • © IBM Corporation 2014. All Rights Reserved. • The information contained in this publication is provided for informational purposes only. While efforts were made to verify the completeness and accuracy of the information contained in this publication, it is provided AS IS without warranty of any kind, express or implied. In addition, this information is based on IBM’s current product plans and strategy, which are subject to change by IBM without notice. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this publication or any other materials. Nothing contained in this publication is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software. • References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates. Product release dates and/or capabilities referenced in this presentation may change at any time at IBM’s sole discretion based on market opportunities or other factors, and are not intended to be a commitment to future product or feature availability in any way. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results. • If the text contains performance statistics or references to benchmarks, insert the following language; otherwise delete: Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. • If the text includes any customer examples, please confirm we have prior written approval from such customer and insert the following language; otherwise delete: All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. • Please review text for proper trademark attribution of IBM products. At first use, each product name must be the full name and include appropriate trademark symbols (e.g., IBM Lotus® Sametime® Unyte™). Subsequent references can drop “IBM” but should include the proper branding (e.g., Lotus Sametime Gateway, or WebSphere Application Server). Please refer to http://www.ibm.com/legal/copytrade.shtml for guidance on which trademarks require the ® or ™ symbol. Do not use abbreviations for IBM product names in your presentation. All product names must be used as adjectives rather than nouns. Please list all of the trademarks that you use in your presentation as follows; delete any not included in your presentation. IBM, the IBM logo, Lotus, Lotus Notes, Notes, Domino, Quickr, Sametime, WebSphere, UC2, PartnerWorld and Lotusphere are trademarks of International Business Machines Corporation in the United States, other countries, or both. 
Unyte is a trademark of WebDialogs, Inc., in the United States, other countries, or both. • If you reference Adobe® in the text, please mark the first use and include the following; otherwise delete: Adobe, the Adobe logo, PostScript, and the PostScript logo are either registered trademarks or trademarks of Adobe Systems Incorporated in the United States, and/or other countries. • If you reference Java™ in the text, please mark the first use and include the following; otherwise delete: Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. • If you reference Microsoft® and/or Windows® in the text, please mark the first use and include the following, as applicable; otherwise delete: Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both. • If you reference Intel® and/or any of the following Intel products in the text, please mark the first use and include those that you use as follows; otherwise delete: Intel, Intel Centrino, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. • If you reference UNIX® in the text, please mark the first use and include the following; otherwise delete: UNIX is a registered trademark of The Open Group in the United States and other countries. • If you reference Linux® in your presentation, please mark the first use and include the following; otherwise delete: Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. • If the text/graphics include screenshots, no actual IBM employee names may be used (even your own), if your screenshots include fictitious company names (e.g., Renovations, Zeta Bank, Acme) please update and insert the following; otherwise delete: All references to [insert fictitious company name] refer to a fictitious company and are used for illustration purposes only. 38
  • 41. What is Hadoop?
    Hadoop is not a single piece of software; you can't install "hadoop". It is an ecosystem of software that works together: Hadoop Core (APIs), HDFS (file system), MapReduce (data processing framework), Hive (SQL access), HBase (NoSQL database), Sqoop (data movement), Oozie (job workflow), and more.
    There is a LOT of "Hadoop" software. However, there is one common component they all build on: HDFS...
  • 42. HDFS configuration (shared-nothing cluster)
    [Diagram: one NameNode (NN) and many DataNodes (DN), each with its own local disks.]
    NN = NameNode, which manages all the metadata.
    DN = DataNode, which reads/writes the file data.
  • 43. HDFS
    Driving principles: files are stored across the entire cluster, and programs are brought to the data, not the data to the program.
    The distributed file system (DFS) stores blocks across the whole cluster: blocks of a single file are distributed across the cluster, and a given block is typically replicated for resiliency. Just like a regular file system, the contents of a file are up to the application.
    [Diagram: a logical file split into blocks 1-4, with replicas of each block spread across the cluster.]
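    The block and replica layout just described can be inspected through the standard HDFS Java API. A minimal sketch, assuming a reachable cluster configuration (core-site.xml/hdfs-site.xml on the classpath) and a placeholder file path:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ShowBlocks {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration(); // picks up the cluster config
              FileSystem fs = FileSystem.get(conf);
              FileStatus st = fs.getFileStatus(new Path("/data/sales.csv")); // placeholder path
              // One BlockLocation per block; each lists the hosts holding a replica.
              for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                  System.out.println("offset=" + loc.getOffset()
                      + " len=" + loc.getLength()
                      + " hosts=" + String.join(",", loc.getHosts()));
              }
          }
      }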
  • 44. Hadoop I/O
    Hadoop (HDFS) doesn't dictate file content/structure; it is just a filesystem! It provides standard APIs to list directories, open files, delete files, etc. In particular, it allows your task to ask "where does each block live?"
    Hadoop provides a framework for creating "splittable" data sources. A data source is typically file(s), but not necessarily. A large input is "split" into pieces, each piece to be processed in parallel, and each split indicates the host(s) on which that split can be found. For files, a split typically refers to an HDFS block, but not necessarily.
    [Diagram: a logical file divided into splits 1-3, each processed by an application instance on the cluster, producing results.]
  • 45. InputFormat
    This splitting process is encapsulated in the InputFormat interface. Hadoop has a large library of InputFormats for various purposes, and you can create and provide your own as well.
    An InputFormat is configured with a set of name/value pair properties. Once configured, you can ask it for a list of InputSplits; each input split has a list of hosts on which the data for the split is recommended to be processed (optional) and a size in bytes (optional). Given an InputSplit, an InputFormat can produce a RecordReader.
    A RecordReader acts as an input stream to read the contents of the split, producing a stream of records. There is no fixed definition of a record; it depends upon the input type.
    Let's look at an example of an InputFormat (a skeleton of the contract follows, then the TextInputFormat example)...
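    A minimal skeleton of the org.apache.hadoop.mapreduce InputFormat contract described above; the split and record logic are deliberately left as placeholders, since real file-based implementations usually extend FileInputFormat, which supplies block-aligned splits for free:

      import java.util.List;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.InputFormat;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.JobContext;
      import org.apache.hadoop.mapreduce.RecordReader;
      import org.apache.hadoop.mapreduce.TaskAttemptContext;

      public class SketchInputFormat extends InputFormat<LongWritable, Text> {
          @Override
          public List<InputSplit> getSplits(JobContext context) {
              // Compute splits from the job configuration; each InputSplit reports
              // getLength() and getLocations() (preferred hosts), both advisory.
              throw new UnsupportedOperationException("placeholder");
          }

          @Override
          public RecordReader<LongWritable, Text> createRecordReader(
                  InputSplit split, TaskAttemptContext context) {
              // Return a RecordReader that streams <key,value> records from the split.
              throw new UnsupportedOperationException("placeholder");
          }
      }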
  • 46. InputFormat example – TextInputFormat
    Purpose: reads input file(s) line by line; each read produces one line of text.
    Configuration: configured with the names of one or more (HDFS) files to process.
    Splits: each split it produces represents a single HDFS block of a file.
    RecordReader: when opened, finds the first newline of the block it is to read; each read produces the next available line of text in the block; may read into the next block to ensure the last line is fully read, even if that block is physically located on another host!
    [Diagram: a logical text file divided into splits, with one reader per split producing records (lines of text).]
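    To see TextInputFormat behave exactly as described, you can ask it for its splits the same way the scheduler does. A small sketch with a placeholder input path:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.mapreduce.InputSplit;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

      public class ShowSplits {
          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration());
              FileInputFormat.addInputPath(job, new Path("/data/sales.csv")); // placeholder
              TextInputFormat fmt = new TextInputFormat();
              // Ask the InputFormat for its splits; typically one per HDFS block.
              for (InputSplit split : fmt.getSplits(job)) {
                  System.out.println("length=" + split.getLength()
                      + " hosts=" + String.join(",", split.getLocations()));
              }
          }
      }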
  • 47. Hadoop MapReduce
    MapReduce is a way of writing parallel processing programs, built around InputFormats (and OutputFormats). Programs are written in two pieces: Map and Reduce.
    Programs are submitted to the MapReduce job scheduler, the JobTracker. The JobTracker asks the InputFormat for its splits and, for each split, tries to schedule the processing on a host on which the split lives; hosts are chosen based upon available processing resources.
    The program is shipped to a host and given a split to process; the output of the program is written back to HDFS.
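    The canonical word-count job makes the two pieces concrete, written here against the org.apache.hadoop.mapreduce API; the input and output paths are placeholders:

      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {
          // Map: one call per input line; emits <word, 1> pairs.
          public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();
              @Override
              protected void map(LongWritable key, Text value, Context ctx)
                      throws IOException, InterruptedException {
                  StringTokenizer it = new StringTokenizer(value.toString());
                  while (it.hasMoreTokens()) {
                      word.set(it.nextToken());
                      ctx.write(word, ONE); // grouped by word in the shuffle
                  }
              }
          }

          // Reduce: one call per unique word, with all of its counts.
          public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable v : values) sum += v.get();
                  ctx.write(key, new IntWritable(sum));
              }
          }

          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word count");
              job.setJarByClass(WordCount.class);
              job.setMapperClass(TokenMapper.class);
              job.setCombinerClass(SumReducer.class); // pre-aggregate on the map side
              job.setReducerClass(SumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path("/data/in"));    // placeholder
              FileOutputFormat.setOutputPath(job, new Path("/data/out")); // placeholder
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }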
  • 48. MapReduce – Mappers
    Mappers are small programs (typically), distributed across the cluster, local to the data. Each is handed a portion of the input data (called a split); it parses, filters, and/or transforms its input and produces grouped <key,value> pairs.
    [Diagram: map phase: each split is read by a mapper and sorted, then copied and merged at the reducers, whose output is written back to the DFS.]
  • 49. MapReduce – The Shuffle
    The shuffle is transparently orchestrated by MapReduce. The output of each mapper is locally grouped together by key, and one node is chosen to process the data for each unique key.
    [Diagram: shuffle: sorted mapper output is copied and merged at the reducers.]
  • 50. MapReduce – Reduce Phase
    Reducers are small programs (typically) that aggregate all of the values for the key that they are responsible for. Each reducer writes its output to its own file.
    [Diagram: reduce phase: merged <key,value> groups are aggregated by the reducers, each writing a logical output file to the DFS.]
  • 51. Joins in MapReduce
    Hadoop is used to group data together at the same reducer based upon the join key. Mappers read blocks from each "table" in the join; the <key> is the value of the join key and the <value> is the record to be joined. Each reducer receives a mix of records from each table with the same join key and produces the results of the join.
    Example query:
      select e.fname, e.lname, d.dept_name
      from employees e, depts d
      where e.salary > 30000
        and d.dept_id = e.dept_id
    [Diagram: mappers over the employees and depts files feed reducers partitioned by department (dept 1, dept 2, dept 3).]
  • 52. Joins in MapReduce (cont.)
    For N-way joins involving different join keys, multiple jobs are used: the first job joins two tables on one key and writes temp files, and a second job joins that output with the next table on the other key.
    Example query:
      select e.fname, e.lname, d.dept_name, p.phone_type, p.phone_number
      from employees e, depts d, emp_phones p
      where e.salary > 30000
        and d.dept_id = e.dept_id
        and p.emp_id = e.emp_id
    [Diagram: job 1 joins employees and depts by dept_id into temp files; job 2 joins those temp files with emp_phones, reducers partitioned by emp_id, producing the results.]
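    A simplified reduce-side join along the lines of the diagrams above, tagging each record with the "table" it came from so the reducer can separate the two sides. The file naming and CSV layout (join key as the first field in both inputs) are assumptions for illustration, job wiring follows the word-count pattern, and note that this sketch buffers both sides of a key in memory, which production implementations avoid with a secondary sort:

      import java.io.IOException;
      import java.util.ArrayList;
      import java.util.List;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileSplit;

      public class ReduceSideJoin {
          // Emits <dept_id, tagged record>; the tag records the source "table".
          public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
              @Override
              protected void map(LongWritable key, Text value, Context ctx)
                      throws IOException, InterruptedException {
                  // Assumption: input file names start with "depts" or "employees",
                  // and dept_id is the first CSV field in both files.
                  String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
                  String tag = file.startsWith("depts") ? "D" : "E";
                  String[] f = value.toString().split(",", 2);
                  if (f.length < 2) return; // skip malformed lines
                  ctx.write(new Text(f[0]), new Text(tag + "|" + f[1]));
              }
          }

          // All rows with the same dept_id meet here; join dept rows to employee rows.
          public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
              @Override
              protected void reduce(Text deptId, Iterable<Text> values, Context ctx)
                      throws IOException, InterruptedException {
                  List<String> depts = new ArrayList<>();
                  List<String> emps = new ArrayList<>();
                  for (Text v : values) {
                      String s = v.toString();
                      (s.startsWith("D|") ? depts : emps).add(s.substring(2));
                  }
                  // Cross product of the two sides for this join key.
                  for (String d : depts)
                      for (String e : emps)
                          ctx.write(deptId, new Text(e + "," + d));
              }
          }
      }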