From Oracle to Hadoop: 
Unlocking Hadoop for Your RDBMS with 
Apache Sqoop and Other Tools 
Guy Harrison, David Robson, Kate Ting 
{guy.harrison, david.robson}@software.dell.com, 
kate@cloudera.com 
October 16, 2014
About Guy, David, & Kate 
Guy Harrison @guyharrison 
- Executive Director of R&D @ Dell 
- Author of Oracle Performance Survival Guide & MySQL Stored Procedure Programming 
David Robson @DavidR021 
- Principal Technologist @ Dell 
- Sqoop Committer, Lead on Toad for Hadoop & OraOop 
Kate Ting @kate_ting 
- Technical Account Mgr @ Cloudera 
- Sqoop Committer/PMC, Co-author of Apache Sqoop Cookbook
RDBMS and Hadoop
• The relational database reigned supreme for more than two decades
• Hadoop and other non-relational tools have overthrown that hegemony
• We are unlikely to return to a “one size fits all” model based on Hadoop
- Though some will try :-)
• For the foreseeable future, enterprise information architectures will include relational and non-relational stores
Scenarios
1. We need to access RDBMS to make sense of Hadoop data
[Diagram: Weblogs, Flume, Products, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]
Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
[Diagram: Products, Sales, RDBMS, SQOOP, HDFS, YARN/MR1, Analytic output]
Scenarios
1. Reference data is in the RDBMS
2. We want to run analysis outside of the RDBMS
3. We feed YARN/MR output into the RDBMS
[Diagram: Weblogs, Flume, HDFS, YARN/MR1, Analytic output, Weblog Summary, SQOOP, RDBMS]
Scenarios
1. We need to access RDBMS to make sense of Hadoop data
2. We want to use Hadoop to analyse RDBMS data
3. Hadoop output belongs in RDBMS Data warehouse
4. We archive old RDBMS data to Hadoop
[Diagram: BI platform, SQL, HQL, Sales, RDBMS, SQOOP, Old Sales, HDFS]
SQOOP
• SQOOP was created in 2009 by Aaron Kimball as a means of moving data between SQL databases and Hadoop
• It provided a generic implementation for moving data
• It also provided a framework for implementing database-specific optimized connectors
How SQOOP works (import)
[Diagram: SQOOP reads table metadata from the RDBMS and generates Table.java and Hive DDL; map tasks read table data through DataDrivenDBInputFormat and write HDFS files through FileOutputFormat; the Hive DDL creates the Hive table]
SQOOP & Oracle
SQOOP issues with Oracle
• SQOOP uses primary key ranges to divide up data between mappers
• However, deletes tend to hit older key values harder, making key ranges unbalanced
• Data is almost never arranged on disk in key order, so index scans collide on disk
• Load is unbalanced, and IO block requests >> blocks in the table
[Diagram: ORACLE TABLE on DISK; one MAPPER / ORACLE SESSION runs a RANGE SCAN over index blocks for ID > 0 and ID < MAX/2, another for ID > MAX/2]
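The imbalance described on this slide can be illustrated with a small sketch. It is not Sqoop's code: the function names, the 1..1000 id range and the "90% of the older half deleted" pattern are all assumptions made for the example. It splits the key space into equal-width ranges, one per mapper, and counts how many surviving rows land in each range.

```python
def key_range_splits(min_id, max_id, num_mappers):
    """Divide [min_id, max_id] into equal-width, half-open key ranges [lo, hi)."""
    width = (max_id - min_id + 1) / num_mappers
    bounds = [min_id + round(i * width) for i in range(num_mappers)] + [max_id + 1]
    return [(bounds[i], bounds[i + 1]) for i in range(num_mappers)]


def rows_per_split(ids, splits):
    """Count how many existing keys fall into each range."""
    return [sum(1 for k in ids if lo <= k < hi) for lo, hi in splits]


# Hypothetical table: ids 1..1000, with 90% of the older half (ids <= 500) deleted.
surviving_ids = [k for k in range(1, 1001) if k > 500 or k % 10 == 0]

splits = key_range_splits(1, 1000, 4)
counts = rows_per_split(surviving_ids, splits)

print(splits)  # [(1, 251), (251, 501), (501, 751), (751, 1001)]
print(counts)  # [25, 25, 250, 250]
```

Under these assumptions two mappers receive ten times as many rows as the other two, even though every mapper was given a key range of the same width.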
Other problems
• Oracle might run each mapper using a full scan, clobbering the database
• Oracle might run each mapper's query in parallel, clobbering the database
• Sqoop may clobber the database cache
[Charts: elapsed time (s) vs. number of mappers; database load, database time (s) vs. number of mappers]
High speed connector design
• Partition data based on physical storage
• Bypass Oracle buffering
• Bypass Oracle parallelism
• Do not require or use indexes
• Never read the same data block more than once
• Support Oracle datatypes
[Diagram: ORACLE TABLE read by three ORACLE SESSIONs, each feeding a HADOOP MAPPER that writes to HDFS]
Imports (Oracle->Hadoop)
• Uses Oracle block/extent map to equally divide IO
• Uses Oracle direct path (non-buffered) IO for all reads
• Round-robin, sequential or random allocation of blocks to mappers
• All mappers get an equal number of blocks & no block is read twice
• If table is partitioned, each mapper can work on a separate partition, which results in partitioned output
[Diagram: ORACLE TABLE read by three ORACLE SESSIONs, each feeding a HADOOP MAPPER that writes to HDFS]
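The three allocation strategies named on this slide can be sketched as follows. This is an illustration only, not the connector's implementation: `allocate_blocks`, its parameters and the integer block ids are hypothetical, and the real connector works from Oracle's extent map rather than a plain list. The sketch shows the two properties the slide claims: per-mapper block counts differ by at most one, and no block is assigned twice.

```python
import random


def allocate_blocks(block_ids, num_mappers, method="roundrobin", seed=0):
    """Assign every block to exactly one mapper; sizes differ by at most one."""
    blocks = list(block_ids)
    if method == "random":
        # Random allocation: shuffle, then deal the blocks out round-robin.
        random.Random(seed).shuffle(blocks)
        method = "roundrobin"
    assignments = [[] for _ in range(num_mappers)]
    if method == "roundrobin":
        for i, block in enumerate(blocks):
            assignments[i % num_mappers].append(block)
    elif method == "sequential":
        # Contiguous chunks; the first `extra` mappers take one more block.
        base, extra = divmod(len(blocks), num_mappers)
        start = 0
        for m in range(num_mappers):
            size = base + (1 if m < extra else 0)
            assignments[m] = blocks[start:start + size]
            start += size
    else:
        raise ValueError(f"unknown allocation method: {method}")
    return assignments


blocks = list(range(10))
print(allocate_blocks(blocks, 3, "roundrobin"))  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(allocate_blocks(blocks, 3, "sequential"))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Because each mapper reads only its own blocks, adding mappers does not multiply the number of block reads the way overlapping index range scans can.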
Exports (Hadoop->Oracle)
• Optionally leverages Oracle partitions and temporary tables for parallel writes
• Performs MERGE into Oracle table (updates existing rows, inserts new rows)
• Optionally uses Oracle NOLOGGING (faster but unrecoverable)
[Diagram: HDFS read by three HADOOP MAPPERs, each writing through an ORACLE SESSION to the ORACLE TABLE]
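The MERGE behaviour mentioned above (update existing rows, insert new ones) can be modelled in a few lines. This is a sketch of the semantics only, using in-memory dictionaries instead of an Oracle table; `merge_rows`, the `id` key and the sample rows are assumptions made for the example.

```python
def merge_rows(target, incoming, key="id"):
    """Upsert: update rows whose key already exists, insert the rest."""
    merged = {row[key]: dict(row) for row in target}
    updated = inserted = 0
    for row in incoming:
        if row[key] in merged:
            merged[row[key]].update(row)   # matched: update in place
            updated += 1
        else:
            merged[row[key]] = dict(row)   # not matched: insert
            inserted += 1
    return list(merged.values()), updated, inserted


target = [{"id": 1, "city": "Perth"}, {"id": 2, "city": "Sydney"}]
incoming = [{"id": 2, "city": "Melbourne"}, {"id": 3, "city": "Hobart"}]

rows, updated, inserted = merge_rows(target, incoming)
print(rows)               # [{'id': 1, 'city': 'Perth'}, {'id': 2, 'city': 'Melbourne'}, {'id': 3, 'city': 'Hobart'}]
print(updated, inserted)  # 1 1
```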
Import – Oracle to Hadoop
• When data is unclustered (randomly distributed by PK), old SQOOP scales poorly
• Clustered data shows better scalability but is still much slower than the direct approach
• New SQOOP typically outperforms old SQOOP by 5-20 times
• Limiting factors we have seen:
- Data IO bandwidth, or
- Network out of DB, or
- Hadoop CPU
[Chart: elapsed time (s) vs. number of mappers; series: direct=false (unclustered data), direct=false (clustered data), direct=true]
Import - Database overhead
• As you increase mappers in old Sqoop, database load increases rapidly
- (sometimes non-linearly)
• In new Sqoop, queuing occurs only after IO bandwidth is exceeded
[Chart: DB time (minutes) vs. number of mappers; series: Sqoop, Direct]
Export – Hadoop to Oracle
• On export, old SQOOP would hit the database writer bottleneck early on and fail to parallelize
• New SQOOP uses partitioning and direct path inserts
• Typically bottlenecks on write IO on the Oracle side
[Chart: elapsed time (minutes) vs. number of mappers; series: Sqoop, Direct]
Reduction in database load
• 45% reduction in DB CPU
• 83% reduction in elapsed time
• 90% reduction in total database time
• 99.9% reduction in database IO
8 node Hadoop cluster, 1B rows, 310GB
[Chart: % reduction by metric (CPU time, elapsed time, DB time, IO requests, IO time); data labels: 55.31, 83.45, 90.59, 99.98, 99.28]
Replication
• No matter how fast we make SQOOP, it’s a drag to have to run a SQOOP job before every Hadoop job.
• Replicating data into Hadoop cuts down on SQOOP overhead on both sides and avoids stale data.
Shareplex® for Oracle and Hadoop
Sqoop 1.4.5 Summary
Sqoop 1.4.5 without --direct | Sqoop 1.4.5 with --direct
Minimal privileges required | Access to DBA views required
Works on most object types, e.g. IOT | 5x-20x faster performance on tables
Favors Sqoop terminology | Favors Oracle terminology
Database load increases non-linearly | Up to 99% reduction in database IO
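To make the two columns concrete, here is a minimal sketch that builds the argument list for an import with and without `--direct`. The helper `sqoop_import_args` is hypothetical; the flags and the connection details are the ones that appear elsewhere in this deck. Credentials are deliberately left out of the sketch, and nothing is executed: the command is only printed.

```python
import shlex


def sqoop_import_args(connect, username, table, num_mappers=4, direct=False):
    """Build (but do not run) a `sqoop import` argument list."""
    args = [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--table", table,
        "--num-mappers", str(num_mappers),
    ]
    if direct:
        # Per the slides, this switches Oracle imports to the high speed connector.
        args.append("--direct")
    return args


generic = sqoop_import_args("jdbc:oracle:thin:@//localhost:1521/g12c", "OPSG", "OPSG.CUSTOMERS")
direct = sqoop_import_args("jdbc:oracle:thin:@//localhost:1521/g12c", "OPSG", "OPSG.CUSTOMERS", direct=True)

print(shlex.join(direct))
```

In this sketch the two invocations differ only in the trailing `--direct` flag. The trade-offs listed in the table, such as needing access to DBA views, are a matter of database privileges and are not visible on the command line.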
Future of SQOOP
Sqoop 1 Import Architecture
sqoop import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table cities
Sqoop 1 Export Architecture
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop --password sqoop \
--table cities \
--export-dir /temp/cities
Sqoop 1 Challenges
• Concerns with usability
- Cryptic, contextual command line arguments
• Concerns with security
- Client access to Hadoop bin/config, DB
• Concerns with extensibility
- Connectors tightly coupled with data format
Sqoop 2 Design Goals
• Ease of use
- REST API and Java API
• Ease of security
- Separation of responsibilities
• Ease of extensibility
- Connector SDK, focus on pluggability
Ease of Use
Sqoop 1 vs. Sqoop 2
Sqoop 1:
sqoop import \
-Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
-Ddfs.replication=1 \
-Dmapred.map.tasks.speculative.execution=false \
--num-mappers 4 \
--hive-import --hive-table CUSTOMERS --create-hive-table \
--connect jdbc:oracle:thin:@//localhost:1521/g12c \
--username OPSG --password opsg --table OPSG.CUSTOMERS \
--target-dir CUSTOMERS.CUSTOMERS
Ease of Security
Sqoop 1 vs. Sqoop 2
Sqoop 1:
sqoop import \
-Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" \
-Ddfs.replication=1 \
-Dmapred.map.tasks.speculative.execution=false \
--num-mappers 4 \
--hive-import --hive-table CUSTOMERS --create-hive-table \
--connect jdbc:oracle:thin:@//localhost:1521/g12c \
--username OPSG --password opsg --table OPSG.CUSTOMERS \
--target-dir CUSTOMERS.CUSTOMERS
Sqoop 2:
• Role-based access to connection objects
• Prevents misuse and abuse
• Administrators create, edit, delete
• Operators use
Ease of Extensibility
Sqoop 1 vs. Sqoop 2
Sqoop 1: Tight Coupling
Sqoop 2:
• Connectors fetch and store data from db
• Framework handles serialization, format conversion, integration
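The division of labour described for Sqoop 2 can be sketched as below. This is not the Sqoop 2 Connector SDK: every class and function name here (`Connector`, `InMemoryConnector`, `transfer`, `to_csv`, `from_csv`) is hypothetical, and CSV stands in for whatever intermediate format a framework might use. The point is only that the connector moves rows in and out of a store while the framework owns serialization.

```python
import csv
import io
from abc import ABC, abstractmethod


class Connector(ABC):
    """Connector side: only knows how to fetch rows from, or store rows in, a store."""

    @abstractmethod
    def extract(self):
        ...

    @abstractmethod
    def load(self, rows):
        ...


class InMemoryConnector(Connector):
    """Stand-in for a database connector, backed by a Python list."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def extract(self):
        return iter(self.rows)

    def load(self, rows):
        self.rows.extend(rows)


def to_csv(rows):
    """Framework side: serialize rows to an intermediate text format."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Framework side: parse the intermediate format back into rows."""
    return [tuple(r) for r in csv.reader(io.StringIO(text))]


def transfer(source, target):
    """Framework side: move data between any two connectors."""
    payload = to_csv(source.extract())
    target.load(from_csv(payload))
    return payload


src = InMemoryConnector([(1, "Perth"), (2, "Sydney")])
dst = InMemoryConnector()
payload = transfer(src, dst)
print(payload, end="")  # 1,Perth
                        # 2,Sydney
print(dst.rows)         # [('1', 'Perth'), ('2', 'Sydney')]
```

Swapping the intermediate format would mean changing only `to_csv` and `from_csv`; neither connector would need to change.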
Takeaway
• Apache Sqoop
- Bulk data transfer tool between external structured datastores and Hadoop
• Sqoop 1.4.5 now with a --direct parameter option for Oracle
- 5x-20x performance improvement on Oracle table imports
• Sqoop 2
- Ease of use, security, extensibility
Questions? 
Guy Harrison @guyharrison 
David Robson @DavidR021 
Kate Ting @kate_ting 
Visit Dell at Booth #102 
Visit Cloudera at Booth #305 
Book Signing: Today @ 3:15pm 
Office Hours: Tomorrow @ 11am

Weitere ähnliche Inhalte

Was ist angesagt?

HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application ResourcesDataWorks Summit
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Edureka!
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache HiveAvkash Chauhan
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkDatabricks
 
Ansible with oci
Ansible with ociAnsible with oci
Ansible with ociDonghuKIM2
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to YarnApache Apex
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
HBase.pptx
HBase.pptxHBase.pptx
HBase.pptxSadhik7
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Zookeeper 활용 nifi clustering
Zookeeper 활용 nifi clusteringZookeeper 활용 nifi clustering
Zookeeper 활용 nifi clusteringNoahKIM36
 
Bash Script Disk Space Utilization Report and EMail
Bash Script Disk Space Utilization Report and EMailBash Script Disk Space Utilization Report and EMail
Bash Script Disk Space Utilization Report and EMailVCP Muthukrishna
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path HBaseCon
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 

Was ist angesagt? (20)

HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
a Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resourcesa Secure Public Cache for YARN Application Resources
a Secure Public Cache for YARN Application Resources
 
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
 
Introduction to Apache Hive
Introduction to Apache HiveIntroduction to Apache Hive
Introduction to Apache Hive
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Ansible with oci
Ansible with ociAnsible with oci
Ansible with oci
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
HBase.pptx
HBase.pptxHBase.pptx
HBase.pptx
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
HDFS Federation
HDFS FederationHDFS Federation
HDFS Federation
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Zookeeper 활용 nifi clustering
Zookeeper 활용 nifi clusteringZookeeper 활용 nifi clustering
Zookeeper 활용 nifi clustering
 
Bash Script Disk Space Utilization Report and EMail
Bash Script Disk Space Utilization Report and EMailBash Script Disk Space Utilization Report and EMail
Bash Script Disk Space Utilization Report and EMail
 
HDFS Overview
HDFS OverviewHDFS Overview
HDFS Overview
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 

Andere mochten auch

Apache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopApache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopCloudera, Inc.
 
New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2DataWorks Summit
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache SqoopAvkash Chauhan
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Guy Harrison
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and OracleTanel Poder
 
Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup
Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop MeetupSqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup
Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetupaaamase
 
Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database huguk
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionDataWorks Summit
 
Five database trends - updated April 2015
Five database trends - updated April 2015Five database trends - updated April 2015
Five database trends - updated April 2015Guy Harrison
 
Habits of Effective Sqoop Users
Habits of Effective Sqoop UsersHabits of Effective Sqoop Users
Habits of Effective Sqoop UsersKathleen Ting
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Cloudera, Inc.
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoopChristophe Marchal
 
Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)Guy Harrison
 
Replication in Distributed Real Time Database
Replication in Distributed Real Time DatabaseReplication in Distributed Real Time Database
Replication in Distributed Real Time DatabaseGhanshyam Yadav
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsChien Chung Shen
 
Sqooping 50 Million Rows a Day from MySQL
Sqooping 50 Million Rows a Day from MySQL Sqooping 50 Million Rows a Day from MySQL
Sqooping 50 Million Rows a Day from MySQL Kathleen Ting
 

Andere mochten auch (20)

Apache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopApache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for Hadoop
 
New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2
 
Advanced Sqoop
Advanced Sqoop Advanced Sqoop
Advanced Sqoop
 
Introduction to Apache Sqoop
Introduction to Apache SqoopIntroduction to Apache Sqoop
Introduction to Apache Sqoop
 
Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop Hadoop and rdbms with sqoop
Hadoop and rdbms with sqoop
 
Connecting Hadoop and Oracle
Connecting Hadoop and OracleConnecting Hadoop and Oracle
Connecting Hadoop and Oracle
 
Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup
Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop MeetupSqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup
Sqoop2 refactoring for generic data transfer - Hadoop Strata Sqoop Meetup
 
Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database Apache Sqoop: Unlocking Hadoop for Your Relational Database
Apache Sqoop: Unlocking Hadoop for Your Relational Database
 
SQOOP - RDBMS to Hadoop
SQOOP - RDBMS to HadoopSQOOP - RDBMS to Hadoop
SQOOP - RDBMS to Hadoop
 
Sqoop on Spark for Data Ingestion
Sqoop on Spark for Data IngestionSqoop on Spark for Data Ingestion
Sqoop on Spark for Data Ingestion
 
Five database trends - updated April 2015
Five database trends - updated April 2015Five database trends - updated April 2015
Five database trends - updated April 2015
 
Habits of Effective Sqoop Users
Habits of Effective Sqoop UsersHabits of Effective Sqoop Users
Habits of Effective Sqoop Users
 
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
Hadoop World 2011: Hadoop and RDBMS with Sqoop and Other Tools - Guy Harrison...
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)Top 10 tips for Oracle performance (Updated April 2015)
Top 10 tips for Oracle performance (Updated April 2015)
 
Replication in Distributed Real Time Database
Replication in Distributed Real Time DatabaseReplication in Distributed Real Time Database
Replication in Distributed Real Time Database
 
Oracle in Database Hadoop
Oracle in Database HadoopOracle in Database Hadoop
Oracle in Database Hadoop
 
Highlights Of Sqoop2
Highlights Of Sqoop2Highlights Of Sqoop2
Highlights Of Sqoop2
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
 
Sqooping 50 Million Rows a Day from MySQL
Sqooping 50 Million Rows a Day from MySQL Sqooping 50 Million Rows a Day from MySQL
Sqooping 50 Million Rows a Day from MySQL
 

Ähnlich wie From oracle to hadoop with Sqoop and other tools

October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...Yahoo Developer Network
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the SurfaceJosi Aranda
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irdatastack
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraDUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraAndrey Kudryavtsev
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Oracle hadoop let them talk together !
Oracle hadoop let them talk together !Oracle hadoop let them talk together !
Oracle hadoop let them talk together !Laurent Leturgez
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: RevealedSachin Holla
 

Ähnlich wie From oracle to hadoop with Sqoop and other tools (20)

SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala - San Diego Big Data Meetup August 13th 2014
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Big data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.irBig data vahidamiri-tabriz-13960226-datastack.ir
Big data vahidamiri-tabriz-13960226-datastack.ir
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on AuroraDUG'20: 02 - Accelerating apache spark with DAOS on Aurora
DUG'20: 02 - Accelerating apache spark with DAOS on Aurora
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Oracle hadoop let them talk together !
Oracle hadoop let them talk together !Oracle hadoop let them talk together !
Oracle hadoop let them talk together !
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Hadoop and Big Data: Revealed
Hadoop and Big Data: RevealedHadoop and Big Data: Revealed
Hadoop and Big Data: Revealed
 
Hadoop Primer
Hadoop PrimerHadoop Primer
Hadoop Primer
 

Mehr von Guy Harrison

Thriving and surviving the Big Data revolution
Thriving and surviving the Big Data revolutionThriving and surviving the Big Data revolution
Thriving and surviving the Big Data revolutionGuy Harrison
 
Mega trends in information management
Mega trends in information managementMega trends in information management
Mega trends in information managementGuy Harrison
 
Big datacamp2013 share
Big datacamp2013 shareBig datacamp2013 share
Big datacamp2013 shareGuy Harrison
 
Hadoop, Oracle and the big data revolution collaborate 2013
Hadoop, Oracle and the big data revolution collaborate 2013Hadoop, Oracle and the big data revolution collaborate 2013
Hadoop, Oracle and the big data revolution collaborate 2013Guy Harrison
 
Hadoop, oracle and the industrial revolution of data
Hadoop, oracle and the industrial revolution of data Hadoop, oracle and the industrial revolution of data
Hadoop, oracle and the industrial revolution of data Guy Harrison
 
Making the most of ssd in oracle11g
Making the most of ssd in oracle11gMaking the most of ssd in oracle11g
Making the most of ssd in oracle11gGuy Harrison
 
Oracle sql high performance tuning
Oracle sql high performance tuningOracle sql high performance tuning
Oracle sql high performance tuningGuy Harrison
 
Next generation databases july2010
Next generation databases july2010Next generation databases july2010
Next generation databases july2010Guy Harrison
 
Optimize oracle on VMware (April 2011)
Optimize oracle on VMware (April 2011)Optimize oracle on VMware (April 2011)
Optimize oracle on VMware (April 2011)Guy Harrison
 
Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014Optimizing Oracle databases with SSD - April 2014
Optimizing Oracle databases with SSD - April 2014Guy Harrison
 
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...
Understanding Solid State Disk and the Oracle Database Flash Cache (older ver...Guy Harrison
 
High Performance Plsql
High Performance PlsqlHigh Performance Plsql
High Performance PlsqlGuy Harrison
 
Performance By Design
Performance By DesignPerformance By Design
Performance By DesignGuy Harrison
 
Optimize Oracle On VMware (Sep 2011)
Optimize Oracle On VMware (Sep 2011)Optimize Oracle On VMware (Sep 2011)
Optimize Oracle On VMware (Sep 2011)Guy Harrison
 
Thanks for the Memory
Thanks for the MemoryThanks for the Memory
Thanks for the MemoryGuy Harrison
 
Top 10 tips for Oracle performance
Top 10 tips for Oracle performanceTop 10 tips for Oracle performance
Top 10 tips for Oracle performanceGuy Harrison
 
How I learned to stop worrying and love Oracle
How I learned to stop worrying and love OracleHow I learned to stop worrying and love Oracle
How I learned to stop worrying and love OracleGuy Harrison
 
Performance By Design
Performance By DesignPerformance By Design
Performance By DesignGuy Harrison
 
High Performance Plsql
High Performance PlsqlHigh Performance Plsql
High Performance PlsqlGuy Harrison
 


From Oracle to Hadoop with Sqoop and other tools

  • 1. From Oracle to Hadoop: Unlocking Hadoop for Your RDBMS with Apache Sqoop and Other Tools Guy Harrison, David Robson, Kate Ting {guy.harrison, david.robson}@software.dell.com, kate@cloudera.com October 16, 2014
  • 2. About Guy, David, & Kate Guy Harrison @guyharrison - Executive Director of R&D @ Dell - Author of Oracle Performance Survival Guide & MySQL Stored Procedure Programming David Robson @DavidR021 - Principal Technologist @ Dell - Sqoop Committer, Lead on Toad for Hadoop & OraOop Kate Ting @kate_ting - Technical Account Mgr @ Cloudera - Sqoop Committer/PMC, Co-author of Apache Sqoop Cookbook
  • 3.
  • 4.
  • 5.
  • 6. RDBMS and Hadoop
      - The relational database reigned supreme for more than two decades
      - Hadoop and other non-relational tools have overthrown that hegemony
      - We are unlikely to return to a “one size fits all” model based on Hadoop
          - Though some will try
      - For the foreseeable future, enterprise information architectures will include relational and non-relational stores
  • 7. Scenarios
      1. We need to access RDBMS to make sense of Hadoop data
      [Diagram labels: Analytic output, YARN/MR1, HDFS, Weblogs, Products, RDBMS, Flume, SQOOP]
  • 8. Scenarios
      1. Reference data is in the RDBMS
      2. We want to run analysis outside of the RDBMS
      [Diagram labels: Analytic output, HDFS, Products, Sales, RDBMS, SQOOP, YARN/MR1]
  • 9. Scenarios
      1. Reference data is in the RDBMS
      2. We want to run analysis outside of the RDBMS
      3. Feeding YARN/MR output into RDBMS
      [Diagram labels: Analytic output, HDFS, Weblogs, Weblog Summary, RDBMS, Flume, SQOOP, YARN/MR1]
  • 10. Scenarios
      1. We need to access RDBMS to make sense of Hadoop data
      2. We want to use Hadoop to analyse RDBMS data
      3. Hadoop output belongs in RDBMS data warehouse
      4. We archive old RDBMS data to Hadoop
      [Diagram labels: HDFS, BI platform, Sales, RDBMS, SQOOP, HQL, Old Sales, SQL]
  • 11. SQOOP
      - SQOOP was created in 2009 by Aaron Kimball as a means of moving data between SQL databases and Hadoop
      - It provided a generic implementation for moving data
      - It also provided a framework for implementing database-specific optimized connectors
  • 12. How SQOOP works (import)
      [Diagram labels: Hive Table, HDFS, Table Metadata, Table Data, RDBMS, Hive DDL, Table.java, SQOOP, Map Task, DataDrivenDBInputFormat, FileOutputFormat, HDFS files]
  • 14. SQOOP issues with Oracle
      - SQOOP uses primary key ranges to divide up data between mappers
      - However, deletes hit older key values harder, making key ranges unbalanced
      - Data is almost never arranged on disk in key order, so index scans collide on disk
      - Load is unbalanced, and IO block requests >> blocks in the table
      [Diagram: Oracle table on disk; one mapper's Oracle session range-scans ID > 0 and ID < MAX/2, another's range-scans ID > MAX/2, each through index blocks]
  • 15. Other problems
      - Oracle might run each mapper using a full scan – clobbering the database
      - Oracle might run each mapper in parallel – clobbering the database
      - Sqoop may clobber the database cache
      [Charts: elapsed time (s) vs. number of mappers; database load, as database time (s) vs. number of mappers]
  • 16. High speed connector design
      - Partition data based on physical storage
      - Bypass Oracle buffering
      - Bypass Oracle parallelism
      - Do not require or use indexes
      - Never read the same data block more than once
      - Support Oracle datatypes
      [Diagram: Oracle table read by several Oracle sessions, each feeding a Hadoop mapper that writes to HDFS]
  • 17. Imports (Oracle -> Hadoop)
      - Uses Oracle block/extent map to equally divide IO
      - Uses Oracle direct path (non-buffered) IO for all reads
      - Round-robin, sequential or random allocation
      - All mappers get an equal number of blocks & no block is read twice
      - If table is partitioned, each mapper can work on a separate partition – results in partitioned output
      [Diagram: Oracle table read by several Oracle sessions, each feeding a Hadoop mapper that writes to HDFS]
  • 18. Exports (Hadoop -> Oracle)
      - Optionally leverages Oracle partitions and temporary tables for parallel writes
      - Performs MERGE into Oracle table (updates existing rows, inserts new rows)
      - Optionally uses Oracle NOLOGGING (faster but unrecoverable)
      [Diagram: HDFS read by several Hadoop mappers, each writing through an Oracle session into the Oracle table]
  • 19. Import – Oracle to Hadoop
      - When data is unclustered (randomly distributed by PK), old SQOOP scales poorly
      - Clustered data shows better scalability but is still much slower than the direct approach
      - New SQOOP is typically 5-20 times faster
      - Limiting factors we have seen:
          - Data IO bandwidth, or
          - Network out of DB, or
          - Hadoop CPU
      [Chart: elapsed time (s) vs. number of mappers; series: direct=false with unclustered data, direct=false with clustered data, direct=true]
  • 20. Import - Database overhead
      - As you increase mappers in old Sqoop, database load increases rapidly
          - (sometimes non-linearly)
      - In new Sqoop, queuing occurs only after IO bandwidth is exceeded
      [Chart: DB time (minutes) vs. number of mappers; series: Sqoop, Direct]
  • 21. Export – Hadoop to Oracle
      - On export, old SQOOP would hit a database writer bottleneck early on and fail to parallelize
      - New SQOOP uses partitioning and direct path inserts
      - Typically bottlenecks on write IO on the Oracle side
      [Chart: elapsed time (minutes) vs. number of mappers; series: Sqoop, Direct]
  • 22. Reduction in database load
      - 45% reduction in DB CPU
      - 83% reduction in elapsed time
      - 90% reduction in total database time
      - 99.9% reduction in database IO
      8 node Hadoop cluster, 1B rows, 310GB
      [Chart: % reduction by metric (IO time, IO requests, DB time, Elapsed time, CPU time); plotted values 55.31, 83.45, 90.59, 99.98, 99.28]
  • 23. Replication
      - No matter how fast we make SQOOP, it’s a drag to have to run a SQOOP job before every Hadoop job
      - Replicating data into Hadoop cuts down on SQOOP overhead on both sides and avoids stale data
      Shareplex® for Oracle and Hadoop
  • 24. Sqoop 1.4.5 Summary
      Sqoop 1.4.5 without --direct          | Sqoop 1.4.5 with --direct
      Minimal privileges required           | Access to DBA views required
      Works on most object types, e.g. IOT  | 5x-20x faster performance on tables
      Favors Sqoop terminology              | Favors Oracle terminology
      Database load increases non-linearly  | Up to 99% reduction in database IO
  • 26. Sqoop 1 Import Architecture
      sqoop import --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities
  • 27. Sqoop 1 Export Architecture
      sqoop export --connect jdbc:mysql://mysql.example.com/sqoop --username sqoop --password sqoop --table cities --export-dir /temp/cities
  • 28. Sqoop 1 Challenges
      - Concerns with usability
          - Cryptic, contextual command line arguments
      - Concerns with security
          - Client access to Hadoop bin/config, DB
      - Concerns with extensibility
          - Connectors tightly coupled with data format
  • 29. Sqoop 2 Design Goals
      - Ease of use
          - REST API and Java API
      - Ease of security
          - Separation of responsibilities
      - Ease of extensibility
          - Connector SDK, focus on pluggability
  • 30. Ease of Use (Sqoop 1 vs. Sqoop 2)
      Sqoop 1:
      sqoop import -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" -Ddfs.replication=1 -Dmapred.map.tasks.speculative.execution=false --num-mappers 4 --hive-import --hive-table CUSTOMERS --create-hive-table --connect jdbc:oracle:thin:@//localhost:1521/g12c --username OPSG --password opsg --table OPSG.CUSTOMERS --target-dir CUSTOMERS.CUSTOMERS
  • 31. Ease of Security (Sqoop 1 vs. Sqoop 2)
      Sqoop 1:
      sqoop import -Dmapred.child.java.opts="-Djava.security.egd=file:///dev/urandom" -Ddfs.replication=1 -Dmapred.map.tasks.speculative.execution=false --num-mappers 4 --hive-import --hive-table CUSTOMERS --create-hive-table --connect jdbc:oracle:thin:@//localhost:1521/g12c --username OPSG --password opsg --table OPSG.CUSTOMERS --target-dir CUSTOMERS.CUSTOMERS
      Sqoop 2:
      - Role-based access to connection objects
      - Prevents misuse and abuse
      - Administrators create, edit, delete
      - Operators use
  • 32. Ease of Extensibility (Sqoop 1 vs. Sqoop 2)
      Sqoop 1:
      - Tight coupling
      Sqoop 2:
      - Connectors fetch and store data from db
      - Framework handles serialization, format conversion, integration
  • 33. Takeaway
      - Apache Sqoop
          - Bulk data transfer tool between external structured datastores and Hadoop
      - Sqoop 1.4.5 now with a --direct parameter option for Oracle
          - 5x-20x performance improvement on Oracle table imports
      - Sqoop 2
          - Ease of use, security, extensibility
  • 34. Questions? Guy Harrison @guyharrison David Robson @DavidR021 Kate Ting @kate_ting Visit Dell at Booth #102 Visit Cloudera at Booth #305 Book Signing: Today @ 3:15pm Office Hours: Tomorrow @ 11am
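
The contrast between slide 14 (primary key ranges) and slide 17 (round-robin allocation of storage blocks) can be illustrated with a small model. This is only a sketch under stated assumptions, not Sqoop's or OraOop's actual code: the function names, the delete pattern, and the figure of 10 rows per block are all hypothetical and chosen to keep the arithmetic easy to follow.

```python
def key_range_splits(min_id, max_id, num_mappers):
    """Cut [min_id, max_id] into equal-width key ranges, one per mapper."""
    width = (max_id - min_id + 1) / num_mappers
    bounds = [min_id + round(i * width) for i in range(num_mappers)]
    bounds.append(max_id + 1)
    return list(zip(bounds[:-1], bounds[1:]))


def rows_per_split(ids, splits):
    """Count the surviving rows that fall into each half-open [lo, hi) range."""
    return [sum(1 for i in ids if lo <= i < hi) for lo, hi in splits]


def round_robin_blocks(block_ids, num_mappers):
    """Deal storage blocks out to mappers in turn; each block is assigned once."""
    return [block_ids[m::num_mappers] for m in range(num_mappers)]


# Hypothetical table: keys 1..1000, where deletes removed three of every
# four rows among the older keys (1..500) and none of the newer ones.
surviving_ids = [i for i in range(1, 1001) if i > 500 or i % 4 == 0]

key_counts = rows_per_split(surviving_ids, key_range_splits(1, 1000, 4))
print("rows per mapper, key ranges:", key_counts)  # [62, 63, 250, 250]

# Assumption: 10 rows per block, so 625 rows occupy 63 blocks (0..62).
rows_per_block = 10
num_blocks = -(-len(surviving_ids) // rows_per_block)  # ceiling division
block_plan = round_robin_blocks(list(range(num_blocks)), 4)
print("blocks per mapper, round-robin:", [len(b) for b in block_plan])  # [16, 16, 16, 15]
```

In this toy model the key-range split leaves two mappers with about four times the rows of the other two, while the block-based split differs by at most one block per mapper and assigns no block twice.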

Editor's notes

  1. When you think about Dell you probably think about laptops
  2. Or servers that might run databases or a Hadoop cluster, but you probably don't think of Dell as having expertise in either Oracle or Hadoop
  3. But actually Dell now has a billion-dollar software arm which includes the world's number one independent database tool – Toad – used by millions of users and supporting almost every data platform
  4. Guy to improve diagram