sqoop
Easy, parallel database import/export



Aaron Kimball
Cloudera Inc.
June 3, 2010
Sqoop is…


        … a suite of tools that connect Hadoop
        and database systems.



▪   Import tables from databases into HDFS for deep analysis
▪   Replicate database schemas in Hive’s metastore
▪   Export MapReduce results back to a database for presentation to
    end-users
In this talk…
▪   How Sqoop works
▪   Working with imported data
▪   Sqoop 1.0 Release
The problem
Structured data in traditional databases cannot be easily
combined with complex data stored in HDFS




                Where’s the bridge?
Sqoop = SQL-to-Hadoop
▪   Easy import of data from many databases to HDFS
▪   Generates code for use in MapReduce applications
▪   Integrates with Hive




[Diagram: MySQL -> sqoop -> Hadoop; custom MapReduce programs reinterpret the data via auto-generated datatype definitions]
Example data pipeline
Key features of Sqoop
▪   JDBC-based implementation
    ▪   Works with many popular database vendors
▪   Auto-generation of tedious user-side code
    ▪   Write MapReduce applications to work with your data, faster
▪   Integration with Hive
    ▪   Allows you to stay in a SQL-based environment
▪   Extensible backend
    ▪   Database-specific code paths for better performance
Example input
mysql> use corp;
Database changed

mysql> describe employees;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| firstname  | varchar(32) | YES  |     | NULL    |                |
| lastname   | varchar(32) | YES  |     | NULL    |                |
| jobtitle   | varchar(64) | YES  |     | NULL    |                |
| start_date | date        | YES  |     | NULL    |                |
| dept_id    | int(11)     | YES  |     | NULL    |                |
+------------+-------------+------+-----+---------+----------------+
Loading into HDFS

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees




▪   Imports “employees” table into HDFS directory
    ▪   Data imported as text or SequenceFiles
    ▪   Optionally compress and split data during import
▪   Generates employees.java for your use
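
As a hedged sketch of those options (flag spellings follow the Sqoop 1.x documentation and may differ slightly in the 1.0-era tool; the mapper count is illustrative):

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --as-sequencefile \
           --compress \
           -m 8

Here --as-sequencefile and --compress choose the storage format and enable compression, and -m controls how many parallel tasks split the import.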
Example output

$ hadoop fs -cat employees/part-00000
0,Aaron,Kimball,engineer,2008-10-01,3
1,John,Doe,manager,2009-01-14,6




▪   Files can be used as input to MapReduce processing
Auto-generated class

public class employees {
  public Integer get_id();
  public String get_firstname();
  public String get_lastname();
  public String get_jobtitle();
  public java.sql.Date get_start_date();
  public Integer get_dept_id();
  // parse() methods that understand text
  // and serialization methods for Hadoop
}
Hive integration
 Imports table definition into Hive after data is imported to HDFS
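
A minimal sketch of the Hive path (assuming the --hive-import flag documented for Sqoop 1.x): one command copies the data to HDFS and then creates and loads the matching Hive table.

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --hive-import

Sqoop derives the Hive CREATE TABLE statement from the table's JDBC metadata, so no manual DDL is needed.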
Export back to a database
mysql> CREATE TABLE ads_results (
       id INT NOT NULL PRIMARY KEY,
       page VARCHAR(256),
       score DOUBLE);


$ sqoop export \
         --connect jdbc:mysql://db.foo.com/corp \
         --table ads_results --export-dir results




▪   Exports “results” dir into “ads_results” table
Additional options
▪   Multiple data representations supported
    ▪   TextFile – ubiquitous; easy import into Hive
    ▪   SequenceFile – for binary data; better compression support,
        higher performance
▪   Supports local and remote Hadoop clusters, databases
▪   Can select a subset of columns, specify a WHERE clause
▪   Controls for delimiters and quote characters:
    ▪   --fields-terminated-by, --lines-terminated-by,
        --optionally-enclosed-by, etc.
    ▪   Also supports delimiter conversion
        (--input-fields-terminated-by, etc.)
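
For example, a hedged sketch combining the column, row-filter, and delimiter controls above (the column list and WHERE predicate are illustrative):

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --columns "id,firstname,lastname,start_date" \
           --where "start_date >= '2010-01-01'" \
           --fields-terminated-by '\t'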
Under the hood…
▪   JDBC
    ▪   Allows Java applications to submit SQL queries to databases
    ▪   Provides metadata about databases (column names, types, etc.)
▪   Hadoop
    ▪   Allows input from arbitrary sources via different InputFormats
    ▪   Provides multiple JDBC-based InputFormats to read from
        databases
    ▪   Can write to arbitrary sinks via OutputFormats – Sqoop includes
        a high-performance database export OutputFormat
InputFormat woes
▪   DBInputFormat allows database records to be used as mapper
    inputs
▪   The trouble with using DBInputFormat directly is:
    ▪   Connecting an entire Hadoop cluster to a database is a
        performance nightmare
    ▪   Databases have lower read bandwidth than HDFS; for repeated
        analyses, much better to make a copy in HDFS first
    ▪   Users must write a class that describes a record from each table
        they want to import or work with (a “DBWritable”)
DBWritable example
class MyRecord implements Writable, DBWritable {
  long msg_id;
  String msg;

  // Populate fields from a database row (DBWritable side)
  public void readFields(ResultSet resultSet)
      throws SQLException {
    this.msg_id = resultSet.getLong(1);
    this.msg = resultSet.getString(2);
  }

  // Populate fields from Hadoop's serialized form (Writable side)
  public void readFields(DataInput in)
      throws IOException {
    this.msg_id = in.readLong();
    this.msg = in.readUTF();
  }

  // The matching write(DataOutput) and write(PreparedStatement)
  // methods are omitted here for space.
}
A direct type mapping

        JDBC Type                    Java Type
        CHAR                         String
        VARCHAR                      String
        LONGVARCHAR                  String
        NUMERIC                      java.math.BigDecimal
        DECIMAL                      java.math.BigDecimal
        BIT                          boolean
        TINYINT                      byte
        SMALLINT                     short
        INTEGER                      int
        BIGINT                       long
        REAL                         float
        FLOAT                        double
        DOUBLE                       double
        BINARY                       byte[]
        VARBINARY                    byte[]
        LONGVARBINARY                byte[]
        DATE                         java.sql.Date
        TIME                         java.sql.Time
        TIMESTAMP                    java.sql.Timestamp
        http://java.sun.com/j2se/1.3/docs/guide/jdbc/getstart/mapping.html
Class auto-generation
Working with Sqoop
▪   Basic workflow:
    ▪   Import initial table with Sqoop
    ▪   Use auto-generated table class in MapReduce analyses
    ▪   … Or write Hive queries over imported tables
    ▪   Perform periodic re-imports to ingest new data
    ▪   Use Sqoop to export results back to databases for online access


▪   Table classes can parse records from delimited files in HDFS
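
A condensed sketch of that workflow, reusing the commands from earlier slides (the analysis jar, class name, and results directory are hypothetical placeholders):

# 1. Initial import
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table employees

# 2. Analyze with MapReduce (or run Hive queries over the imported table)
$ hadoop jar my-analysis.jar com.example.Analysis employees results

# 3. Export the results back for online access
$ sqoop export --connect jdbc:mysql://db.foo.com/corp \
           --table ads_results --export-dir results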
Processing records in MapReduce
void map(LongWritable k, Text v, Context c) {
  MyRecord r = new MyRecord();
  r.parse(v);               // auto-generated text parser
  process(r.get_msg());     // your logic here
  ...
}
Import parallelism
▪   Sqoop uses indexed columns to divide a table into ranges
    ▪   Based on min/max values of the primary key
    ▪   Allows databases to use index range scans
    ▪   Each of several worker tasks imports its own subset of the table
        (see the command sketch below)
▪   MapReduce is used to manage worker tasks
    ▪   Provides fault tolerance
    ▪   Workers write to separate HDFS nodes; wide write bandwidth
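
A hedged command-line sketch of this (--split-by and -m follow the Sqoop 1.x flag names; with eight tasks, each worker issues a bounded range query such as ... WHERE id >= lo AND id < hi over its slice of the key space):

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --split-by id \
           -m 8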
Parallel exports
▪   Results from MapReduce processing
    stored in delimited text files


▪   Sqoop can parse these files and insert the
    results into a database table
Direct-mode imports and exports
▪   MySQL provides mysqldump for high-performance table output
    ▪   Sqoop special-cases jdbc:mysql:// for faster loading
    ▪   With MapReduce, think “distributed mk-parallel-dump”


▪   Similar mechanism used for PostgreSQL
▪   Avoids JDBC overhead


▪   On the other side…
    ▪   mysqlimport provides really fast Sqoop exports
    ▪   Writers stream data into mysqlimport via named FIFOs
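
As a sketch, the direct path is opt-in via the --direct flag (per the Sqoop documentation); the rest of the command is unchanged:

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --direct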
Recent Developments
▪   April 2010: Sqoop moves to github
▪   May 2010: Preparing for 1.0 release
    ▪   Higher-performance pipelined export process
    ▪   Improved support for storage of large data (CLOBs, BLOBs)
    ▪   Refactored API, improved documentation
    ▪   Better platform compatibility:
        ▪   Will work with to-be-released Apache Hadoop 0.21
        ▪   Will work with Cloudera’s Distribution for Hadoop 3
▪   June 2010: Planned 1.0.0 release (in time for Hadoop Summit)


▪   Plenty of bugs to fix and features to add – see me if you want to
    help!
Conclusions
▪   Most database import/export tasks are “turning the crank”
▪   Sqoop can automate a lot of this
    ▪   Allows more efficient use of existing data sources in concert with
        new, complex data
▪   Available as part of Cloudera’s Distribution for Hadoop


                     The pitch: www.cloudera.com/sqoop
                    The code: github.com/cloudera/sqoop


                            aaron@cloudera.com
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
