Familiar with Sqoop advanced functions like import with append and last-modified mode. Let me know if anything is required. Happy to help.
Ping me on Google: #bobrupakroy.
2. Getting data from MySQL Database: import
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --split-by ID --target-dir studentdata
$hadoop fs -ls studentdata/
Now we can see multiple part-m files in the folder. This is because Sqoop uses multiple map tasks to produce the output; each mapper writes a subset of the rows, and by default Sqoop uses 4 mappers, i.e. the output is divided among the number of mappers.
Use the cat command to view the contents of each mapper's output:
part-m-00000: row data such as abc 12 TX
part-m-00001: --- no data ---
part-m-00002: row data such as ecg 56 FL
part-m-00003: --- no data ---
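We can confirm the empty part files directly from the shell. A minimal check, assuming the import above wrote to studentdata/ under the current HDFS user directory:
#print each mapper's output; part-m-00001 and part-m-00003 come back empty
$ hadoop fs -cat studentdata/part-m-00000
$ hadoop fs -cat studentdata/part-m-00001
$ hadoop fs -cat studentdata/part-m-00002
$ hadoop fs -cat studentdata/part-m-00003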
3. The reason behind this is that we are not splitting on a primary key, which results in unbalanced tasks where some mappers process more data than others.
Sqoop by default uses the primary key of the table as the splitting column.
Alternatively, we can address this issue by explicitly declaring which column will be used for splitting the rows among the mappers.
#explicitly declaring the split column
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --target-dir studentdata --split-by ID
4. Now using a Primary Key
#adding a primary key to our table in the database
mysql> ALTER TABLE student_details
ADD PRIMARY KEY (ID);
#now use the same query to load the same data from the MySQL database into HDFS
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --target-dir student_details1
Note: it will throw an error if the target directory student_details1 already exists in HDFS.
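If the directory is left over from a previous run, we can remove it first, or ask Sqoop to drop it for us. A minimal sketch; --delete-target-dir simply deletes the target directory before the import starts:
#option 1: remove the old output folder by hand
$ hadoop fs -rm -r student_details1
#option 2: let sqoop delete the target directory itself
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --delete-target-dir --target-dir student_details1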
#check the data
$ hadoop fs -ls /user/hduser/student_details1
$ hadoop fs -cat /user/hduser/student_details1/part-m-00000
#list the available databases and tables
$ sqoop list-databases --connect jdbc:mysql://localhost/ --username root --password root
$ sqoop list-tables --connect jdbc:mysql://localhost/db_1 --username root --password root
5. Controlling Parallelism
We know Sqoop by default uses map tasks to process its job. However, Sqoop also provides the flexibility to change the number of map tasks depending on our job requirements.
With the flexibility of controlling the number of map tasks, i.e. the parallel processing, we can control the load on our database, as shown below.
More mappers doesn't always mean faster performance. The optimal number depends on the type of database, the hardware of the nodes (systems) and the amount of job requests.
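The number of mappers is set with the -m (or --num-mappers) option. A minimal sketch, reusing the earlier connection details; the target directory studentdata2 is just an illustrative name:
#import with 2 map tasks instead of the default 4
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --split-by ID -m 2 --target-dir studentdata2
#the output folder now contains only part-m-00000 and part-m-00001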
7. Now if we want to import only the updated data
This can be done by using 2 modes:
1) Append Mode
2) Last-Modified Mode
1) Append Mode:
First add new rows
mysql> use db_1;
> INSERT INTO student_details(ID,Name,Location) VALUES (44,'albert','CA');
> INSERT INTO student_details(ID,Name,Location) VALUES (55,'Zayn','MI');
#note: we need an integer data type so append mode can detect the last value
mysql> ALTER TABLE student_details
MODIFY ID int(30);
Then import
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --split-by ID --incremental append --check-column ID --last-value 33 --target-dir Appendresults/
Here --incremental append sets the incremental mode, --check-column ID names the column to watch, and --last-value 33 imports only rows whose ID is greater than 33.
Therefore Append Mode is used only when the table is populated with new rows.
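To confirm that only the new rows arrived, we can print the append output back. A minimal check, assuming the import above completed:
#only rows with ID greater than 33 (here 44 and 55) should appear
$ hadoop fs -cat Appendresults/part-m-*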
8. Now if we want to import only the updated data
2) Last-Modified Mode: is used to overcome the limitations of Append Mode with updates to existing rows and columns. Hence it is suitable when the table is populated with new rows as well as modified ones.
Each time the table gets updated, last-modified mode uses the timestamp attached to each update to import only the newly modified rows and columns into HDFS.
9. #add a timestamp column
mysql> ALTER TABLE student_details
ADD COLUMN updated_at
TIMESTAMP DEFAULT CURRENT_TIMESTAMP
ON UPDATE CURRENT_TIMESTAMP;
#add a new column
mysql> ALTER TABLE student_details
ADD COLUMN YEAR char(10)
AFTER Location;
#add values to the new column
mysql> INSERT INTO student_details(YEAR)
VALUES (2010);
OR
mysql> UPDATE student_details
SET YEAR = 2010
WHERE Location = 'FL';
….repeat again for the 2 rows
10. #then import
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --split-by ID --incremental lastmodified --check-column updated_at --last-value '2017-01-15 13:00:28' --target-dir lmresults/
Here --incremental lastmodified sets the incremental mode, and --check-column updated_at compares the timestamp column against --last-value '2017-01-15 13:00:28', importing only rows modified after that time.
11. Append Mode vs Last-Modified Mode
Both append and last-modified mode set themselves apart with unique advantages over each other's limitations.
In append mode you don't have to delete the existing output folder in HDFS; Sqoop creates another file and names it sequentially by itself.
But in last-modified mode Sqoop needs the existing output HDFS folder to be empty (see the sketch below for a workaround).
Also, append mode imports the data starting from the described last-value, while last-modified mode takes all the newly modified rows and columns into account.
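One way around the empty-folder requirement is Sqoop's --merge-key option, which merges the freshly modified rows into the existing output. A minimal sketch, assuming ID uniquely identifies each row:
#re-run the last-modified import into the existing lmresults/ folder, merging on ID
$ sqoop import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --incremental lastmodified --check-column updated_at --last-value '2017-01-15 13:00:28' --merge-key ID --target-dir lmresults/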
12. Next
In real life it might not be efficient or practical to remember the last value each time we run Sqoop. To overcome this issue we have another function called sqoop job, previewed below.
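As a preview, a saved job records the last imported value and reuses it automatically on every run. A minimal sketch; the job name student_append is just an illustrative choice:
#create a saved job that remembers --last-value between runs
$ sqoop job --create student_append -- import --connect jdbc:mysql://localhost/db_1 --username root --password root --table student_details --incremental append --check-column ID --last-value 0 --target-dir Appendresults/
#execute it; sqoop stores the new last value for the next run
$ sqoop job --exec student_append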