Hadoop and RDBMS with SQOOP
1. Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP
  Guy Harrison, Director, R&D, Melbourne
  www.guyharrison.net
  Guy.harrison@quest.com
  @guyharrison
9. Options for RDBMS inter-op
  - DBInputFormat: allows database records to be used as mapper inputs. BUT:
    - Not inherently scalable or efficient
    - For repeated analysis, better to stage in Hadoop
    - Tedious coding of DBWritable classes for each table
  - SQOOP:
    - Open source utility provided by Cloudera
    - Configurable command line interface to copy RDBMS->HDFS
    - Support for Hive, HBase
    - Generates Java classes for future M-R tasks
    - Extensible to provide optimized adaptors for specific targets
    - Bi-directional
10. SQOOP details
  - SQOOP import:
    - Divide table into ranges using primary key max/min
    - Create mappers for each range
    - Mappers write to multiple HDFS nodes
    - Creates text or sequence files
    - Generates Java class for resulting HDFS file
    - Generates Hive definition and auto-loads into Hive
  - SQOOP export:
    - Read files in HDFS directory via MapReduce
    - Bulk parallel insert into database table
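The range-based import described on this slide can be sketched roughly as follows. This is a minimal illustration, assuming an integer primary key; the function names `split_ranges` and `range_predicates` are hypothetical and this is not SQOOP's actual splitter code.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous,
    near-equal ranges, one per mapper (hypothetical helper)."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    count = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, count)   # first `extra` ranges get one more key
    ranges = []
    lo = min_id
    for i in range(count):
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def range_predicates(column, ranges):
    """Turn each range into the WHERE predicate one mapper would run."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]


if __name__ == "__main__":
    for predicate in range_predicates("ID", split_ranges(1, 100, 4)):
        print(predicate)
```

Each mapper then runs its own query with one of these predicates, so the ranges are read in parallel.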
11. SQOOP details
  - SQOOP features:
    - Compatible with almost any JDBC-enabled database
    - Auto-load into Hive
    - HBase support
    - Special handling for database LOBs
    - Job management
    - Cluster configuration (jar file distribution)
    - WHERE clause support
    - Open source, and included in Cloudera distributions
  - SQOOP fast paths & plug-ins:
    - Invoke mysqldump, mysqlimport for MySQL jobs
    - Similar fast paths for PostgreSQL
    - Extensibility architecture for 3rd parties (like Quest): Teradata, Netezza, etc.
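As a rough illustration of the command line interface, the sketch below assembles a `sqoop import` invocation that uses the WHERE clause and Hive auto-load features. The helper name `build_import_command`, the JDBC URL, the user, and the table are all hypothetical placeholders, and the flags shown (`--connect`, `--username`, `--table`, `--where`, `--hive-import`, `--num-mappers`) should be checked against the installed SQOOP version.

```python
import shlex


def build_import_command(jdbc_url, username, table,
                         where=None, hive_import=False, num_mappers=4):
    """Assemble the argument list for a sqoop import (hypothetical helper)."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--num-mappers", str(num_mappers),
    ]
    if where:
        args += ["--where", where]       # restrict the rows that are imported
    if hive_import:
        args.append("--hive-import")     # create and load a Hive table
    return args


if __name__ == "__main__":
    # Placeholder connection details, for illustration only.
    cmd = build_import_command(
        "jdbc:mysql://dbhost/sales", "report", "ORDERS",
        where="ORDER_DATE >= '2011-01-01'", hive_import=True,
    )
    print(shlex.join(cmd))
```

Building the arguments as a list and quoting them only for display avoids shell-quoting mistakes in the WHERE clause.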
12. Working with Oracle
  - SQOOP approach is generic and applicable to all RDBMS
  - However, for Oracle it is sub-optimal in some respects:
    - Oracle may parallelize and serialize individual mappers
    - Oracle optimizer may decline to use index range scans
    - Oracle physical storage is often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
    - Primary keys are often not evenly distributed
    - Index range scans use single-block random reads vs. faster multi-block table scans
    - Index range scans load into the Oracle buffer cache
      - Pollutes the cache, increasing IO for other users
      - Limited help to SQOOP since rows are only read once
  - Luckily, SQOOP extensibility allows us to add optimizations for specific targets
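The point about unevenly distributed primary keys can be made concrete with a small sketch. It assumes integer keys and equal-width ranges derived from the key min/max; the helper names and the sample data are invented for illustration.

```python
from bisect import bisect_left, bisect_right


def equal_width_ranges(min_id, max_id, num_mappers):
    """Cut [min_id, max_id] into equal-width inclusive ranges."""
    total = max_id - min_id + 1
    bounds = [min_id + total * i // num_mappers for i in range(num_mappers + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def rows_per_range(sorted_keys, ranges):
    """Count how many existing keys fall inside each range."""
    return [bisect_right(sorted_keys, hi) - bisect_left(sorted_keys, lo)
            for lo, hi in ranges]


if __name__ == "__main__":
    # Invented data: 100 dense keys plus one outlier far away.
    keys = list(range(1, 101)) + [1000]
    ranges = equal_width_ranges(keys[0], keys[-1], 2)
    print(ranges)                        # [(1, 500), (501, 1000)]
    print(rows_per_range(keys, ranges))  # [100, 1]
```

With this data one mapper receives 100 rows and the other a single row, so the import runs only as fast as the overloaded mapper.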
16. Index range scans
  [Diagram: two Hadoop mappers (one with ID > 0 and ID < MAX/2, one with ID > MAX/2), each connected to its own Oracle session. Each session performs an index range scan, reading index blocks and passing through the buffer cache to reach the Oracle table.]
18. Quest/Cloudera OraOop for SQOOP
  - Design goals:
    - Partition data based on physical storage
    - Bypass Oracle buffering
    - Bypass Oracle parallelism
    - Do not require or use indexes
    - Never read the same data block more than once
    - Support esoteric datatypes (eventually)
    - Support RAC clusters
  - Availability:
    - Freely available from www.quest.com/ora-oop
    - Packaged with Cloudera Enterprise
    - Commercial support from Quest/Cloudera within Enterprise distribution
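To illustrate the idea of partitioning by physical storage rather than by key range, the sketch below hands out table extents (runs of data blocks) to mappers so that every block belongs to exactly one mapper and the block counts stay balanced. The extent tuples, the greedy strategy, and the name `assign_extents` are assumptions made for this example; the slides do not describe OraOop's actual allocation algorithm.

```python
import heapq


def assign_extents(extents, num_mappers):
    """Greedily assign extents to the currently least-loaded mapper.

    Each extent is a hypothetical (file_id, start_block, block_count) tuple.
    Every extent is assigned exactly once, so no block is read twice.
    """
    heap = [(0, mapper) for mapper in range(num_mappers)]  # (blocks, mapper)
    heapq.heapify(heap)
    assignment = {mapper: [] for mapper in range(num_mappers)}
    # Placing the largest extents first keeps the final loads closer together.
    for extent in sorted(extents, key=lambda e: e[2], reverse=True):
        load, mapper = heapq.heappop(heap)
        assignment[mapper].append(extent)
        heapq.heappush(heap, (load + extent[2], mapper))
    return assignment


if __name__ == "__main__":
    # Invented extents for illustration.
    extents = [(1, 0, 128), (1, 128, 64), (2, 0, 64), (2, 64, 128)]
    for mapper, owned in assign_extents(extents, 2).items():
        print(mapper, owned, sum(e[2] for e in owned))
```

Because the work is divided by blocks, the split does not depend on indexes or on how primary key values are distributed.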
21. Extending SQOOP
  - SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
    - Extend ManagerFactory (what to handle)
    - Extend ConnManager (DB connection and metadata)
    - For imports: extend DataDrivenDBInputFormat (gets the data)
      - Data allocation (getSplits())
      - Split serialization (“io.serializations” property)
      - Data access logic (createDBRecordReader(), getSelectQuery())
      - Implement progress (nextKeyValue(), getProgress())
    - Similar procedure for extending exports
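The record-reading and progress-reporting part of an import extension can be pictured with a small Python analogy. The real extension points are Java classes; the class below only mirrors the roles of nextKeyValue() and getProgress() and is not SQOOP or Hadoop API. It also assumes the rows of a split are already in memory, which a real reader streaming from the database would not have.

```python
class RangeRecordReader:
    """Python analogy of a record reader for one split (hypothetical)."""

    def __init__(self, rows):
        self._rows = list(rows)   # (key, value) pairs belonging to this split
        self._pos = 0
        self.current = None

    def next_key_value(self):
        """Advance to the next record; return False when the split is done."""
        if self._pos >= len(self._rows):
            return False
        self.current = self._rows[self._pos]
        self._pos += 1
        return True

    def get_progress(self):
        """Fraction of the split consumed so far, between 0.0 and 1.0."""
        if not self._rows:
            return 1.0
        return self._pos / len(self._rows)


if __name__ == "__main__":
    reader = RangeRecordReader([(1, "a"), (2, "b")])
    while reader.next_key_value():
        print(reader.current, reader.get_progress())
```

A reader that streams from a database usually cannot know the row count up front, so its progress would have to be estimated, for example from the position within the split's key or block range.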
22. SQOOP/OraOop best practices
  - Use sequence files for LOBs, OR set inline-lob-limit
  - Directly control datanodes for widest destination bandwidth
    - Can’t rely on mapred.max.maps.per.node
  - Set number of mappers realistically
  - Disable speculative execution (our default)
    - Speculative execution leads to duplicate DB reads
  - Set Oracle row fetch size extra high
    - Keeps the mappers streaming to HDFS
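Several of these practices map onto command line settings. The sketch below assembles such a command, assuming the flags `--num-mappers`, `--fetch-size`, `--as-sequencefile` and `--inline-lob-limit` and the Hadoop property `mapred.map.tasks.speculative.execution` are available in the installed versions. The helper name `build_tuned_import`, the JDBC URL, the table, and the numeric values are placeholders, not recommendations from the slides.

```python
def build_tuned_import(jdbc_url, table, num_mappers, fetch_size,
                       sequence_file=False, inline_lob_limit=None):
    """Assemble a sqoop import argument list with tuning options
    (hypothetical helper)."""
    args = [
        "sqoop", "import",
        # Generic Hadoop options go right after the tool name.
        "-D", "mapred.map.tasks.speculative.execution=false",
        "--connect", jdbc_url,
        "--table", table,
        "--num-mappers", str(num_mappers),
        "--fetch-size", str(fetch_size),
    ]
    # LOB handling: either store as sequence files or set an inline limit.
    if sequence_file:
        args.append("--as-sequencefile")
    elif inline_lob_limit is not None:
        args += ["--inline-lob-limit", str(inline_lob_limit)]
    return args


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(" ".join(build_tuned_import(
        "jdbc:oracle:thin:@dbhost:1521:orcl", "DOCUMENTS",
        num_mappers=8, fetch_size=5000, sequence_file=True,
    )))
```

The slide notes that speculative execution is already disabled by default in OraOop, so the explicit `-D` setting here only makes that choice visible.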
23. Conclusion
  - RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption
  - SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop
  - SQOOP extensions can provide optimizations for specific targets
  - Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value
  - Try out OraOop for SQOOP!