Lesson 2 : Hadoop & NoSQL Data Loading 
using Hadoop Tools and ODI12c 
Mark Rittman, CTO, Rittman Mead 
SIOUG and HROUG Conferences, Oct 2014 
T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or 
+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) 
E : info@rittmanmead.com 
W : www.rittmanmead.com
Moving Data In, Around and Out of Hadoop 
•Three stages to Hadoop data movement, with dedicated Apache / other tools 
‣Load : receive files in batch, or in real-time (logs, events) 
‣Transform : process & transform data to answer questions 
‣Store / Export : store in structured form, or export to RDBMS using Sqoop 
[Diagram: Loading Stage → Processing Stage → Store / Export Stage; RDBMS Imports, Real-Time Logs / Events and File / Unstructured Imports feed the Loading Stage, with File Exports and RDBMS Exports leaving the Store / Export Stage]
Lesson 2 : Hadoop Data Loading 
Hadoop Data Loading / Ingestion Fundamentals 
Core Apache Hadoop Tools 
•Apache Hadoop, including MapReduce and HDFS 
‣Scalable, fault-tolerant file storage for Hadoop 
‣Parallel programming framework for Hadoop 
•Apache Hive 
‣SQL abstraction layer over HDFS 
‣Perform set-based ETL within Hadoop 
•Apache Pig, Spark 
‣Dataflow-type languages over HDFS, Hive etc 
‣Extensible through UDFs, streaming etc 
•Apache Flume, Apache Sqoop, Apache Kafka 
‣Real-time and batch loading into HDFS 
‣Modular, fault-tolerant, wide source/target coverage 
Demo 
Hue with CDH5 on the Big Data Lite VM 
Other Tools Typically Used… 
•Python, Scala, Java and other programming languages 
‣For more complex and procedural transformations 
•Shell scripts, sed, awk, regexes etc 
•R and R-on-Hadoop 
‣Typically at the “discovery” phase 
•And down the line - ETL tools to automate the process 
‣ODI, Pentaho Data Integrator etc 
Data Loading into Hadoop 
•Default load type is real-time, streaming loads 
‣Batch / bulk loads only typically used to seed system 
•Variety of sources including web log activity, event streams 
•Target is typically HDFS (Hive) or HBase 
•Data typically lands in “raw state” 
‣Lots of files and events, need to be filtered/aggregated 
‣Typically semi-structured (JSON, logs etc) 
‣High volume, high velocity 
-Which is why we use Hadoop rather than an RDBMS (speed vs. ACID trade-off) 
‣The economics of Hadoop mean it's often possible to archive all incoming data at detail level 
[Diagram: Loading Stage, fed by Real-Time Logs / Events and File / Unstructured Imports]
Apache Flume : Distributed Transport for Log Activity 
•Apache Flume is the standard way to transport log files from source through to target 
•Initial use-case was webserver log files, but it can transport any file from A to B 
•Does not do data transformation, but can send to multiple targets / target types 
•Mechanisms and checks to ensure successful transport of entries 
•Has a concept of “agents”, “sinks” and “channels” 
•Agents collect and forward log data 
•Sinks store it in final destination 
•Channels store log data en-route 
•Simple configuration through INI files 
•Handled outside of ODI12c 
Apache Flume : Agents, Channels and Sinks 
•Multiple agents can be used to capture logs from many sources, combine into one output 
•Needs at least one source agent, and a target agent 
•Agents can be multi-step, handing-off data across the topology 
•Channels store data in files, or in RAM, as a buffer between steps 
•Log files that are being continuously written to have their contents trickle-fed across to the source agent 
•Sink types for Hive, HBase and many others 
•Free software, part of Hadoop platform 
Typical Flume Use Case : Copy Log Files to HDFS / Hive 
•Typical use of Flume is to copy log entries from servers onto Hadoop / HDFS 
•Tightly integrated with Hadoop framework 
•Mirror server log files into HDFS, aggregate logs from >1 server 
•Can aggregate, filter and transform 
incoming data before writing to HDFS 
•Alternatives to log file “tailing” - HTTP GET / PUSH etc 
Flume Source / Target Configuration 
•Conf file for source system agent 
•TCP port, channel size+type, source type 
•Conf file for target system agent 
•TCP port, channel size+type, sink type 
Apache Kafka : Reliable, Message-Based 
•Developed by LinkedIn, designed to address Flume issues around reliability, throughput 
‣(though many of those issues have been addressed since) 
•Designed for persistent messages as the common use case 
‣Website messages, events etc vs. log file entries 
•Consumer (pull) rather than Producer (push) model 
•Supports multiple consumers per message queue 
•More complex to set up than Flume, and can use 
Flume as a consumer of messages 
‣But gaining popularity, especially 
alongside Spark Streaming 
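•As a quick illustration of the consumer-pull model, the console tools that ship with Kafka can publish to and read from a topic - the broker / ZooKeeper addresses and topic name below are illustrative assumptions 
# publish test messages to an assumed "weblogs" topic 
$ kafka-console-producer.sh --broker-list localhost:9092 --topic weblogs 
# any number of consumers can then pull messages from the topic at their own pace 
$ kafka-console-consumer.sh --zookeeper localhost:2181 --topic weblogs --from-beginning 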
GoldenGate for Continuous Streaming to Hadoop 
•Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop 
•Leverages GoldenGate & HDFS / Hive Java APIs 
•Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive) 
•Likely to be formal part of GoldenGate in future release - but usable now 
•Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1 
Bulk-Loading into Hadoop 
•Typically used for initial data load, or for one-off analysis 
•Aim for bulk-loading is to copy external dataset into HDFS 
‣From files (delimited, semi-structured, XML, JSON etc) 
‣From databases or other structured data stores 
•Main tools used for bulk-data loading include 
‣Hadoop FS Shell 
‣Sqoop 
[Diagram: Loading Stage, fed by RDBMS Imports and File / Unstructured Imports]
Hadoop FS Shell Commands 
•Follows typical Unix/Linux command naming 
[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff 
[oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle 
Found 5 items 
drwx------ - oracle hadoop 0 2013-04-27 16:48 /user/oracle/.staging 
drwxrwxrwx - oracle hadoop 0 2012-09-18 17:02 /user/oracle/moviedemo 
drwxrwxrwx - oracle hadoop 0 2012-10-17 15:58 /user/oracle/moviework 
drwxrwxrwx - oracle hadoop 0 2013-05-03 17:49 /user/oracle/my_stuff 
drwxrwxrwx - oracle hadoop 0 2012-08-10 16:08 /user/oracle/stage 
•Additional commands for bulk data movement 
$ hadoop fs -copyFromLocal <local_dir> <hdfs_dir> 
$ hadoop fs -copyToLocal <hdfs_dir> <local_dir> 
Apache Sqoop : SQL to Hadoop 
•Apache top-level project, typically ships with most Hadoop distributions 
•Tool to transfer data from relational database systems 
‣Oracle, mySQL, PostgreSQL, Teradata etc 
•Loads into, and exports out of, Hadoop ecosystem 
‣Uses JDBC drivers to connect to RDBMS source/target 
‣Data transferred in/out of Hadoop 
using parallel Map-only Hadoop jobs 
-Sqoop introspects source / target RDBMS 
to determine structure, table metadata 
-Job tracker splits import / export into 
separate jobs, based on split column(s) 
[Diagram: parallel Map-only jobs transferring data to and from HDFS storage]
Sqoop Command-Line Parameters 
sqoop import --connect jdbc:oracle:thin:@centraldb11gr2.rittmandev.com:1521/ctrl11g.rittmandev.com \
--username blog_refdata --password password \
--query 'SELECT p.post_id, c.cat_name from post_one_cat p, categories c where p.cat_id = c.cat_id and $CONDITIONS' \
--target-dir /user/oracle/post_categories \
--hive-import --hive-overwrite --hive-table post_categories --split-by p.post_id 
•--username, --password : database account username and password 
•--query : SELECT statement to retrieve data (can use --table instead, for a single table) 
•$CONDITIONS, --split-by : column by which MapReduce jobs can be run in parallel 
•--hive-import, --hive-overwrite, --hive-table : name and load mode for the Hive table 
•--target-dir : target HDFS directory to land data in initially (required when using --query) 
Data Storage and Formats within Hadoop Clusters 
•Data landing in Hadoop clusters typically is in raw, unprocessed form 
•May arrive as log files, XML files, JSON documents, machine / sensor data 
•Typically needs to go through a processing, filtering and aggregation phase to be useful 
•Final output of processing stage is usually structured files, or Hive tables 
[Diagram: Loading Stage → Processing Stage → Store / Export Stage; RDBMS Imports, Real-Time Logs / Events and File / Unstructured Imports feed the Loading Stage, with File Exports and RDBMS Exports leaving the Store / Export Stage]
Initial Data Scoping & Discovery using R 
•R is typically used at start of a big data project to get a high-level understanding of the data 
•Can be run as R standalone, or using Oracle R Advanced Analytics for Hadoop 
•Do basic scan of incoming dataset, get counts, determine delimiters etc 
•Distribution of values for columns 
•Basic graphs and data discovery 
•Use findings to drive design of 
parsing logic, Hive data structures, 
need for data scrubbing / correcting etc 
Apache Hive : SQL Access + Table Metadata Over HDFS 
•Apache Hive provides a SQL layer over Hadoop, once we understand the structure (schema) 
of the data we’re working with 
•Exposes HDFS and other Hadoop data as tables and columns 
•Provides a simple SQL dialect for queries called HiveQL 
•SQL queries are turned into MapReduce jobs under-the-covers 
•JDBC and ODBC drivers provide 
access to BI and ETL tools 
•Hive metastore (data dictionary) 
leveraged by many other Hadoop tools 
‣Apache Pig 
‣Cloudera Impala 
‣etc 
SELECT a, sum(b) 
FROM myTable 
WHERE a<100 
GROUP BY a 
[Diagram: the HiveQL query is compiled into Map tasks and Reduce tasks under the covers, returning the final result]
Demo 
Hive within Hue on the Big Data Lite VM 
Hive SerDes & Storage Handlers 
•Plug-in technologies that extend Hive to handle new data formats and semi-structured sources 
•Typically distributed as JAR files, hosted on sites such as GitHub 
•Can be used to parse log files, access data in NoSQL databases, Amazon S3 etc 
CREATE EXTERNAL TABLE apachelog ( 
host STRING, 
identity STRING, 
user STRING, 
time STRING, 
request STRING, 
status STRING, 
size STRING, 
referer STRING, 
agent STRING) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?", 
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" 
) 
STORED AS TEXTFILE 
LOCATION '/user/root/logs'; 
CREATE TABLE tweet_data( 
interactionId string, 
username string, 
content string, 
author_followers int) 
ROW FORMAT SERDE 
'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
'mongo.columns.mapping'='{"interactionId":"interactionId", 
"username":"interaction.interaction.author.username", 
"content":"interaction.interaction.content", 
"author_followers_count":"interaction.twitter.user.followers_count"}' 
) 
TBLPROPERTIES ( 
'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' 
)
Lesson 2 : Hadoop Data Loading 
Oracle’s Hadoop Data Loading Toolkit 
Oracle’s Big Data Products 
•Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing 
‣Cloudera Distribution of Hadoop 
‣Cloudera Manager 
‣Open-source R 
‣Oracle NoSQL Database 
‣Oracle Enterprise Linux + Oracle JVM 
‣New - Oracle Big Data SQL 
•Oracle Big Data Connectors 
‣Oracle Loader for Hadoop (Hadoop > Oracle RDBMS) 
‣Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS) 
‣Oracle R Advanced Analytics for Hadoop 
‣Oracle Data Integrator 12c 
Oracle Big Data Connectors 
•Oracle-licensed utilities to connect Hadoop to Oracle RDBMS 
‣Bulk-extract data from Hadoop to Oracle, or expose HDFS / Hive data as external tables 
‣Run R analysis and processing on Hadoop 
‣Leverage Hadoop compute resources to offload ETL and other work from Oracle RDBMS 
‣Enable Oracle SQL to access and load Hadoop data 
Oracle Loader for Hadoop 
•Oracle technology for accessing Hadoop data, and loading it into an Oracle database 
•Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce 
•Direct-path loads into Oracle Database, partitioned and non-partitioned 
•Online and offline loads 
•Key technology for fast load of 
Hadoop results into Oracle DB
Oracle Direct Connector for HDFS 
•Enables HDFS as a data-source for Oracle Database external tables 
•Effectively provides Oracle SQL access over HDFS 
•Supports data query, or import into Oracle DB 
•Treat HDFS-stored files in the same way as regular files 
‣But with HDFS’s low-cost 
‣… and fault-tolerance 
Oracle R Advanced Analytics for Hadoop 
•Add-in to R that extends capability to Hadoop 
•Gives R the ability to create Map and Reduce functions 
•Extends R data frames to include Hive tables 
‣Automatically run R functions on Hadoop 
by using Hive tables as source 
Oracle Big Data SQL 
•Part of Oracle Big Data 4.0 (BDA-only) 
‣Also requires Oracle Database 12c, Oracle Exadata Database Machine 
•Extends Oracle Data Dictionary to cover Hive 
•Extends Oracle SQL and SmartScan to Hadoop 
•More efficient access than Oracle Direct Connector for HDFS 
•Extends Oracle Security Model over Hadoop 
‣Fine-grained access control 
‣Data redaction, data masking 
[Diagram: SQL queries from the Exadata Database Server use Oracle Big Data SQL to run SmartScan against both the Exadata Storage Servers and the Hadoop cluster]
Oracle Data Integrator 12c 
•Oracle’s data integration tool for loading, transforming and integrating enterprise data 
•Successor to Oracle Warehouse Builder, part of wider Oracle DI platform 
•Connectivity to most RDBMS, file and application sources
Oracle Data Integrator on Hadoop 
•ODI provides an excellent framework for running Hadoop ETL jobs 
‣ELT approach pushes transformations down to Hadoop - leveraging power of cluster 
•Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation 
‣Whilst still preserving RDBMS push-down 
‣Extensible to cover Pig, Spark etc 
•Process orchestration 
•Data quality / error handling 
•Metadata and model-driven 
Demo 
ODI12c within the Big Data Lite VM 
Native Processing using MapReduce, Pig, Hive etc 
•Native processing using Hadoop framework, using Knowledge Module code templates 
•ODI generates native code for each platform, taking a template for each step + adding 
table names, column names, join conditions etc 
‣Easy to extend 
‣Easy to read the code 
‣Makes it possible for ODI to 
support Spark, Pig etc in future 
‣Uses the power of the target 
platform for integration tasks 
-Hadoop-native ETL
Bulk Data Loading into Hadoop through Sqoop, Files 
•ODI12c 12.1.3 comes with Sqoop support, for bulk-import and export out of RDBMS 
‣Preferred method for bulk-loading database sourced data 
•File loading can be done through IKM File to Hive, 
or through Hadoop FS shell 
ODI Integration with Oracle Big Data Adapters & GoldenGate 
•GoldenGate (and Flume) for data loading 
•OLH and ODCH for data exporting 
•ORAAH for analysis 
[Diagram: Oracle GoldenGate streams OLTP changes into Hive/HDFS; ODI transforms data in Hive; OLH/OSCH load Hive/HDFS data into the Oracle DB, while CopyToBDA loads from Oracle into Hive/HDFS; Oracle Big Data SQL federates Hive/HDFS to Oracle, and Query Provider for Hadoop federates Oracle to Hive]
Loading from NoSQL Sources 
•NoSQL databases are often used in conjunction with Hadoop 
•Typically provide a flexible schema, vs. no schema (HDFS) or a tabular schema (Hive) 
•Usually provide CRUD capabilities vs. HDFS's write-once storage 
•Typical use-cases include 
‣High-velocity event loading (Oracle NoSQL Database) 
‣Providing a means to support CRUD over HDFS (HBase) 
‣Loading JSON documents (MongoDB) 
•NoSQL data access usually through APIs 
‣Primarily aimed at app developers 
•Hive storage handlers and other solutions can be 
used in a BI / DW / ETL context 
‣HBase support in ODI12c 12.1.3 
Lesson 2 : Hadoop Data Loading 
Example Scenario : Webserver Log Analysis 
Scenario Overview 
•Rittman Mead website hosts the Rittman Mead blog, 
plus service offerings, news, article downloads etc 
•Typical traffic is around 4-6k pageviews / day 
•Hosted on Amazon AWS, runs on Wordpress 
•We would like to better understand site activity 
‣Which pages are most popular? 
‣Where do our visitors come from? 
‣Which blog articles and authors are most popular? 
‣What other activity around the blog (social media etc) 
influences traffic on the site?
ODI and Big Data Integration Example 
•In this seminar, we’ll show an end-to-end ETL process on Hadoop using ODI12c & BDA 
•Load webserver log data into Hadoop; process, enhance and aggregate it, 
then load final summary table into Oracle Database 12c 
•Originally developed on full Hadoop cluster, but ported to BigDataLite 4.0 VM for seminar 
‣Process using Hadoop framework 
‣Leverage Big Data Connectors 
‣Metadata-based ETL development 
using ODI12c 
‣Real-world example 
Initial ETL & Data Flow through BDA System 
•Five-step process to load, transform, aggregate and filter incoming log data 
•Leverage ODI’s capabilities where possible 
•Make use of Hadoop power + scalability 
[Diagram: 1) Flume agents move Apache HTTP Server log entries (Flume messaging, TCP port 4545 in the example) into log files in HDFS, and IKM File to Hive loads them into the hive_raw_apache_access_log Hive table using the RegEx SerDe; 2) IKM Hive Control Append joins the log data to the posts and categories_sql_extract Hive tables (Sqoop extracts) into log_entries_and_post_detail; 3) a further IKM Hive Control Append join and load into the target Hive table; 4) IKM Hive Transform (Hive streaming through a Python script) adds geocoding from an IP>Country list Hive table; 5) IKM File / Hive to Oracle bulk-unloads the result to the Oracle DB]
Load Incoming Log Files into Hadoop using Flume 
•Web server log entries will be ingested into Hadoop using Flume 
•Flume collector configured on webserver, sink on Hadoop node(s) 
•Log activity buffered using Flume channels 
•Effectively replicates in real-time the log activity on RM webserver 
[Diagram: a Flume agent on the Apache HTTP Server tails the logs and sends Flume messages over TCP port 4545 (example) to a Flume agent on Hadoop Node 1 (HDFS client / Name Node), which then performs HDFS packet writes to the HDFS Data Nodes on Hadoop Nodes 2, 3 … n]
Starting Flume Agents, Check Files Landing in HDFS Directory 
•Start the Flume agents on source and target (BDA) servers 
•Check that incoming file data starts appearing in HDFS 
‣Note - files will be continuously written-to as entries are added to the source log files 
‣Channel size for the source and target agents determines the max no. of events buffered 
‣If the buffer is exceeded, new events are dropped until buffer < channel size 
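•A sketch of the commands involved is shown below - the agent names and conf file paths are illustrative assumptions matching the earlier example config 
# on the webserver : start the source agent 
$ flume-ng agent -n source_agent -c conf -f /etc/flume-ng/conf/source_agent.conf 
# on the BDA / Hadoop node : start the target agent 
$ flume-ng agent -n hdfs_agent -c conf -f /etc/flume-ng/conf/hdfs_agent.conf 
# check that files are starting to land in the HDFS target directory 
$ hadoop fs -ls /user/oracle/rm_logs 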
Demo 
Log Files Landed into HDFS using Flume 
Initial Data Discovery / Exploration using R 
•Run basic analysis and output high-level metrics on the incoming data 
•Copy sample of incoming log files (via Flume) to local filesystem for analysis 
‣Or use ORAAH to access them in HDFS directly 
[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/rm_logs 
Found 5 items 
-rwxr-xr-x 1 oracle oracle 41364846 2014-09-27 23:49 /user/oracle/rm_logs/access_log 
-rwxr-xr-x 1 oracle oracle 359299529 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140323 
-rwxr-xr-x 1 oracle oracle 364337863 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140330 
-rwxr-xr-x 1 oracle oracle 350690205 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140406 
-rwxr-xr-x 1 oracle oracle 328643100 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140413 
[oracle@bigdatalite ~]$ hadoop fs -copyToLocal /user/oracle/rm_logs/access_log-20140323 $HOME/sample_logs 
•Do initial count to check how many rows in file 
[oracle@bigdatalite ~]$ R 
> logfilename <- "/home/oracle/sample_logs/access_log-20140323" 
> length(readLines(logfilename)) 
[1] 1435524 
Initial Data Discovery / Exploration using R 
•Display range of values for log file elements, split on the standard Apache log delimiter char 
•Use output to determine potential column names 
> df <- read.table(logfilename, colClasses="character", header=FALSE, sep="", quote="\"'") 
> str(df) 
'data.frame': 1435524 obs. of 10 variables: 
$ V1 : chr "103.255.250.7" "54.228.204.102" "170.148.198.157" "184.171.240.84" ... 
$ V2 : chr "-" "-" "-" "-" ... 
$ V3 : chr "-" "-" "-" "-" ... 
$ V4 : chr "[16/Mar/2014:03:19:24" "[16/Mar/2014:03:19:30" "[16/Mar/2014:03:19:30" "[16/Mar/2014:03:19:40" ... 
$ V5 : chr "+0000]" "+0000]" "+0000]" "+0000]" ... 
$ V6 : chr "GET / HTTP/1.0" "POST /wp-cron.php?doing_wp_cron=1394939970.6438250541687011718750 HTTP/1.0" "GET /feed/ 
HTTP/1.1" "POST /wp-login.php HTTP/1.0" ... 
$ V7 : chr "301" "200" "200" "200" ... 
$ V8 : chr "235" "-" "36256" "3529" ... 
$ V9 : chr "-" "-" "-" "-" ... 
$ V10: chr "Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com)" "WordPress/3.7.1; 
http://www.rittmanmead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.9) Gecko/20100101 Firefox/10.0.9" "-" ... 
Initial Data Discovery / Exploration using R 
•Apply column names to data frame based on Apache Combined Log Format 
•Convert date/time column to date datatype 
> colnames(df) = c('host','ident','authuser','datetime','timezone','request','status','bytes','referer','browser') 
> df$datetime <- as.Date(df$datetime, "[%d/%b/%Y:%H:%M:%S") 
> str(df) 
'data.frame': 1435524 obs. of 10 variables: 
$ host : chr "103.255.250.7" "54.228.204.102" "170.148.198.157" "184.171.240.84" ... 
$ ident : chr "-" "-" "-" "-" ... 
$ authuser: chr "-" "-" "-" "-" ... 
$ datetime: Date, format: "2014-03-16" "2014-03-16" ... 
$ timezone: chr "+0000]" "+0000]" "+0000]" "+0000]" ... 
$ request : chr "GET / HTTP/1.0" "POST /wp-cron.php?doing_wp_cron=1394939970.6438250541687011718750 HTTP/1.0" 
"GET /feed/ HTTP/1.1" "POST /wp-login.php HTTP/1.0" ... 
$ status : chr "301" "200" "200" "200" ... 
$ bytes : chr "235" "-" "36256" "3529" ... 
$ referer : chr "-" "-" "-" "-" ... 
$ browser : chr "Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com)" 
"WordPress/3.7.1; http://www.rittmanmead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.9) Gecko/20100101 Firefox/10.0.9" 
"-" ... 
Initial Data Discovery / Exploration using R 
•Display first n rows from the data frame 
> head(df,3) 
host ident authuser datetime timezone 
1 103.255.250.7 - - 2014-03-16 +0000] 
2 54.228.204.102 - - 2014-03-16 +0000] 
3 170.148.198.157 - - 2014-03-16 +0000] 
request 
1 GET / HTTP/1.0 
2 POST /wp-cron.php?doing_wp_cron=1394939970.6438250541687011718750 HTTP/1.0 
3 GET /feed/ HTTP/1.1 
status bytes referer 
1 301 235 - 
2 200 - - 
3 200 36256 - 
browser 
1 Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com) 
2 WordPress/3.7.1; http://www.rittmanmead.com 
3 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.9) Gecko/20100101 Firefox/10.0.9
Initial Data Discovery / Exploration using R 
•Display quick charts of column values 
‣Determine high and low values 
‣Understand distribution etc 
> reqs <- table(df$datetime) 
> plot(reqs) 
> status <- table(df$status) 
> barplot(status) 
•Display count of unique site visitors 
> length(unique(df$host)) 
[1] 33147 
•Run similar queries until point where basic structure + content of data is understood
Parse and Process Log Files into Structured Hive Tables 
•Next step in process is to load the incoming log files into a Hive table 
‣Provides structure to data, makes it easier to access individual log elements 
‣Also need to parse the log entries to extract request, date, IP address etc columns 
‣Hive table can then easily be used in downstream transformations 
•Option #1 : Use ODI12c IKM File to Hive (LOAD DATA) KM 
‣Source can be local files or HDFS 
‣Either load file into Hive HDFS area, 
or leave as external Hive table 
‣Ability to use SerDe to parse file data 
•Option #2 : Define Hive table manually using SerDe 
Configuring ODI12c Topology and Models 
•HDFS data servers (source) defined using generic File technology 
•Workaround to support IKM Hive Control Append 
•Leave JDBC driver blank, put HDFS URL in JDBC URL field 
Defining Physical Schema and Model for HDFS Directory 
•Hadoop processes typically access a whole directory of files in HDFS, rather than single one 
•Hive, Pig etc aggregate all files in that directory and treat as single file 
•ODI Models usually point to a single file though - 
how do you set up access correctly? 
Defining Physical Schema and Model for HDFS Directory 
•ODI appends file name to Physical Schema name for Hive access 
•To access a directory, set physical 
schema to parent directory 
•Set model Resource Name to 
directory you want to use as source 
•Note - need to manually enter file / resource names, and the "Test" button does not work for HDFS sources 
Defining Topology and Model for Hive Sources 
•Hive supported “out-of-the-box” with ODI12c (but requires ODIAAH license for KMs) 
•Most recent Hadoop distributions use HiveServer2 rather than HiveServer 
•Need to ensure JDBC drivers support Hive version 
•Use correct JDBC URL format (jdbc:hive2://…) 
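•For example, a HiveServer2 JDBC URL of the following form - the hostname is an illustrative assumption, and 10000 is the usual HiveServer2 default port 
jdbc:hive2://bigdatalite.localdomain:10000/default 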
Hive Tables and Underlying HDFS Storage Permissions 
•Hadoop by default has quite loose security 
•Files in HDFS organized into directories, using Unix-like permissions 
•Hive tables can be created by any user, over directories they have read-access to 
‣But that user might not have write permissions on the underlying directory 
‣Causes mapping execution failures in ODI if directory read-only 
•Therefore ensure you have read/write access to the directories used by Hive, and create tables as the user you'll connect to Hive with over JDBC 
‣Simplest approach - create a Hue user for "oracle", and create Hive tables under that user 
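•For example, ownership and permissions on the log directory can be checked and corrected from the command line - the directory and user below are illustrative 
$ hadoop fs -ls /user/oracle/rm_logs 
# chown usually needs to be run as the hdfs superuser 
$ sudo -u hdfs hadoop fs -chown -R oracle:oracle /user/oracle/rm_logs 
$ sudo -u hdfs hadoop fs -chmod -R 755 /user/oracle/rm_logs 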
Final Model and Datastore Definitions 
•HDFS files for incoming log data, and any other input data 
•Hive tables for ETL targets and downstream processing 
•Use RKM Hive to reverse-engineer column definition from Hive 
Demo 
Viewing the Hive Loading Structures in ODI12c 
Using IKM File to Hive to Load Web Log File Data into Hive 
•Create mapping to load file source (single column for weblog entries) into Hive table 
•Target Hive table should have a column for the incoming log row, plus the parsed columns 
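•A minimal sketch of such a target table is shown below - the column names are illustrative, not the exact DDL used in the seminar 
CREATE TABLE hive_raw_apache_access_log ( 
log_entry STRING, -- raw incoming log row 
host STRING, 
request_date STRING, 
request STRING, 
status STRING, 
size STRING, 
referer STRING, 
agent STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE; 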
Specifying a SerDe to Parse Incoming Hive Data 
•SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats 
•Distributed as JAR file, gives Hive ability to parse semi-structured formats 
•We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns 
•Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option 
Distributing SerDe JAR Files for Hive across Cluster 
•Hive SerDe functionality typically requires additional JARs to be made available to Hive 
•Following steps must be performed across ALL BDA nodes: 
‣Add JAR reference to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive-env.sh 
export HIVE_AUX_JARS_PATH=/usr/lib/hive/lib/hive-contrib-0.12.0-cdh5.0.1.jar:$(echo $HIVE_AUX_JARS_PATH… 
‣Add JAR file to /usr/lib/hadoop 
[root@bdanode1 hadoop]# ls /usr/lib/hadoop/hive-* 
/usr/lib/hadoop/hive-contrib-0.12.0-cdh5.0.1.jar 
‣Restart YARN / MR1 TaskTrackers across cluster 
Executing First ODI12c Mapping 
•EXTERNAL_TABLE option chosen in IKM File to Hive (LOAD DATA), as Flume will continue writing to the file until the source log rotates 
•View results of data load in ODI Studio 
Alternative to ODI IKM File to Hive Loading 
•You could just define a Hive table as EXTERNAL, pointing to the incoming files 
•Add SerDe clause into the table definition, then just read from that table into rest of process 
CREATE EXTERNAL TABLE apachelog ( 
host STRING, 
identity STRING, 
user STRING, 
time STRING, 
request STRING, 
status STRING, 
size STRING, 
referer STRING, 
agent STRING) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?", 
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" 
) 
STORED AS TEXTFILE 
LOCATION '/user/root/logs';
Demo 
Viewing the Parsed Log Data in Hive 
Adding Social Media Datasources to the Hadoop Dataset 
•The log activity from the Rittman Mead website tells us what happened, but not “why” 
•Common customer requirement now is to get a “360 degree view” of their activity 
‣Understand what’s being said about them 
‣External drivers for interest, activity 
‣Understand more about customer intent, opinions 
•One example is to add details of social media mentions, 
likes, tweets and retweets etc to the transactional dataset 
‣Correlate twitter activity with sales increases, drops 
‣Measure impact of social media strategy 
‣Gather and include textual, sentiment, contextual 
data from surveys, media etc 
Example : Supplement Webserver Log Activity with Twitter Data 
•Datasift provide access to the Twitter “firehose” along with Facebook data, Tumblr etc 
•Developer-friendly APIs and ability to define search terms, keywords etc 
•Pull (historical data) or Push (real-time) delivery using many formats / end-points 
‣Most commonly-used consumption format is JSON, loaded into Redis, MongoDB etc 
What is MongoDB? 
•Open-source document-store NoSQL database 
•Flexible data model, each document (record) 
can have its own JSON schema 
•Highly-scalable across multiple nodes (shards) 
•MongoDB databases made up of 
collections of documents 
‣Add new attributes to a document just by using it 
‣Single table (collection) design, no joins etc 
‣Very useful for holding JSON output from web apps 
- for example, twitter data from Datasift
Hive and MongoDB 
•MongoDB Hadoop connector provides a storage handler for Hive tables 
•Rather than store its data in HDFS, the Hive table uses MongoDB for storage instead 
•Define in SerDe properties the Collection elements you want to access, using dot notation 
•https://github.com/mongodb/mongo-hadoop 
CREATE TABLE tweet_data( 
interactionId string, 
username string, 
content string, 
author_followers int) 
ROW FORMAT SERDE 
'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
'mongo.columns.mapping'='{"interactionId":"interactionId", 
"username":"interaction.interaction.author.username", 
"content":"interaction.interaction.content", 
"author_followers_count":"interaction.twitter.user.followers_count"}' 
) 
TBLPROPERTIES ( 
'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' 
)
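•Once defined, the MongoDB-backed table can be queried like any other Hive table - a minimal sketch using the columns from the DDL above 
SELECT username, content, author_followers 
FROM tweet_data 
ORDER BY author_followers DESC 
LIMIT 10; 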
Demo 
MongoDB and the Incoming Twitter Dataset 
Adding MongoDB Datasets into the ODI Repository 
•Define Hive table outside of ODI, using MongoDB storage handler 
•Select the document elements of interest, project into Hive columns 
•Add Hive source to Topology if needed, then use Hive RKM to bring in column metadata 
Demo 
MongoDB Accessed through Hive 
Summary : Data Loading Phase 
•We’ve now landed log activity from the Rittman Mead website into Hadoop, using Flume 
•Data arrives as Apache Webserver log files, is then loaded into a Hive table and parsed 
•Supplemented by social media activity (Twitter) accessed through a MongoDB database 
•Now we can start processing, analysing, supplementing and working with the dataset… 
[Diagram: Loading Stage (✓) → Processing Stage → Store / Export Stage; RDBMS Imports, Real-Time Logs / Events and File / Unstructured Imports feed the Loading Stage, with File Exports and RDBMS Exports leaving the Store / Export Stage]
Lesson 2 : Hadoop & NoSQL Data Loading 
using Hadoop Tools and ODI12c 
Mark Rittman, CTO, Rittman Mead 
SIOUG and HROUG Conferences, Oct 2014 

Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop : Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
Enkitec E4 Barcelona : SQL and Data Integration Futures on Hadoop :
 
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the...
 
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business AnalyticsOracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
Oracle BI Hybrid BI : Mode 1 + Mode 2, Cloud + On-Premise Business Analytics
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...
 
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...Riga dev day 2016   adding a data reservoir and oracle bdd to extend your ora...
Riga dev day 2016 adding a data reservoir and oracle bdd to extend your ora...
 
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive AnalyticsBig Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics
Big Data for Oracle Devs - Towards Spark, Real-Time and Predictive Analytics
 
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
 
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case StudyOracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
Oracle Big Data Spatial & Graph 
Social Media Analysis - Case Study
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle Cloud
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
 
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
 

Kürzlich hochgeladen

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
masabamasaba
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 

Kürzlich hochgeladen (20)

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 

Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c

  • 1. Lesson 2 : Hadoop & NoSQL Data Loading using Hadoop Tools and ODI12c Mark Rittman, CTO, Rittman Mead SIOUG and HROUG Conferences, Oct 2014 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 2. Moving Data In, Around and Out of Hadoop •Three stages to Hadoop data movement, with dedicated Apache / other tools ‣Load : receive files in batch, or in real-time (logs, events) ‣Transform : process & transform data to answer questions ‣Store / Export : store in structured form, or export to RDBMS using Sqoop RDBMS Imports T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) Loading Stage !!!! Processing Stage E : info@rittmanmead.com W : www.rittmanmead.com !!!! Store / Export Stage !!!! Real-Time Logs / Events File / Unstructured Imports File Exports RDBMS Exports
  • 3. Lesson 2 : Hadoop Data Loading Hadoop Data Loading / Ingestion Fundamentals T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 4. Core Apache Hadoop Tools •Apache Hadoop, including MapReduce and HDFS ‣Scaleable, fault-tolerant file storage for Hadoop ‣Parallel programming framework for Hadoop •Apache Hive ‣SQL abstraction layer over HDFS ‣Perform set-based ETL within Hadoop •Apache Pig, Spark ‣Dataflow-type languages over HDFS, Hive etc ‣Extensible through UDFs, streaming etc •Apache Flume, Apache Sqoop, Apache Kafka ‣Real-time and batch loading into HDFS ‣Modular, fault-tolerant, wide source/target coverage T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 5. Demo Hue with CDH5 on the Big Data Lite VM T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 6. Other Tools Typically Used… •Python, Scala, Java and other programming languages ‣For more complex and procedural transformations •Shell scripts, sed, awk, regexes etc •R and R-on-Hadoop ‣Typically at the “discovery” phase •And down the line - ETL tools to automate the process ‣ODI, Pentaho Data Integrator etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 7. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Data Loading into Hadoop •Default load type is real-time, streaming loads ‣Batch / bulk loads only typically used to seed system •Variety of sources including web log activity, event streams •Target is typically HDFS (Hive) or HBase •Data typically lands in “raw state” ‣Lots of files and events, need to be filtered/aggregated ‣Typically semi-structured (JSON, logs etc) ‣High volume, high velocity -Which is why we use Hadoop rather than RBDMS (speed vs. ACID trade-off) ‣Economics of Hadoop means its often possible to archive all incoming data at detail level Loading Stage !!!! Real-Time Logs / Events File / Unstructured Imports
  • 8. Apache Flume : Distributed Transport for Log Activity •Apache Flume is the standard way to transport log files from source through to target •Initial use-case was webserver log files, but can transport any file from A>B •Does not do data transformation, but can send to multiple targets / target types •Mechanisms and checks to ensure successful transport of entries •Has a concept of “agents”, “sinks” and “channels” •Agents collect and forward log data •Sinks store it in final destination •Channels store log data en-route •Simple configuration through INI files •Handled outside of ODI12c T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 9. Apache Flume : Agents, Channels and Sinks •Multiple agents can be used to capture logs from many sources, combine into one output •Needs at least one source agent, and a target agent •Agents can be multi-step, handing-off data across the topology •Channels store data in files, or in RAM, as a buffer between steps •Log files being continuously written to have contents trickle-fed across to source •Sink types for Hive, HBase and many others •Free software, part of Hadoop platform T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 10. Typical Flume Use Case : Copy Log Files to HDFS / Hive •Typical use of Flume is to copy log entries from servers onto Hadoop / HDFS •Tightly integrated with Hadoop framework •Mirror server log files into HDFS, aggregate logs from >1 server •Can aggregate, filter and transform incoming data before writing to HDFS •Alternatives to log file “tailing” - HTTP GET / PUSH etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 11. Flume Source / Target Configuration •Conf file for source system agent •TCP port, channel size+type, source type T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) •Conf file for target system agent •TCP port, channel size+type, sink type E : info@rittmanmead.com W : www.rittmanmead.com
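A minimal sketch of what those two conf files might contain, assuming the example agent names, the TCP port 4545 used elsewhere in the deck and the /user/oracle/rm_logs HDFS landing directory (all other property values are illustrative, not taken from the original slides):

# source-side agent : tail the Apache access log and forward events over Avro on port 4545
source_agent.sources = apache_log
source_agent.channels = memoryChannel
source_agent.sinks = avro_sink
source_agent.sources.apache_log.type = exec
source_agent.sources.apache_log.command = tail -F /var/log/httpd/access_log
source_agent.sources.apache_log.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 1000
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.hostname = bigdatalite.localdomain
source_agent.sinks.avro_sink.port = 4545
source_agent.sinks.avro_sink.channel = memoryChannel

# target-side agent : receive the Avro events and write them into HDFS
target_agent.sources = avro_source
target_agent.channels = memoryChannel
target_agent.sinks = hdfs_sink
target_agent.sources.avro_source.type = avro
target_agent.sources.avro_source.bind = 0.0.0.0
target_agent.sources.avro_source.port = 4545
target_agent.sources.avro_source.channels = memoryChannel
target_agent.channels.memoryChannel.type = memory
target_agent.channels.memoryChannel.capacity = 1000
target_agent.sinks.hdfs_sink.type = hdfs
target_agent.sinks.hdfs_sink.hdfs.path = /user/oracle/rm_logs
target_agent.sinks.hdfs_sink.hdfs.fileType = DataStream
target_agent.sinks.hdfs_sink.channel = memoryChannel

The channel capacity setting is the buffer referred to later in the deck: if events arrive faster than the sink can drain them and the capacity is exceeded, new events are dropped.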
  • 12. Apache Kafka : Reliable, Message-Based •Developed by LinkedIn, designed to address Flume issues around reliability, throughput ‣(though many of those issues have been addressed since) •Designed for persistent messages as the common use case ‣Website messages, events etc vs. log file entries •Consumer (pull) rather than Producer (push) model •Supports multiple consumers per message queue •More complex to set up than Flume, and can use Flume as a consumer of messages ‣But gaining popularity, especially alongside Spark Streaming T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
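As a rough illustration of the consumer-pull model, the console tooling that ships with Kafka can stand in for a producer and a consumer (topic name, broker and ZooKeeper addresses are assumptions; script names follow the Apache Kafka distribution of that era):

# create a topic to hold web log events
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic weblogs

# a producer pushes log lines onto the topic...
tail -F /var/log/httpd/access_log | kafka-console-producer.sh --broker-list localhost:9092 --topic weblogs

# ...and any number of consumers can independently pull them, from the beginning if required
kafka-console-consumer.sh --zookeeper localhost:2181 --topic weblogs --from-beginning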
  • 13. GoldenGate for Continuous Streaming to Hadoop •Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop •Leverages GoldenGate & HDFS / Hive Java APIs •Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive) •Likely to be formal part of GoldenGate in future release - but usable now •Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 14. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Bulk-Loading into Hadoop •Typically used for initial data load, or for one-off analysis •Aim for bulk-loading is to copy external dataset into HDFS ‣From files (delimited, semi-structured, XML, JSON etc) ‣From databases or other structured data stores •Main tools used for bulk-data loading include ‣Hadoop FS Shell ‣Sqoop Loading Stage !!!! RDBMS Imports File / Unstructured Imports
  • 15. Hadoop FS Shell Commands •Follows typical Unix/Linux command naming ! ! ! ! ! ! •Additional commands for bulk data movement [oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff [oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle Found 5 items drwx------ - oracle hadoop 0 2013-04-27 16:48 /user/oracle/.staging drwxrwxrwx - oracle hadoop 0 2012-09-18 17:02 /user/oracle/moviedemo drwxrwxrwx - oracle hadoop 0 2012-10-17 15:58 /user/oracle/moviework drwxrwxrwx - oracle hadoop 0 2013-05-03 17:49 /user/oracle/my_stuff drwxrwxrwx - oracle hadoop 0 2012-08-10 16:08 /user/oracle/stage $ hadoop fs -copyFromLocal <local_dir> <hdfs_dir> $ hadoop fs -copyToLocal <hdfs_dir> <local_dir> T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
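A couple more of the bulk-movement commands referred to above, sketched with illustrative paths and hostnames:

# put / get are equivalent to copyFromLocal / copyToLocal
hadoop fs -put /home/oracle/sample_logs/access_log-20140323 /user/oracle/rm_logs/
hadoop fs -get /user/oracle/rm_logs/access_log-20140323 /home/oracle/sample_logs/

# distcp runs a MapReduce job to copy large datasets between (or within) clusters in parallel
hadoop distcp hdfs://namenode1:8020/user/oracle/rm_logs hdfs://namenode2:8020/user/oracle/rm_logs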
  • 16. Apache Sqoop : SQL to Hadoop •Apache top-level project, typically ships with most Hadoop distributions •Tool to transfer data from relational database systems ‣Oracle, MySQL, PostgreSQL, Teradata etc •Loads into, and exports out of, Hadoop ecosystem ‣Uses JDBC drivers to connect to RDBMS source/target ‣Data transferred in/out of Hadoop using parallel Map-only Hadoop jobs -Sqoop introspects source / target RDBMS to determine structure, table metadata -Job tracker splits import / export into separate jobs, based on split column(s) T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Map Map Map Map HDFS Storage
  • 17. Sqoop Command-Line Parameters sqoop import --connect jdbc:oracle:thin:@centraldb11gr2.rittmandev.com:1521/ctrl11g.rittmandev.com --username blog_refdata --password password --query 'SELECT p.post_id, c.cat_name from post_one_cat p, categories c where p.cat_id = c.cat_id and $CONDITIONS' --target-dir /user/oracle/post_categories --hive-import --hive-overwrite --hive-table post_categories --split-by p.post_id •--username, --password : database account username and password •--query : SELECT statement to retrieve data (can use --table instead, for single table) •$CONDITIONS, --split-by : column by which MapReduce jobs can be run in parallel •--hive-import, --hive-overwrite, --hive-table : name and load mode for Hive table •--target-dir : target HDFS directory to land data in initially (required for SELECT) T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
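Export back out of Hadoop works the same way; a hedged example of unloading a Hive-produced summary table into an Oracle table (table name, HDFS directory and delimiter are assumptions - \001 is the Hive default field delimiter):

sqoop export \
  --connect jdbc:oracle:thin:@centraldb11gr2.rittmandev.com:1521/ctrl11g.rittmandev.com \
  --username blog_refdata --password password \
  --table POSTS_SUMMARY \
  --export-dir /user/hive/warehouse/posts_summary \
  --input-fields-terminated-by '\001' \
  --num-mappers 4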
  • 18. Data Storage and Formats within Hadoop Clusters •Data landing in Hadoop clusters typically is in raw, unprocessed form •May arrive as log files, XML files, JSON documents, machine / sensor data •Typically needs to go through a processing, filtering and aggregation phase to be useful •Final output of processing stage is usually structured files, or Hive tables RDBMS Imports T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) Loading Stage !!!! Processing Stage E : info@rittmanmead.com W : www.rittmanmead.com !!!! Store / Export Stage !!!! Real-Time Logs / Events File / Unstructured Imports File Exports RDBMS Exports
  • 19. Initial Data Scoping & Discovery using R •R is typically used at start of a big data project to get a high-level understanding of the data •Can be run as R standalone, or using Oracle R Advanced Analytics for Hadoop •Do basic scan of incoming dataset, get counts, determine delimiters etc •Distribution of values for columns •Basic graphs and data discovery •Use findings to drive design of parsing logic, Hive data structures, need for data scrubbing / correcting etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 20. Apache Hive : SQL Access + Table Metadata Over HDFS •Apache Hive provides a SQL layer over Hadoop, once we understand the structure (schema) of the data we’re working with •Exposes HDFS and other Hadoop data as tables and columns •Provides a simple SQL dialect for queries called HiveQL •SQL queries are turned into MapReduce jobs under-the-covers •JDBC and ODBC drivers provide access to BI and ETL tools •Hive metastore (data dictionary) leveraged by many other Hadoop tools ‣Apache Pig ‣Cloudera Impala ‣etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) SELECT a, sum(b) FROM myTable WHERE a<100 GROUP BY a E : info@rittmanmead.com W : www.rittmanmead.com Map Task Map Task Map Task Reduce Task Reduce Task Result
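To make the set-based ETL point concrete, a short HiveQL sketch against the parsed log table used later in the examples (the target table name is illustrative):

-- aggregate parsed log entries into a summary table; Hive compiles this into MapReduce jobs
INSERT OVERWRITE TABLE page_request_summary
SELECT request, count(*) AS hits
FROM hive_raw_apache_access_log
GROUP BY request;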
  • 21. Demo Hive within Hue on the Big Data Lite VM T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 22. Hive SerDes & Storage Handlers •Plug-in technologies that extend Hive to handle new data formats and semi-structured sources •Typically distributed as JAR files, hosted on sites such as GitHub •Can be used to parse log files, access data in NoSQL databases, Amazon S3 etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com CREATE EXTERNAL TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|[[^]]*]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" ) STORED AS TEXTFILE LOCATION '/user/root/logs'; CREATE TABLE tweet_data( interactionId string, username string, content string, author_followers int) ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe' STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' WITH SERDEPROPERTIES ( 'mongo.columns.mapping'='{"interactionId":"interactionId", "username":"interaction.interaction.author.username", "content":"interaction.interaction.content", "author_followers_count":"interaction.twitter.user.followers_count"}' ) TBLPROPERTIES ( 'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' )
  • 23. Lesson 2 : Hadoop Data Loading Oracle’s Hadoop Data Loading Toolkit T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 24. Oracle’s Big Data Products •Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing ‣Cloudera Distribution of Hadoop ‣Cloudera Manager ‣Open-source R ‣Oracle NoSQL Database ‣Oracle Enterprise Linux + Oracle JVM ‣New - Oracle Big Data SQL •Oracle Big Data Connectors ‣Oracle Loader for Hadoop (Hadoop > Oracle RDBMS) ‣Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS) ‣Oracle R Advanced Analytics for Hadoop ‣Oracle Data Integrator 12c T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 25. Oracle Big Data Connectors •Oracle-licensed utilities to connect Hadoop to Oracle RDBMS ‣Bulk-extract data from Hadoop to Oracle, or expose HDFS / Hive data as external tables ‣Run R analysis and processing on Hadoop ‣Leverage Hadoop compute resources to offload ETL and other work from Oracle RDBMS ‣Enable Oracle SQL to access and load Hadoop data T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 26. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Oracle Loader for Hadoop •Oracle technology for accessing Hadoop data, and loading it into an Oracle database •Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce •Direct-path loads into Oracle Database, partitioned and non-partitioned •Online and offline loads •Key technology for fast load of Hadoop results into Oracle DB
  • 27. Oracle Direct Connector for HDFS •Enables HDFS as a data-source for Oracle Database external tables •Effectively provides Oracle SQL access over HDFS •Supports data query, or import into Oracle DB •Treat HDFS-stored files in the same way as regular files ‣But with HDFS’s low-cost ‣… and fault-tolerance T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 28. Oracle R Advanced Analytics for Hadoop •Add-in to R that extends capability to Hadoop •Gives R the ability to create Map and Reduce functions •Extends R data frames to include Hive tables ‣Automatically run R functions on Hadoop by using Hive tables as source T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 29. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Oracle Big Data SQL •Part of Oracle Big Data 4.0 (BDA-only) ‣Also requires Oracle Database 12c, Oracle Exadata Database Machine •Extends Oracle Data Dictionary to cover Hive •Extends Oracle SQL and SmartScan to Hadoop •More efficient access than Oracle Direct Connector for HDFS •Extends Oracle Security Model over Hadoop ‣Fine-grained access control ‣Data redaction, data masking Exadata Storage Servers Exadata Database Server Hadoop Cluster Oracle Big Data SQL SQL Queries SmartScan SmartScan
  • 30. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Oracle Data Integrator 12c •Oracle’s data integration tool for loading, transforming and integrating enterprise data •Successor to Oracle Warehouse Builder, part of the wider Oracle DI platform •Connectivity to most RDBMS, file and application sources
  • 31. Oracle Data Integrator on Hadoop •ODI provides an excellent framework for running Hadoop ETL jobs ‣ELT approach pushes transformations down to Hadoop - leveraging power of cluster •Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation ‣Whilst still preserving RDBMS push-down ‣Extensible to cover Pig, Spark etc •Process orchestration •Data quality / error handling •Metadata and model-driven T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 32. Demo ODI12c within the Big Data Lite VM T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 33. Native Processing using MapReduce, Pig, Hive etc •Native processing using Hadoop framework, using Knowledge Module code templates •ODI generates native code for each platform, taking a template for each step + adding table names, column names, join conditions etc ‣Easy to extend ‣Easy to read the code ‣Makes it possible for ODI to support Spark, Pig etc in future ‣Uses the power of the target platform for integration tasks T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com -Hadoop-native ETL
  • 34. Bulk Data Loading into Hadoop through Sqoop, Files •ODI12c 12.1.3 comes with Sqoop support, for bulk-import and export out of RDBMS ‣Preferred method for bulk-loading database sourced data •File loading can be done through IKM File to Hive, or through Hadoop FS shell T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 35. ODI Integration with Oracle Big Data Adapters & GoldenGate •GoldenGate (and Flume) for data loading •OLH and ODCH for data exporting •ORAAH for analysis T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Load to Oracle OLH/OSCH ODI Transform Hive Hive/HDFS Federate Hive/HDFS to Oracle Big Data SQL Oracle DB OLTP Load from Oracle CopyToBDA Hive/HDFS Federate Oracle to Hive Query Provider for Hadoop OGG Hive/HDFS
  • 36. Loading from NoSQL Sources •NoSQL databases are often used in conjunction with Hadoop •Typically provide a flexible schema vs. no schema (HDFS) or tabular schema (Hive) •Usually provide CRUD capabilities vs. HDFS’s write-only storage •Typical use-cases include ‣High-velocity event loading (Oracle NoSQL Database) ‣Providing a means to support CRUD over HDFS (HBase) ‣Loading JSON documents (MongoDB) •NoSQL data access usually through APIs ‣Primarily aimed at app developers •Hive storage handlers and other solutions can be used in a BI / DW / ETL context ‣HBase support in ODI12c 12.1.3 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 37. Lesson 2 : Hadoop Data Loading Example Scenario : Webserver Log Analysis T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 38. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Scenario Overview •Rittman Mead website hosts the Rittman Mead blog, plus service offerings, news, article downloads etc •Typical traffic is around 4-6k pageviews / day •Hosted on Amazon AWS, runs on Wordpress •We would like to better understand site activity ‣Which pages are most popular? ‣Where do our visitors come from? ‣Which blog articles and authors are most popular? ‣What other activity around the blog (social media etc) influences traffic on the site?
  • 39. ODI and Big Data Integration Example •In this seminar, we’ll show an end-to-end ETL process on Hadoop using ODI12c & BDA •Load webserver log data into Hadoop, process, enhance and aggregate it, then load the final summary table into Oracle Database 12c •Originally developed on a full Hadoop cluster, but ported to the BigDataLite 4.0 VM for this seminar ‣Process using Hadoop framework ‣Leverage Big Data Connectors ‣Metadata-based ETL development using ODI12c ‣Real-world example T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 40. Initial ETL & Data Flow through BDA System •Five-step process to load, transform, aggregate and filter incoming log data •Leverage ODI’s capabilities where possible •Make use of Hadoop power + scalability Flume Agent T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) Sqoop extract ! posts (Hive Table) IKM Hive Control Append (Hive table join & load into target hive table) categories_sql_ extract (Hive Table) E : info@rittmanmead.com W : www.rittmanmead.com hive_raw_apache_ access_log (Hive Table) Flume Agent !!!!!! Apache HTTP Server Log Files (HDFS) Flume Messaging TCP Port 4545 (example) IKM File to Hive 1 using RegEx SerDe log_entries_ and post_detail (Hive Table) IKM Hive Control Append (Hive table join & load into target hive table) hive_raw_apache_ access_log (Hive Table) 2 3 Geocoding IP>Country list (Hive Table) IKM Hive Transform (Hive streaming through Python script) 4 5 hive_raw_apache_ access_log (Hive Table) IKM File / Hive to Oracle (bulk unload to Oracle DB)
  • 41. Load Incoming Log Files into Hadoop using Flume •Web server log entries will be ingested into Hadoop using Flume •Flume collector configured on webserver, sink on Hadoop node(s) •Log activity buffered using Flume channels •Effectively replicates in real-time the log activity on RM webserver T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) Apache HTTP Server E : info@rittmanmead.com W : www.rittmanmead.com 1 Hadoop Node 2 ! HDFS Data Node Hadoop Node 3 ! HDFS Data Node Hadoop Node n ! HDFS Data Node Flume Agent Logs Hadoop Node 1 ! HDFS Client Flume Agent HDFS Name Node Flume Messages TCP Port 4545 (example) HDFS packet writes (example)
  • 42. Starting Flume Agents, Check Files Landing in HDFS Directory •Start the Flume agents on source and target (BDA) servers •Check that incoming file data starts appearing in HDFS ‣Note - files will be continuously written-to as entries added to source log files ‣Channel size for source, target agents determines max no. of events buffered ‣If buffer exceeded, new events dropped until buffer < channel size T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
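The agents themselves are started with the flume-ng launcher; a sketch assuming the agent names and conf files from the earlier configuration example:

# on the webserver (source) host
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/source_agent.conf --name source_agent

# on the Hadoop (target) node
flume-ng agent --conf /etc/flume-ng/conf --conf-file /etc/flume-ng/conf/target_agent.conf --name target_agent

# confirm that events are landing in the HDFS directory
hadoop fs -ls /user/oracle/rm_logs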
  • 43. Demo Log Files Landed into HDFS using Flume T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 44. Initial Data Discovery / Exploration using R •Run basic analysis and output high-level metrics on the incoming data •Copy sample of incoming log files (via Flume) to local filesystem for analysis ‣Or use ORAAH to access them in HDFS directly ! ! ! ! ! •Do initial count to check how many rows in file [oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/rm_logs Found 5 items -rwxr-xr-x 1 oracle oracle 41364846 2014-09-27 23:49 /user/oracle/rm_logs/access_log -rwxr-xr-x 1 oracle oracle 359299529 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140323 -rwxr-xr-x 1 oracle oracle 364337863 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140330 -rwxr-xr-x 1 oracle oracle 350690205 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140406 -rwxr-xr-x 1 oracle oracle 328643100 2014-09-27 23:53 /user/oracle/rm_logs/access_log-20140413 [oracle@bigdatalite ~]$ hadoop fs -copyToLocal /user/oracle/rm_logs/access_log-20140323 $HOME/sample_logs [oracle@bigdatalite ~]$ R > logfilename <- "/home/oracle/sample_logs/access_log-20140323" > length(readLines(logfilename)) [1] 1435524 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 45. Initial Data Discovery / Exploration using R •Display range of values for log file elements, split on the standard Apache log delimiter char •Use output to determine potential column names > df <- read.table(logfilename, colClasses="character", header=FALSE, sep="", quote=""'") > str(df) 'data.frame': 1435524 obs. of 10 variables: $ V1 : chr "103.255.250.7" "54.228.204.102" "170.148.198.157" "184.171.240.84" ... $ V2 : chr "-" "-" "-" "-" ... $ V3 : chr "-" "-" "-" "-" ... $ V4 : chr "[16/Mar/2014:03:19:24" "[16/Mar/2014:03:19:30" "[16/Mar/2014:03:19:30" "[16/Mar/2014:03:19:40" ... $ V5 : chr "+0000]" "+0000]" "+0000]" "+0000]" ... $ V6 : chr "GET / HTTP/1.0" "POST /wp-cron.php?doing_wp_cron=1394939970.6438250541687011718750 HTTP/1.0" "GET /feed/ HTTP/1.1" "POST /wp-login.php HTTP/1.0" ... $ V7 : chr "301" "200" "200" "200" ... $ V8 : chr "235" "-" "36256" "3529" ... $ V9 : chr "-" "-" "-" "-" ... $ V10: chr "Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com)" "WordPress/3.7.1; http://www.rittmanmead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.9) Gecko/20100101 Firefox/10.0.9" "-" ... T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 46. Initial Data Discovery / Exploration using R •Apply column names to data frame based on Apache Combined Log Format •Convert date/time column to date datatype > colnames(df) = c('host','ident','authuser','datetime','timezone','request','status','bytes','referer','browser') > df$datetime <- as.Date(df$datetime, "[%d/%b/%Y:%H:%M:%S") > str(df) 'data.frame': 1435524 obs. of 10 variables: $ host : chr "103.255.250.7" "54.228.204.102" "170.148.198.157" "184.171.240.84" ... $ ident : chr "-" "-" "-" "-" ... $ authuser: chr "-" "-" "-" "-" ... $ datetime: Date, format: "2014-03-16" "2014-03-16" ... $ timezone: chr "+0000]" "+0000]" "+0000]" "+0000]" ... $ request : chr "GET / HTTP/1.0" "POST /wp-cron.php?doing_wp_cron=1394939970.6438250541687011718750 HTTP/1.0" "GET /feed/ HTTP/1.1" "POST /wp-login.php HTTP/1.0" ... $ status : chr "301" "200" "200" "200" ... $ bytes : chr "235" "-" "36256" "3529" ... $ referer : chr "-" "-" "-" "-" ... $ browser : chr "Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com)" "WordPress/3.7.1; http://www.rittmanmead.com" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.9) Gecko/20100101 Firefox/10.0.9" "-" ... T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 47. Initial Data Discovery / Exploration using R T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com •Display first n rows from the data frame > head(df,3) host ident authuser datetime timezone 1 103.255.250.7 - - 2014-03-16 +0000] 2 54.228.204.102 - - 2014-03-16 +0000] 3 170.148.198.157 - - 2014-03-16 +0000] request 1 GET / HTTP/1.0 2 POST /wp-cron.php?doing_wp_cron=1394939970.6438250541687011718750 HTTP/1.0 3 GET /feed/ HTTP/1.1 status bytes referer 1 301 235 - 2 200 - - 3 200 36256 - browser 1 Mozilla/5.0 (compatible; monitis.com - free monitoring service; http://monitis.com) 2 WordPress/3.7.1; http://www.rittmanmead.com 3 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.9) Gecko/20100101 Firefox/10.0.9
  • 48. Initial Data Discovery / Exploration using R T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com •Display quick charts of column values ‣Determine high and low values ‣Understand distribution etc > ! reqs <- table(df$datetime) > plot(reqs) > ! status <- table(df$status) > ! barplot(status) •Display count of unique site visitors ! > length(unique(df$host)) ! [1] 33147 •Run similar queries until point where basic structure + content of data is understood
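A few more of the sort of “similar queries” referred to above, as a base-R sketch against the df data frame built earlier:

# ten most frequently requested URLs
head(sort(table(df$request), decreasing = TRUE), 10)

# breakdown of HTTP status codes as percentages
round(100 * prop.table(table(df$status)), 2)

# ten busiest client hosts
head(sort(table(df$host), decreasing = TRUE), 10)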
  • 49. Parse and Process Log Files into Structured Hive Tables •Next step in process is to load the incoming log files into a Hive table ‣Provides structure to data, makes it easier to access individual log elements ‣Also need to parse the log entries to extract request, date, IP address etc columns ‣Hive table can then easily be used in downstream transformations •Option #1 : Use ODI12c IKM File to Hive (LOAD DATA) KM ‣Source can be local files or HDFS ‣Either load file into Hive HDFS area, or leave as external Hive table ‣Ability to use SerDe to parse file data ‣Option #2 : Define Hive table manually using SerDe T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 50. Configuring ODI12c Topology and Models •HDFS data servers (source) defined using generic File technology •Workaround to support IKM Hive Control Append •Leave JDBC driver blank, put HDFS URL in JDBC URL field T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 51. Defining Physical Schema and Model for HDFS Directory •Hadoop processes typically access a whole directory of files in HDFS, rather than single one •Hive, Pig etc aggregate all files in that directory and treat as single file •ODI Models usually point to a single file though - how do you set up access correctly? T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 52. Defining Physical Schema and Model for HDFS Directory •ODI appends file name to Physical Schema name for Hive access •To access a directory, set physical schema to parent directory •Set model Resource Name to directory you want to use as source •Note - need to manually enter file/ resource names, and “Test” button does not work for HDFS sources T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 53. Defining Topology and Model for Hive Sources •Hive supported “out-of-the-box” with ODI12c (but requires ODIAAH license for KMs) •Most recent Hadoop distributions use HiveServer2 rather than HiveServer •Need to ensure JDBC drivers support Hive version •Use correct JDBC URL format (jdbc:hive2://…) T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
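For reference, a HiveServer2 URL looks like the line below, and beeline gives a quick way to test the connection outside ODI (hostname, port 10000 and user are the Big Data Lite / CDH defaults, shown here as assumptions):

jdbc:hive2://bigdatalite.localdomain:10000/default

beeline -u jdbc:hive2://bigdatalite.localdomain:10000/default -n oracle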
  • 54. Hive Tables and Underlying HDFS Storage Permissions •Hadoop by default has quite loose security •Files in HDFS organized into directories, using Unix-like permissions •Hive tables can be created by any user, over directories they have read-access to ‣But that user might not have write permissions on the underlying directory ‣Causes mapping execution failures in ODI if directory read-only •Therefore ensure you have read/write access to directories used by Hive, and create tables under the HDFS user you’ll access files through JDBC ‣Simplest approach - create Hue user for “oracle”, create Hive tables under that user T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
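The sort of housekeeping this implies, sketched against the log directory used in these examples (run as the HDFS superuser):

# check current ownership and permissions on the Flume landing directory
hadoop fs -ls /user/oracle

# make sure the user that ODI / HiveServer2 connects as can read and write it
sudo -u hdfs hadoop fs -chown -R oracle:oracle /user/oracle/rm_logs
sudo -u hdfs hadoop fs -chmod -R 775 /user/oracle/rm_logs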
  • 55. Final Model and Datastore Definitions •HDFS files for incoming log data, and any other input data •Hive tables for ETL targets and downstream processing •Use RKM Hive to reverse-engineer column definition from Hive T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 56. Demo Viewing the Hive Loading Structures in ODI12c T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 57. Using IKM File to Hive to Load Web Log File Data into Hive •Create mapping to load file source (single column for weblog entries) into Hive table •Target Hive table should have column for incoming log row, and parsed columns T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 58. Specifying a SerDe to Parse Incoming Hive Data •SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats •Distributed as JAR file, gives Hive ability to parse semi-structured formats •We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns •Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 59. Distributing SerDe JAR Files for Hive across Cluster •Hive SerDe functionality typically requires additional JARs to be made available to Hive •Following steps must be performed across ALL BDA nodes: ‣Add JAR reference to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive.env.sh ! ! ! ‣Add JAR file to /usr/lib/hadoop ! ! ! ‣Restart YARN / MR1 TaskTrackers across cluster export HIVE_AUX_JARS_PATH=/usr/lib/hive/lib/hive-contrib-0.12.0-cdh5.0.1.jar:$ (echo $HIVE_AUX_JARS_PATH… [root@bdanode1 hadoop]# ls /usr/lib/hadoop/hive-* /usr/lib/hadoop/hive-contrib-0.12.0-cdh5.0.1.jar T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
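For ad-hoc testing in a single Hive session, the SerDe JAR can also be added on the fly rather than through HIVE_AUX_JARS_PATH (the path below is the one from the slide; the setting only lasts for that session, so cluster-wide ODI jobs still need the JAR distributed as described above):

ADD JAR /usr/lib/hive/lib/hive-contrib-0.12.0-cdh5.0.1.jar;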
  • 60. Executing First ODI12c Mapping •EXTERNAL_TABLE option chosen in IKM File to Hive (LOAD DATA) as Flume will continue writing to it until the source log rotates •View results of data load in ODI Studio T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 61. Alternative to ODI IKM File to Hive Loading •You could just define a Hive table as EXTERNAL, pointing to the incoming files •Add SerDe clause into the table definition, then just read from that table into rest of process T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com CREATE EXTERNAL TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|[[^]]*])([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" ) STORED AS TEXTFILE LOCATION '/user/root/logs';
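Once the external table is defined, the parsed columns can be queried like any other Hive table, for example:

-- quick sanity checks on the parsed log data
SELECT status, count(*) AS requests FROM apachelog GROUP BY status;
SELECT count(DISTINCT host) AS unique_visitors FROM apachelog;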
  • 62. Demo Viewing the Parsed Log Data in Hive T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 63. Adding Social Media Datasources to the Hadoop Dataset •The log activity from the Rittman Mead website tells us what happened, but not “why” •Common customer requirement now is to get a “360 degree view” of their activity ‣Understand what’s being said about them ‣External drivers for interest, activity ‣Understand more about customer intent, opinions •One example is to add details of social media mentions, likes, tweets and retweets etc to the transactional dataset ‣Correlate twitter activity with sales increases, drops ‣Measure impact of social media strategy ‣Gather and include textual, sentiment, contextual data from surveys, media etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 64. Example : Supplement Webserver Log Activity with Twitter Data •Datasift provide access to the Twitter “firehose” along with Facebook data, Tumblr etc •Developer-friendly APIs and ability to define search terms, keywords etc •Pull (historical data) or Push (real-time) delivery using many formats / end-points ‣Most commonly-used consumption format is JSON, loaded into Redis, MongoDB etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 65. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com What is MongoDB? •Open-source document-store NoSQL database •Flexible data model, each document (record) can have its own JSON schema •Highly-scalable across multiple nodes (shards) •MongoDB databases made up of collections of documents ‣Add new attributes to a document just by using it ‣Single table (collection) design, no joins etc ‣Very useful for holding JSON output from web apps - for example, twitter data from Datasift
  • 66. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Hive and MongoDB •MongoDB Hadoop connector provides a storage handler for Hive tables •Rather than store its data in HDFS, the Hive table uses MongoDB for storage instead •Define in SerDe properties the Collection elements you want to access, using dot notation •https://github.com/mongodb/mongo-hadoop CREATE TABLE tweet_data( interactionId string, username string, content string, author_followers int) ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe' STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' WITH SERDEPROPERTIES ( 'mongo.columns.mapping'='{"interactionId":"interactionId", "username":"interaction.interaction.author.username", "content":"interaction.interaction.content", "author_followers_count":"interaction.twitter.user.followers_count"}' ) TBLPROPERTIES ( 'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' )
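Queries against the MongoDB-backed table then read like any other HiveQL, with the storage handler fetching documents from MongoDB at run-time; a small illustrative example using the columns mapped above:

-- most-followed authors tweeting about the site
SELECT username, max(author_followers) AS followers
FROM tweet_data
GROUP BY username
ORDER BY followers DESC
LIMIT 10;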
  • 67. Demo MongoDB and the Incoming Twitter Dataset T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 68. Adding MongoDB Datasets into the ODI Repository •Define Hive table outside of ODI, using MongoDB storage handler •Select the document elements of interest, project into Hive columns •Add Hive source to Topology if needed, then use Hive RKM to bring in column metadata T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 69. Demo MongoDB Accessed through Hive T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 70. Summary : Data Loading Phase •We’ve now landed log activity from the Rittman Mead website into Hadoop, using Flume •Data arrives as Apache Webserver log files, is then loaded into a Hive table and parsed •Supplemented by social media activity (Twitter) accessed through a MongoDB database •Now we can start processing, analysing, supplementing and working with the dataset… RDBMS Imports T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) Loading Stage File Exports ✓ !!!! Processing Stage E : info@rittmanmead.com W : www.rittmanmead.com !!!! Store / Export Stage !!!! Real-Time Logs / Events File / Unstructured Imports RDBMS Exports
  • 71. Lesson 2 : Hadoop & NoSQL Data Loading using Hadoop Tools and ODI12c Mark Rittman, CTO, Rittman Mead SIOUG and HROUG Conferences, Oct 2014 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com