Deep-Dive into Big Data ETL with 
ODI12c and Oracle Big Data Connectors 
Mark Rittman, CTO, Rittman Mead 
UKOUG Tech’14 Super Sunday, December 2014 
T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or 
+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) 
E : info@rittmanmead.com 
W : www.rittmanmead.com
About the Speaker 
•Mark Rittman, Co-Founder of Rittman Mead 
•Oracle ACE Director, specialising in Oracle BI&DW 
•14 Years' Experience with Oracle Technology 
•Regular columnist for Oracle Magazine 
•Author of two Oracle Press Oracle BI books 
•Oracle Business Intelligence Developers Guide 
•Oracle Exalytics Revealed 
•Writer for Rittman Mead Blog : 
http://www.rittmanmead.com/blog 
•Email : mark.rittman@rittmanmead.com 
•Twitter : @markrittman
About Rittman Mead 
•Oracle BI and DW Gold partner 
•Winner of five UKOUG Partner of the Year awards in 2013 - including BI 
•World leading specialist partner for technical excellence, 
solutions delivery and innovation in Oracle BI 
•Approximately 80 consultants worldwide 
•All expert in Oracle BI and DW 
•Offices in US (Atlanta), Europe, Australia and India 
•Skills in broad range of supporting Oracle tools: 
‣OBIEE, OBIA 
‣ODI 
‣Essbase, Oracle OLAP 
‣GoldenGate 
‣Endeca
Traditional Data Warehouse / BI Architectures 
•Three-layer architecture - staging, foundation and access/performance 
•All three layers stored in a relational database (Oracle) 
•ETL used to move data from layer-to-layer 
[Diagram: Traditional Relational Data Warehouse - traditional structured data sources are data-loaded into the Staging layer, then into the Foundation / ODS layer, then into the Performance / Dimensional layer via ETL; a BI tool (OBIEE) with a metadata layer reads the warehouse directly, while an OLAP / in-memory tool loads data into its own database]
Introducing Hadoop 
•A new approach to data processing and data storage 
•Rather than a small number of large, powerful servers, it spreads processing over 
large numbers of small, cheap, redundant servers 
•Spreads the data you’re processing over 
lots of distributed nodes 
•Has a scheduling/workload process (Job Tracker) that sends 
parts of a job to each of the nodes 
- a bit like Oracle Parallel Execution 
•And does the processing where the data sits 
- a bit like Exadata storage servers 
•Shared-nothing architecture 
•Low-cost and highly horizontally scalable 
[Diagram: Job Tracker distributing work to Task Trackers running alongside Data Nodes across the cluster]
Hadoop Tenets : Simplified Distributed Processing 
•Hadoop, through MapReduce, breaks processing down into simple stages 
‣Map : select the columns and values you’re interested in, pass through as key/value pairs 
‣Reduce : aggregate the results 
•Most ETL jobs can be broken down into filtering, 
projecting and aggregating 
•Hadoop then automatically runs job on cluster 
‣Shared-nothing small chunks of work 
‣Run the job on the node where the data is 
‣Handle faults etc 
‣Gather the results back in 
[Diagram: Mappers (filter, project) feed Reducers (aggregate); output is one HDFS file per reducer, in a directory]
Moving Data In, Around and Out of Hadoop 
•Three stages to Hadoop ETL work, with dedicated Apache / other tools 
‣Load : receive files in batch, or in real-time (logs, events) 
‣Transform : process & transform data to answer questions 
‣Store / Export : store in structured form, or export to RDBMS using Sqoop 
[Diagram: RDBMS imports, real-time logs / events and file / unstructured imports feed the Loading stage; the Processing stage transforms the data; the Store / Export stage produces file exports and RDBMS exports]
“ETL Offloading” 
•Special use-case : offloading low-value, simple ETL work to a Hadoop cluster 
‣Receiving, aggregating, filtering and pre-processing data for an RDBMS data warehouse 
‣Potentially free up high-value Exadata / RDBMS servers for analytic work
Core Apache Hadoop Tools 
•Apache Hadoop, including MapReduce and HDFS 
‣Scaleable, fault-tolerant file storage for Hadoop 
‣Parallel programming framework for Hadoop 
•Apache Hive 
‣SQL abstraction layer over HDFS 
‣Perform set-based ETL within Hadoop 
•Apache Pig, Spark 
‣Dataflow-type languages over HDFS, Hive etc 
‣Extensible through UDFs, streaming etc 
•Apache Flume, Apache Sqoop, Apache Kafka 
‣Real-time and batch loading into HDFS 
‣Modular, fault-tolerant, wide source/target coverage 
In the Beginning, There Was … MapReduce 
•Programming model for processing large data sets in parallel on a cluster 
•Not specific to a particular language, but usually written in Java 
•Inspired by the map and reduce functions commonly used in functional programming 
‣Map() performs filtering and sorting 
‣Reduce() aggregates the output of mappers 
‣and a Shuffle() step to redistribute output by keys 
•Resolved several complications of distributed computing: 
‣Allows unlimited computations on unlimited data 
‣Map and reduce functions can be easily distributed 
‣Combined with Hadoop, very network and rack aware, 
minimising network traffic and inherently fault-tolerant 
[Diagram: Mappers (filter, project) feed Reducers (aggregate); output is one HDFS file per reducer, in a directory]
… But writing MapReduce Code is Hard 
•Typically written in Java 
•Requires programming skills (though Hadoop takes care of parallelism, fault tolerance) 
package net.pascalalma.hadoop; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Reducer; 
import java.io.IOException; 
public class AllTranslationsReducer extends Reducer<Text, Text, Text, Text> { 
private Text result = new Text(); 
@Override 
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException { 
String translations = ""; 
for (Text val : values) { 
translations += "|" + val.toString(); 
} 
result.set(translations); 
context.write(key, result); 
} 
}
Hive as the Hadoop SQL Access Layer 
•Hive can make generating MapReduce easier 
•A query environment over Hadoop/MapReduce to support SQL-like queries 
•Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically 
creates MapReduce jobs against data previously loaded into the Hive HDFS tables 
•Approach used by ODI and OBIEE 
to gain access to Hadoop data 
•Allows Hadoop data to be accessed just like 
any other data source (sort of...) 
How Hive Provides SQL Access over Hadoop 
•Hive uses an RDBMS metastore to hold 
table and column definitions in schemas 
•Hive tables then map onto HDFS-stored files (see the DDL sketch below) 
‣Managed tables 
‣External tables 
•Oracle-like query optimizer, compiler, 
executor 
•JDBC and ODBC drivers, 
plus CLI etc 
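•As a rough sketch of the difference (table names, columns and paths here are illustrative assumptions, not from the example system), a managed table lets Hive own the files it loads, while an external table just maps columns over files that already sit in HDFS: 
-- Managed table : Hive stores the data under /user/hive/warehouse/, 
-- and DROP TABLE removes both the metadata and the files 
CREATE TABLE page_views_managed ( 
view_time STRING, 
user_id STRING, 
url STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; 
-- External table : Hive stores only the metadata; the files stay where an 
-- external process (Flume, for example) wrote them, and DROP TABLE leaves them in place 
CREATE EXTERNAL TABLE page_views_external ( 
view_time STRING, 
user_id STRING, 
url STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
LOCATION '/user/flume/page_views/';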
[Diagram: Hive Driver (compile, optimize, execute) backed by a Metastore; managed tables live under /user/hive/warehouse/ (HDFS or local files loaded into the Hive HDFS area using the HiveQL CREATE TABLE / LOAD DATA commands), while external tables map onto HDFS locations such as /user/oracle/ or /user/movies/data/ (files loaded into HDFS by an external process, then mapped into Hive using the CREATE EXTERNAL TABLE command)]
Typical Hive Interactions 
•CREATE TABLE test ( 
product_id int, 
product_desc string); 
•SHOW TABLES; 
•CREATE TABLE test2 
AS SELECT * FROM test; 
•SELECT SUM(sales) 
FROM sales_summary; 
•LOAD DATA INPATH ‘/user/mrittman/logs’ INTO TABLE log_entries;
An example Hive Query Session: Connect and Display Table List 
[oracle@bigdatalite ~]$ hive 
Hive history file=/tmp/oracle/hive_job_log_oracle_201304170403_1991392312.txt 
hive> show tables; 
OK 
dwh_customer 
dwh_customer_tmp 
i_dwh_customer 
ratings 
src_customer 
src_sales_person 
weblog 
weblog_preprocessed 
weblog_sessionized 
Time taken: 2.925 seconds 
Hive Server lists out all 
“tables” that have been 
defined within the Hive 
environment
An example Hive Query Session: Display Table Row Count 
hive> select count(*) from src_customer; 
Total MapReduce jobs = 1 
Launching Job 1 out of 1 
Number of reduce tasks determined at compile time: 1 
In order to change the average load for a reducer (in bytes): 
set hive.exec.reducers.bytes.per.reducer= 
In order to limit the maximum number of reducers: 
set hive.exec.reducers.max= 
In order to set a constant number of reducers: 
set mapred.reduce.tasks= 
Starting Job = job_201303171815_0003, Tracking URL = 
http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0003 
Kill Command = /usr/lib/hadoop-0.20/bin/ 
hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003 
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0% 
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0% 
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33% 
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100% 
Ended Job = job_201303171815_0003 
OK 
25 
Time taken: 22.21 seconds 
[Slide callouts: the user requests count(*) from the table; the Hive server generates a MapReduce job to "map" the table's key/value pairs and then reduce the results to a table count; the MapReduce job is automatically run by the Hive Server; results are returned to the user]
Hive SerDes - Process Semi-Structured Data 
•Plug-in technology to Hive that allows it to parse data, and access alternatives to HDFS for 
data storage 
•Distributed as JAR file, gives Hive ability to parse semi-structured formats 
•We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns 
CREATE EXTERNAL TABLE apachelog ( 
host STRING, 
identity STRING, 
user STRING, 
time STRING, 
request STRING, 
status STRING, 
size STRING, 
referer STRING, 
agent STRING) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|[[^]]*]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") 
([^ "]*|"[^"]*"))?", 
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" 
) 
STORED AS TEXTFILE 
LOCATION '/user/root/logs';
Hive and HDFS File Storage 
•Hive tables can either map to a single HDFS file, or a directory of files 
‣Entire contents of directory becomes source for table 
•Directories can have sub-directories, to provide a "partitioning" feature for Hive (see the sketch after the file listing below) 
‣Only scan and process the subdirectories relevant to the query 
•Combined with SerDes, a useful way to process and parse lots of separate log files 
[root@cdh51-node1 ~]# hadoop fs -ls /user/flume/rm_logs/apache_access_combined 
Found 278 items 
-rw-r--r-- 3 root root 672480 2014-10-06 14:31 /user/flume/rm_logs/apache_access_combined/FlumeData.1412601698996 
-rw-r--r-- 3 root root 727711 2014-10-06 14:41 /user/flume/rm_logs/apache_access_combined/FlumeData.1412602299095 
-rw-r--r-- 3 root root 707441 2014-10-06 14:51 /user/flume/rm_logs/apache_access_combined/FlumeData.1412602915327 
-rw-r--r-- 3 root root 807375 2014-10-06 15:02 /user/flume/rm_logs/apache_access_combined/FlumeData.1412603531022 
-rw-r--r-- 3 root root 785963 2014-10-06 15:12 /user/flume/rm_logs/apache_access_combined/FlumeData.1412604138450 
-rw-r--r-- 3 root root 534005 2014-10-06 15:22 /user/flume/rm_logs/apache_access_combined/FlumeData.1412604744386 
-rw-r--r-- 3 root root 634051 2014-10-06 15:32 /user/flume/rm_logs/apache_access_combined/FlumeData.1412605344622 
-rw-r--r-- 3 root root 737031 2014-10-06 15:42 /user/flume/rm_logs/apache_access_combined/FlumeData.1412605968231 
-rw-r--r-- 3 root root 670881 2014-10-06 15:53 /user/flume/rm_logs/apache_access_combined/FlumeData.1412606584235 
-rw-r--r-- 3 root root 800607 2014-10-06 16:03 /user/flume/rm_logs/apache_access_combined/FlumeData.1412607185371 
-rw-r--r-- 3 root root 684562 2014-10-06 16:13 /user/flume/rm_logs/apache_access_combined/FlumeData.1412607794366 
-rw-r--r-- 3 root root 846410 2014-10-06 16:23 /user/flume/rm_logs/apache_access_combined/FlumeData.1412608398806 
-rw-r--r-- 3 root root 576884 2014-10-06 16:33 /user/flume/rm_logs/apache_access_combined/FlumeData.1412608999875 
-rw-r--r-- 3 root root 601540 2014-10-06 16:43 /user/flume/rm_logs/apache_access_combined/FlumeData.1412609607071 
-rw-r--r-- 3 root root 559014 2014-10-06 16:53 /user/flume/rm_logs/apache_access_combined/FlumeData.1412610215067 
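•As a sketch of how that partitioning might be declared over a log directory like the one above (the table name, columns and daily partition key are illustrative assumptions, not the exact tables used later in this session): 
CREATE EXTERNAL TABLE apachelog_partitioned ( 
host STRING, 
request STRING, 
status STRING, 
agent STRING) 
PARTITIONED BY (log_day STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
LOCATION '/user/flume/rm_logs/apache_access_combined'; 
-- register one sub-directory per day... 
ALTER TABLE apachelog_partitioned ADD PARTITION (log_day='2014-10-06') 
LOCATION '/user/flume/rm_logs/apache_access_combined/2014-10-06'; 
-- ...so that queries filtering on the partition key only scan the matching sub-directories 
SELECT status, COUNT(*) 
FROM apachelog_partitioned 
WHERE log_day = '2014-10-06' 
GROUP BY status;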
Hive Storage Handlers - Access NoSQL Databases 
•MongoDB Hadoop connector allows MongoDB to be accessed via Hive tables 
CREATE TABLE tweet_data( 
interactionId string, 
username string, 
content string, 
author_followers int) 
ROW FORMAT SERDE 
'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
'mongo.columns.mapping'='{"interactionId":"interactionId", 
"username":"interaction.interaction.author.username", 
"content":"interaction.interaction.content", 
"author_followers_count":"interaction.twitter.user.followers_count"}' 
) 
TBLPROPERTIES ( 
'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' 
) 
Hive Extensibility through UDFs and UDAFs 
•Extend Hive by adding new computation and aggregation capabilities 
•UDFs (row-based), UDAFs (aggregation) and UDTFs (table functions) 
add jar target/JsonSplit-1.0-SNAPSHOT.jar; 
create temporary function json_split 
as 'com.pythian.hive.udf.JsonSplitUDF'; 
create table json_example (json string); 
load data local inpath 'split_example.json' 
into table json_example; 
SELECT ex.* FROM json_example 
LATERAL VIEW explode(json_split(json_example.json)) ex; 
public class JsonSplitUDF extends GenericUDF { 
private StringObjectInspector stringInspector; 
@Override 
public Object evaluate(DeferredObject[] arguments) 
throws HiveException { 
try { 
String jsonString = this.stringInspector. 
getPrimitiveJavaObject(arguments[0].get()); 
ObjectMapper om = new ObjectMapper(); 
ArrayList<Object> root = (ArrayList<Object>) 
om.readValue(jsonString, ArrayList.class); 
ArrayList<Object[]> json = new ArrayList<Object[]> 
(root.size()); 
for (int i=0; i<root.size(); i++){ 
json.add(new Object[]{i, 
om.writeValueAsString(root.get(i))}); 
} 
return json;}}
Hive Extensibility through Streaming 
•TRANSFORM function streams query columns through arbitrary script 
•Use Python, Java etc to transform Hive data when UDFs etc not sufficient 
add FILE weekday_mapper.py; 
INSERT OVERWRITE TABLE u_data_new 
SELECT 
TRANSFORM (userid, movieid, rating, unixtime) 
USING 'python weekday_mapper.py' 
AS (userid, movieid, rating, weekday) 
FROM u_data; 
import sys 
import datetime 
for line in sys.stdin: 
line = line.strip() 
userid, movieid, rating, unixtime = line.split('\t') 
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday() 
print '\t'.join([userid, movieid, rating, str(weekday)])
Distributing SerDe JAR Files for Hive across Cluster 
•Hive SerDe and UDF functionality requires additional JARs to be made available to Hive 
•Following steps must be performed across ALL Hadoop nodes: 
‣Add JAR reference to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive-env.sh 
export HIVE_AUX_JARS_PATH=/usr/lib/hive/lib/hive-contrib-0.12.0-cdh5.0.1.jar:$(echo $HIVE_AUX_JARS_PATH… 
[root@bdanode1 hadoop]# ls /usr/lib/hadoop/hive-* 
/usr/lib/hadoop/hive-contrib-0.12.0-cdh5.0.1.jar 
‣Add JAR file to /usr/lib/hadoop 
‣Restart YARN / MR1 TaskTrackers across cluster
Hive Data Processing Example : Find Top Referers 
•Return the top 5 website URLs linking to the Rittman Mead website 
•Exclude links from our own website 
select referer, count(*) as cnt 
from apachelog 
where substr(referer,1,28) <> '"http://www.rittmanmead.com/' 
group by referer 
order by cnt desc 
limit 5 
How Hive Turns HiveQL into MapReduce + Hadoop Tasks 
•Two step process; first step filters and groups the data, second sorts and returns top 5 
[Screenshots: (1) the first MapReduce job filters and groups the data; (2) the second job sorts the results and returns the top 5]
SQL Considerations : Using Hive vs. Regular Oracle SQL 
•Not all join types are available in Hive - joins must be equality joins 
•No sequences, no primary keys on tables 
•Generally need to stage Oracle or other external data into Hive before joining to it 
•Hive latency - not good for small microbatch-type work 
‣But other alternatives exist - Spark, Impala etc 
•Hive is INSERT / APPEND only - no updates, deletes etc 
‣But HBase may be suitable for CRUD-type loading 
•Don’t assume that HiveQL == Oracle SQL 
‣Test assumptions before committing to platform 
Apache Pig : Set-Based Dataflow Language 
•Alternative to Hive, defines data manipulation as dataflow steps (like an execution plan) 
•Start with one or more data sources, add steps to apply filters, group, project columns 
•Generates MapReduce to execute data flow, similar to Hive; extensible through UDFs 
a = load '/user/oracle/pig_demo/marriott_wifi.txt'; 
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word; 
c = group b by word; 
d = foreach c generate COUNT(b), group; 
store d into '/user/oracle/pig_demo/pig_wordcount'; 
[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/pig_demo/pig_wordcount 
Found 2 items 
-rw-r--r-- 1 oracle oracle 0 2014-10-11 11:48 /user/oracle/pig_demo/pig_wordcount/_SUCCESS 
-rw-r--r-- 1 oracle oracle 1965 2014-10-11 11:48 /user/oracle/pig_demo/pig_wordcount/part-r-00000 
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/pig_demo/pig_wordcount/part-r-00000 
2 . 
1 I 
6 a 
... 
Apache Pig Characteristics vs. Hive 
•Ability to load data into a defined schema, or use schema-less (access fields by position) 
•Fields can contain nested fields (tuples) 
•Grouping records on a key doesn't aggregate them, it creates a nested set of rows in a column 
•Uses "lazy execution" - only evaluates the data flow once the final output has been requested 
•Makes Pig an excellent language for interactive data exploration 
Oracle’s Big Data Products 
•Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing 
‣Cloudera Distribution of Hadoop 
‣Cloudera Manager 
‣Open-source R 
‣Oracle NoSQL Database 
‣Oracle Enterprise Linux + Oracle JVM 
‣New - Oracle Big Data SQL 
•Oracle Big Data Connectors 
‣Oracle Loader for Hadoop (Hadoop > Oracle RDBMS) 
‣Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS) 
‣Oracle R Advanced Analytics for Hadoop 
‣Oracle Data Integrator 12c 
Oracle Loader for Hadoop 
•Oracle technology for accessing Hadoop data, and loading it into an Oracle database 
•Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce 
•Direct-path loads into Oracle Database, partitioned and non-partitioned 
•Online and offline loads 
•Key technology for fast load of 
Hadoop results into Oracle DB
Oracle Direct Connector for HDFS 
•Enables HDFS as a data-source for Oracle Database external tables 
•Effectively provides Oracle SQL access over HDFS 
•Supports data query, or import into Oracle DB 
•Treat HDFS-stored files in the same way as regular files 
‣But with HDFS’s low-cost 
‣… and fault-tolerance 
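•Under the covers this is a regular Oracle external table whose preprocessor streams HDFS content into the database; a minimal sketch of the sort of DDL involved is shown below (directory object, location files and columns are placeholder assumptions - in practice the connector's ExternalTable tool generates the real DDL and location files for you): 
CREATE TABLE movie_fact_ext ( 
movie_id NUMBER, 
cust_id NUMBER, 
activity NUMBER) 
ORGANIZATION EXTERNAL ( 
TYPE ORACLE_LOADER 
DEFAULT DIRECTORY movie_dir 
ACCESS PARAMETERS ( 
RECORDS DELIMITED BY NEWLINE 
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' 
FIELDS TERMINATED BY ',') 
LOCATION ('osch-location-1.xml','osch-location-2.xml')) 
REJECT LIMIT UNLIMITED; 
-- then query it, or insert-as-select into a regular Oracle table 
SELECT COUNT(*) FROM movie_fact_ext;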
Oracle R Advanced Analytics for Hadoop 
•Add-in to R that extends capability to Hadoop 
•Gives R the ability to create Map and Reduce functions 
•Extends R data frames to include Hive tables 
‣Automatically run R functions on Hadoop 
by using Hive tables as source 
Just Released - Oracle Big Data SQL 
•Part of Oracle Big Data 4.0 (BDA-only) 
‣Also requires Oracle Database 12c, Oracle Exadata Database Machine 
•Extends Oracle Data Dictionary to cover Hive 
•Extends Oracle SQL and SmartScan to Hadoop 
•Extends Oracle Security Model over Hadoop 
‣Fine-grained access control 
‣Data redaction, data masking 
[Diagram: SQL queries are issued against the Exadata Database Server; SmartScan runs on the Exadata Storage Servers and, via Oracle Big Data SQL, on the Hadoop cluster]
Bringing it All Together : Oracle Data Integrator 12c 
•ODI provides an excellent framework for running Hadoop ETL jobs 
‣ELT approach pushes transformations down to Hadoop - leveraging power of cluster 
•Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation 
‣Whilst still preserving RDBMS push-down 
‣Extensible to cover Pig, Spark etc 
•Process orchestration 
•Data quality / error handling 
•Metadata and model-driven 
ODI and Big Data Integration Example 
•In this example, we’ll show an end-to-end ETL process on Hadoop using ODI12c & BDA 
•Scenario: load webserver log data into Hadoop, process, enhance and aggregate, 
then load final summary table into Oracle Database 12c 
‣Process using Hadoop framework 
‣Leverage Big Data Connectors 
‣Metadata-based ETL development 
using ODI12c 
‣Real-world example 
BigDataLite Demonstration VM 
•Demo / Training VM downloadable from OTN 
•Contains Cloudera Hadoop + Oracle Big Data Connectors + Big Data SQL 
•Similar to setup on Oracle BDA 
•Contains OBIEE enabling technologies: 
‣Apache Hive (SQL access over Hadoop) 
‣Apache HDFS (file storage) 
‣Oracle Direct Connector for HDFS 
‣Oracle R Advanced Analytics for Hadoop 
‣Oracle Big Data SQL 
•Great way to get started with Hadoop 
‣Requires 8GB RAM, modern laptop etc 
Cloudera Distribution including Hadoop (CDH) 
•Like Linux, you can set up your Hadoop system manually, or use a distribution 
•Key Hadoop distributions include Cloudera CDH, Hortonworks HDP, MapR etc 
•Cloudera CDH is the distribution Oracle use on Big Data Appliance 
‣Provides HDFS and Hadoop framework for BDA 
‣Includes Pig, Hive, Sqoop, Oozie, HBase 
‣Cloudera Impala for real-time SQL access 
‣Cloudera Manager & Hue 
Cloudera Manager and Hue 
•Web-based tools provided with Cloudera CDH 
•Cloudera Manager used for cluster admin, 
maintenance (like Enterprise Manager) 
‣Commercial tool developed by Cloudera 
‣Not enabled by default in BigDataLite VM 
•Hue is a developer / analyst tool for 
working with Pig, Hive, Sqoop, HDFS etc 
‣Open source project included in CDH 
ETL & Data Flow through BDA System 
•Five-step process to load, transform, aggregate and filter incoming log data 
•Leverage ODI’s capabilities where possible 
•Make use of Hadoop power 
+ scalability 
[Diagram: end-to-end data flow - a Flume agent on the Apache HTTP Server forwards log entries over Flume messaging (TCP port 4545, for example) to a Flume agent on the BDA, landing as log files in HDFS; 
(1) IKM File to Hive, using a RegEx SerDe, loads them into the hive_raw_apache_access_log Hive table; 
(2) IKM Hive Control Append joins that table to the posts Hive table and loads the log_entries_and_post_detail Hive table; 
(3) a Sqoop extract stages reference data into the categories_sql_extract Hive table, which IKM Hive Control Append joins in; 
(4) IKM Hive Transform geocodes the data by streaming it through a Python script, using a geocoding IP>Country list held in a Hive table; 
(5) IKM File / Hive to Oracle bulk-unloads the summary data to the Oracle database]
ETL Considerations : Using Hive vs. Regular Oracle SQL 
•Not all join types are available in Hive - joins must be equality joins 
•No sequences, no primary keys on tables 
•Generally need to stage Oracle or other external data into Hive before joining to it 
•Hive latency - not good for small microbatch-type work 
‣But other alternatives exist - Spark, Impala etc 
•Hive is INSERT / APPEND only - no updates, deletes etc 
‣But HBase may be suitable for CRUD-type loading 
•Don’t assume that HiveQL == Oracle SQL 
‣Test assumptions before committing to platform 
Five-Step ETL Process 
1. Take the incoming log files (via Flume) and load into a structured Hive table 
2. Enhance data from that table to include details on authors, posts from other Hive tables 
3. Join to some additional ref. data held in an Oracle database, to add author details 
4. Geocode the log data, so that we have the country for each calling IP address 
5. Output the data in summary form to an Oracle database
Using Flume to Transport Log Files to BDA 
•Apache Flume is the standard way to transport log files from source through to target 
•Initial use-case was webserver log files, but can transport any file from A to B 
•Does not do data transformation, but can send to multiple targets / target types 
•Mechanisms and checks to ensure successful transport of entries 
•Has a concept of “agents”, “sinks” and “channels” 
•Agents collect and forward log data 
•Sinks store it in final destination 
•Channels store log data en-route 
•Simple configuration through INI files 
•Handled outside of ODI12c 
GoldenGate for Continuous Streaming to Hadoop 
•Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop 
•Leverages GoldenGate & HDFS / Hive Java APIs 
•Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive) 
•Likely to be formal part of GoldenGate in future release - but usable now 
•Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1 
Load Incoming Log Files into Hive Table 
•First step in process is to load the incoming log files into a Hive table 
‣Also need to parse the log entries to extract request, date, IP address etc columns 
‣Hive table can then easily be used in 
downstream transformations 
•Use IKM File to Hive (LOAD DATA) KM 
‣Source can be local files or HDFS 
‣Either load file into Hive HDFS area, 
or leave as external Hive table 
‣Ability to use SerDe to parse file data 
Using IKM File to Hive to Load Web Log File Data into Hive 
•Create mapping to load file source (single column for weblog entries) into Hive table 
•Target Hive table should have column for incoming log row, and parsed columns 
Specifying a SerDe to Parse Incoming Hive Data 
•SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats 
•Distributed as JAR file, gives Hive ability to parse semi-structured formats 
•We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns 
•Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option 
Adding Social Media Datasources to the Hadoop Dataset 
•The log activity from the Rittman Mead website tells us what happened, but not “why” 
•Common customer requirement now is to get a “360 degree view” of their activity 
‣Understand what’s being said about them 
‣External drivers for interest, activity 
‣Understand more about customer intent, opinions 
•One example is to add details of social media mentions, 
likes, tweets and retweets etc to the transactional dataset 
‣Correlate twitter activity with sales increases, drops 
‣Measure impact of social media strategy 
‣Gather and include textual, sentiment, contextual 
data from surveys, media etc 
Example : Supplement Webserver Log Activity with Twitter Data 
•Datasift provide access to the Twitter “firehose” along with Facebook data, Tumblr etc 
•Developer-friendly APIs and ability to define search terms, keywords etc 
•Pull (historical data) or Push (real-time) delivery using many formats / end-points 
‣Most commonly-used consumption format is JSON, loaded into Redis, MongoDB etc 
What is MongoDB? 
•Open-source document-store NoSQL database 
•Flexible data model, each document (record) 
can have its own JSON schema 
•Highly-scalable across multiple nodes (shards) 
•MongoDB databases made up of 
collections of documents 
‣Add new attributes to a document just by using it 
‣Single table (collection) design, no joins etc 
‣Very useful for holding JSON output from web apps 
- for example, twitter data from Datasift
Hive and MongoDB 
•MongoDB Hadoop connector provides a storage handler for Hive tables 
•Rather than store its data in HDFS, the Hive table uses MongoDB for storage instead 
•Define in SerDe properties the Collection elements you want to access, using dot notation 
•https://github.com/mongodb/mongo-hadoop 
CREATE TABLE tweet_data( 
interactionId string, 
username string, 
content string, 
author_followers int) 
ROW FORMAT SERDE 
'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
'mongo.columns.mapping'='{"interactionId":"interactionId", 
"username":"interaction.interaction.author.username", 
"content":"interaction.interaction.content", 
"author_followers_count":"interaction.twitter.user.followers_count"}' 
) 
TBLPROPERTIES ( 
'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' 
)
Demo 
MongoDB and the Incoming Twitter Dataset 
Adding MongoDB Datasets into the ODI Repository 
•Define Hive table outside of ODI, using MongoDB storage handler 
•Select the document elements of interest, project into Hive columns 
•Add Hive source to Topology if needed, then use Hive RKM to bring in column metadata 
Join to Additional Hive Tables, Transform using HiveQL 
•IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, agg. etc. 
•INSERT only, no DELETE, UPDATE etc 
•Not all ODI12c mapping operators supported, but basic functionality works OK 
•Use this KM to join to other Hive tables, 
adding more details on post, title etc 
•Perform DISTINCT on join output, load 
into summary Hive table 
Joining Hive Tables 
•Only equi-joins supported 
•Must use ANSI syntax 
•More complex joins may not produce 
valid HiveQL (subqueries etc)
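•For example, an ANSI-syntax equality join of the kind ODI generates works fine (the column names here are illustrative assumptions), whereas a range predicate cannot be used as the join condition: 
SELECT l.host, p.title, p.author 
FROM hive_raw_apache_access_log l 
JOIN posts p 
ON l.post_id = p.post_id; 
-- not valid as a Hive join condition : ON (l.ip_int BETWEEN g.range_start AND g.range_end)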
Filtering, Aggregating and Transforming Within Hive 
•Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping 
operators can be added to mapping to manipulate data 
•Generates HiveQL functions, clauses etc 
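•The HiveQL such a mapping generates is ordinary set-based SQL; a hand-written equivalent might look like this (column names are illustrative assumptions): 
SELECT category, author, COUNT(*) AS page_views 
FROM access_per_post_categories 
GROUP BY category, author 
ORDER BY page_views DESC;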
Bring in Reference Data from Oracle Database 
•In this third step, additional reference data from Oracle Database needs to be added 
•In theory, should be able to add Oracle-sourced datastores to mapping and join as usual 
•But … Oracle / JDBC-generic LKMs don't work with Hive 
Options for Importing Oracle / RDBMS Data into Hadoop 
•Could export RDBMS data to file, and load using IKM File to Hive 
•Oracle Big Data Connectors only export to Oracle, not import to Hadoop 
•Best option is to use Apache Sqoop, and new 
IKM SQL to Hive-HBase-File knowledge module 
•Hadoop-native, automatically runs in parallel 
•Uses native JDBC drivers, or OraOop (for example) 
•Bi-directional in-and-out of Hadoop to RDBMS 
•Run from OS command-line 
Loading RDBMS Data into Hive using Sqoop 
•First step is to stage Oracle data into equivalent Hive table 
•Use special LKM SQL Multi-Connect Global load knowledge module for Oracle source 
‣Passes responsibility for load (extract) to following IKM 
•Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table 
Join Oracle-Sourced Hive Table to Existing Hive Table 
•Oracle-sourced reference data in Hive can then be joined to existing Hive table as normal 
•Filters, aggregation operators etc can be added to mapping if required 
•Use IKM Hive Control Append as integration KM 
New Option - Using Oracle Big Data SQL 
•Oracle Big Data SQL provides ability for Exadata to reference Hive tables 
•Use feature to create join in Oracle, bringing across Hive data through ORACLE_HIVE table 
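•In SQL terms the Hive table is exposed to the database as an external table using the ORACLE_HIVE access driver; a minimal sketch is shown below (the table name, columns and directory object are illustrative assumptions - by default the driver maps to a Hive table of the same name, and com.oracle.bigdata access parameters can point it at a different one): 
CREATE TABLE access_per_post_categories ( 
hostname VARCHAR2(100), 
request_date VARCHAR2(100), 
post_id NUMBER, 
category VARCHAR2(100)) 
ORGANIZATION EXTERNAL ( 
TYPE ORACLE_HIVE 
DEFAULT DIRECTORY DEFAULT_DIR) 
REJECT LIMIT UNLIMITED; 
-- the Hive-backed table can then be joined to ordinary Oracle tables in one query 
SELECT a.category, c.category_owner, COUNT(*) 
FROM access_per_post_categories a 
JOIN categories c ON a.category = c.category_name 
GROUP BY a.category, c.category_owner;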
Using Hive Streaming and Python for Geocoding Data 
•Another requirement we have is to “geocode” the webserver log entries 
•Allows us to aggregate page views by country 
•Based on the fact that IP ranges can usually be attributed to specific countries 
•Not functionality normally found in Hive etc, but can be done with add-on APIs 
How GeoIP Geocoding Works 
•Uses free Geocoding API and database from Maxmind 
•Convert IP address to an integer 
•Find which integer range our IP address sits within 
•But Hive can’t use BETWEEN in a join… 
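•In standard SQL the lookup would just be a range join against the Maxmind ranges table (table and column names below are illustrative assumptions), and it is exactly this non-equality join condition that Hive rejects - hence the streaming approach that follows: 
-- the IP address is first converted to an integer : a.b.c.d -> a*16777216 + b*65536 + c*256 + d 
-- what we would like to write, but Hive cannot execute as a join: 
SELECT l.host, g.country_name 
FROM apachelog l 
JOIN geoip_country_ranges g 
ON (l.ip_as_integer BETWEEN g.range_start AND g.range_end);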
Solution : IKM Hive Transform 
•IKM Hive Transform can pass the output of a Hive SELECT statement through 
a perl, python, shell etc script to transform content 
•Uses Hive TRANSFORM … USING … AS functionality 
hive> add file file:///tmp/add_countries.py; 
Added resource: file:///tmp/add_countries.py 
hive> select transform (hostname,request_date,post_id,title,author,category) 
> using 'add_countries.py' 
> as (hostname,request_date,post_id,title,author,category,country) 
> from access_per_post_categories; 
Creating the Python Script for Hive Streaming 
•Solution requires a Python API to be installed on all Hadoop nodes, along with geocode DB 
wget --no-check-certificate https://raw.github.com/pypa/pip/master/contrib/get-pip.py 
python get-pip.py 
pip install pygeoip 
•Python script then parses incoming stdin lines using tab-separation of fields, outputs same 
(but with extra field for the country) 
#!/usr/bin/python 
import sys 
sys.path.append('/usr/lib/python2.6/site-packages/') 
import pygeoip 
gi = pygeoip.GeoIP('/tmp/GeoIP.dat') 
for line in sys.stdin: 
line = line.rstrip() 
hostname,request_date,post_id,title,author,category = line.split('\t') 
country = gi.country_name_by_addr(hostname) 
print hostname+'\t'+request_date+'\t'+post_id+'\t'+title+'\t'+author+'\t'+country+'\t'+category
Setting up the Mapping 
•Map source Hive table to target, which includes an extra "country" column 
•Copy script + GeoIP.dat file to every node’s /tmp directory 
•Ensure all Python APIs and libraries are installed on each Hadoop node
Configuring IKM Hive Transform 
•TRANSFORM_SCRIPT_NAME specifies name of 
script, and path to script 
•TRANSFORM_SCRIPT has issues with parsing; 
do not use, leave blank and KM will use existing one 
•Optional ability to specify sort and distribution 
columns (can be compound) 
•Leave other options at default 
Executing the Mapping 
•KM automatically registers the script with Hive (which caches it on all nodes) 
•HiveQL output then runs the contents of the first Hive table through the script, outputting 
results to target table
Bulk Unload Summary Data to Oracle Database 
•Final requirement is to unload final Hive table contents to Oracle Database 
•Several use-cases for this: 
•Use Hadoop / BDA for ETL offloading 
•Use analysis capabilities of BDA, but then output results to RDBMS data mart or DW 
•Permit use of more advanced SQL query tools 
•Share results with other applications 
•Can use Sqoop for this, or use Oracle Big Data Connectors 
•Fast bulk unload, or transparent Oracle access to Hive 
IKM File/Hive to Oracle (OLH/ODCH) 
•KM for accessing HDFS/Hive data from Oracle 
•Either sets up ODCH connectivity, or bulk-unloads via OLH 
•Map from HDFS or Hive source to Oracle tables (via Oracle technology in Topology) 
Configuring the KM Physical Settings 
•For the access table in Physical view, change LKM to LKM SQL Multi-Connect 
•Delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect 
IKM such as IKM File/Hive to Oracle 
Create Package to Sequence ETL Steps 
•Define package (or load plan) within ODI12c to orchestrate the process 
•Call package / load plan execution from command-line, web service call, or schedule 
Execute Overall Package 
•Each step executed in sequence 
•End-to-end ETL process, using ODI12c’s metadata-driven development process, 
data quality handling, heterogeneous connectivity, but Hadoop-native processing
Coming Soon… 
Going beyond Hive, Pig & MapReduce 
& Oracle’s New Big Data Products 
Hadoop 2.0 : YARN, Spark and Tez 
•Hadoop 2.0 breaks the link between Hadoop and MapReduce 
•Separates out resource management from job scheduling 
•Introduces a new component - YARN 
‣"Yet Another Resource Negotiator" 
‣Backwards-compatible with Hadoop 1.0 
•Makes Hadoop and YARN more of an “OS” 
•Makes it possible to run other processing 
types on Hadoop 
‣For example, Apache Spark, Apache Tez 
•Now used in CDH5, Hadoop 2.0 etc 
Apache Tez 
•Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc 
•Models processing as an entire data flow graph (DAG), rather than separate job steps 
‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems 
‣Dataflow steps pass data between them as streams, rather than writing/reading from disk 
•Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez 
•Favoured in-memory / Hive v2 route by Hortonworks 
[Diagram: execution plan comparison - Pig/Hive on MapReduce vs. Pig/Hive on Tez]
Apache Spark 
•Another DAG execution engine running on YARN 
•More mature than Tez, with a richer API and more vendor support 
•Uses concept of an RDD (Resilient Distributed Dataset) 
‣RDDs like tables or Pig relations, but can be cached in-memory 
‣Great for in-memory transformations, or iterative/cyclic processes 
•Spark jobs comprise a DAG of tasks operating on RDDs 
•Access through Scala, Python or Java APIs 
•Related projects include 
‣Spark SQL 
‣Spark Streaming
Apache Spark Example : Simple Log Analysis 
scala> val logfile = sc.textFile("logs/access_log") 
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) called with curMem=234759, maxMem=309225062 
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 75.5 KB, free 294.6 MB) 
logfile: org.apache.spark.rdd.RDD[String] = MappedRDD[31] at textFile at <console>:15 
scala> logfile.count() 
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1 
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1 
... 
14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s 
res7: Long = 154563 
•Load logfile into RDD, do row count 
•Load logfile into RDD and cache it, create another RDD from it filtered on /biapps11g/ 
scala> val logfile = sc.textFile("logs/access_log").cache 
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/")) 
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17 
scala> biapps11g.count() 
... 
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s 
res9: Long = 403
Apache Spark Example : Simple Log Analysis 
•Import a log parsing library, then use it to generate a list of URIs creating 404 errors 
scala> import com.alvinalexander.accesslogparser._ 
val p = new AccessLogParser 
def getStatusCode(line: Option[AccessLogRecord]) = { 
line match { 
case Some(l) => l.httpStatusCode 
case None => "0" 
} 
} 
def getRequest(rawAccessLogString: String): Option[String] = { 
val accessLogRecordOption = p.parseRecord(rawAccessLogString) 
accessLogRecordOption match { 
case Some(rec) => Some(rec.request) 
case None => None 
} 
} 
def extractUriFromRequest(requestField: String) = requestField.split(" ")(1) 
log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).count 
val recs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)) 
val distinctRecs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404") 
.map(getRequest(_)) 
.collect { case Some(requestField) => requestField } 
.map(extractUriFromRequest(_)) 
.distinct 
distinctRecs.count 
distinctRecs.foreach(println)
Coming Soon : Oracle Big Data Discovery 
•Combining of Endeca Server search, analysis and visualisation capabilities 
with Apache Spark data munging and transformation 
‣Analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based 
transformation engine 
‣Create a catalog of the data on 
your Hadoop cluster, then search 
that catalog using Endeca Server 
‣Create recommendations of other 
datasets, based on what 
you’re looking at now 
‣Visualize your datasets, 
discover new insights 
Coming Soon : Oracle Data Enrichment Cloud Service 
•Cloud-based service for loading, enriching, cleansing and supplementing Hadoop data 
•Part of the Oracle Data Integration product family 
•Used up-stream from Big Data Discovery 
•Aims to solve the “data quality problem” for Hadoop 
Conclusions 
•Hadoop, and the Oracle Big Data Appliance, is an excellent platform for data capture, 
analysis and processing 
•Hadoop tools such as Hive, Sqoop, MapReduce and Pig provide means to process and 
analyse data in parallel, using languages and approaches familiar to Oracle developers 
•ODI12c provides several benefits when working with ETL and data loading on Hadoop 
‣Metadata-driven design; data quality handling; KMs to handle technical complexity 
•Oracle Data Integrator Adapter for Hadoop provides several KMs for Hadoop sources 
•In this presentation, we’ve seen an end-to-end example of big data ETL using ODI 
‣The power of Hadoop and BDA, with the ETL orchestration of ODI12c
Thank You for Attending! 
•Thank you for attending this presentation, and more information can be found at 
http://www.rittmanmead.com 
•Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com 
•Look out for our book, “Oracle Business Intelligence Developers Guide” out now! 
•Follow-us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)

  • 7. Moving Data In, Around and Out of Hadoop •Three stages to Hadoop ETL work, with dedicated Apache / other tools ‣Load : receive files in batch, or in real-time (logs, events) ‣Transform : process & transform data to answer questions ‣Store / Export : store in structured form, or export to RDBMS using Sqoop RDBMS Imports T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) Loading Stage Processing Stage E : info@rittmanmead.com W : www.rittmanmead.com Store / Export Stage Real-Time Logs / Events File / Unstructured Imports File Exports RDBMS Exports
  • 8. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com “ETL Offloading” •Special use-case : offloading low-value, simple ETL work to a Hadoop cluster ‣Receiving, aggregating, filtering and pre-processing data for an RDBMS data warehouse ‣Potentially free-up high-value Exadata / RBDMS servers for analytic work
  • 9. Core Apache Hadoop Tools •Apache Hadoop, including MapReduce and HDFS ‣Scaleable, fault-tolerant file storage for Hadoop ‣Parallel programming framework for Hadoop •Apache Hive ‣SQL abstraction layer over HDFS ‣Perform set-based ETL within Hadoop •Apache Pig, Spark ‣Dataflow-type languages over HDFS, Hive etc ‣Extensible through UDFs, streaming etc •Apache Flume, Apache Sqoop, Apache Kafka ‣Real-time and batch loading into HDFS ‣Modular, fault-tolerant, wide source/target coverage T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 10. In the Beginning, There Was … MapReduce •Programming model for processing large data sets in parallel on a cluster •Not specific to a particular language, but usually written in Java •Inspired by the map and reduce functions commonly used in functional programming ‣Map() performs filtering and sorting ‣Reduce() aggregates the output of mappers ‣and a Shuffle() step to redistribute output by keys •Resolved several complications of distributed computing: ‣Allows unlimited computations on unlimited data ‣Map and reduce functions can be easily distributed ‣Combined with Hadoop, very network and rack aware, minimising network traffic and inherently fault-tolerant T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Mapper Filter, Project Mapper Filter, Project Mapper Filter, Project Reducer Aggregate Reducer Aggregate Output One HDFS file per reducer, in a directory
  • 11. … But writing MapReduce Code is Hard •Typically written in Java •Requires programming skills (though Hadoop takes care of parallelism, fault tolerance) package net.pascalalma.hadoop; T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; import java.io.IOException; public class AllTranslationsReducer extends Reducer<Text, Text, Text, Text> { private Text result = new Text(); @Override protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException { String translations = ""; for (Text val : values) { translations += "|" + val.toString(); } result.set(translations); context.write(key, result); } }
  • 12. Hive as the Hadoop SQL Access Layer •Hive can make generating MapReduce easier •A query environment over Hadoop/MapReduce to support SQL-like queries •Hive server accepts HiveQL queries via HiveODBC or HiveJDBC, automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables •Approach used by ODI and OBIEE to gain access to Hadoop data •Allows Hadoop data to be accessed just like any other data source (sort of...) T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 13. How Hive Provides SQL Access over Hadoop •Hive uses a RBDMS metastore to hold table and column definitions in schemas •Hive tables then map onto HDFS-stored files ‣Managed tables ‣External tables •Oracle-like query optimizer, compiler, executor •JDBC and OBDC drivers, plus CLI etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Hive Driver (Compile Optimize, Execute) Managed Tables /user/hive/warehouse/ External Tables /user/oracle/ /user/movies/data/ HDFS HDFS or local files loaded into Hive HDFS area, using HiveQL CREATE TABLE command HDFS files loaded into HDFS using external process, then mapped into Hive using CREATE EXTERNAL TABLE command Metastore
  • 14. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Typical Hive Interactions •CREATE TABLE test ( product_id int, product_desc string); •SHOW TABLES; •CREATE TABLE test2 AS SELECT * FROM test; •SELECT SUM(sales) FROM sales_summary; •LOAD DATA INPATH ‘/user/mrittman/logs’ INTO TABLE log_entries;
  • 15. An example Hive Query Session: Connect and Display Table List [oracle@bigdatalite ~]$ hive Hive history file=/tmp/oracle/hive_job_log_oracle_201304170403_1991392312.txt hive> show tables; OK dwh_customer dwh_customer_tmp i_dwh_customer ratings src_customer src_sales_person weblog weblog_preprocessed weblog_sessionized Time taken: 2.925 seconds T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Hive Server lists out all “tables” that have been defined within the Hive environment
  • 16. An example Hive Query Session: Display Table Row Count T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com hive> select count(*) from src_customer; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer= In order to limit the maximum number of reducers: set hive.exec.reducers.max= In order to set a constant number of reducers: set mapred.reduce.tasks= Starting Job = job_201303171815_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0003 Kill Command = /usr/lib/hadoop-0.20/bin/ hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003 2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0% 2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0% 2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33% 2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100% Ended Job = job_201303171815_0003 OK 25 Time taken: 22.21 seconds Request count(*) from table Hive server generates MapReduce job to “map” table key/value pairs, and then reduce the results to table count MapReduce job automatically run by Hive Server Results returned to user
  • 17. Hive SerDes - Process Semi-Structured Data •Plug-in technology to Hive that allows it to parse data, and access alternatives to HDFS for data storage •Distributed as JAR file, gives Hive ability to parse semi-structured formats •We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns CREATE EXTERNAL TABLE apachelog ( T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|[[^]]*]) ([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" ) STORED AS TEXTFILE LOCATION '/user/root/logs';
  • 18. Hive and HDFS File Storage •Hive tables can either map to a single HDFS file, or a directory of files ‣Entire contents of directory becomes source for table •Directories can have sub-directories, to provide “partitioning” feature for Hive ‣Only scan and process those subdirectories relevant to query •Combined with SerDes, a useful way to process and parse lots of separate log files [root@cdh51-node1 ~]# hadoop fs -ls /user/flume/rm_logs/apache_access_combined Found 278 items -rw-r--r-- 3 root root 672480 2014-10-06 14:31 /user/flume/rm_logs/apache_access_combined/FlumeData.1412601698996 -rw-r--r-- 3 root root 727711 2014-10-06 14:41 /user/flume/rm_logs/apache_access_combined/FlumeData.1412602299095 -rw-r--r-- 3 root root 707441 2014-10-06 14:51 /user/flume/rm_logs/apache_access_combined/FlumeData.1412602915327 -rw-r--r-- 3 root root 807375 2014-10-06 15:02 /user/flume/rm_logs/apache_access_combined/FlumeData.1412603531022 -rw-r--r-- 3 root root 785963 2014-10-06 15:12 /user/flume/rm_logs/apache_access_combined/FlumeData.1412604138450 -rw-r--r-- 3 root root 534005 2014-10-06 15:22 /user/flume/rm_logs/apache_access_combined/FlumeData.1412604744386 -rw-r--r-- 3 root root 634051 2014-10-06 15:32 /user/flume/rm_logs/apache_access_combined/FlumeData.1412605344622 -rw-r--r-- 3 root root 737031 2014-10-06 15:42 /user/flume/rm_logs/apache_access_combined/FlumeData.1412605968231 -rw-r--r-- 3 root root 670881 2014-10-06 15:53 /user/flume/rm_logs/apache_access_combined/FlumeData.1412606584235 -rw-r--r-- 3 root root 800607 2014-10-06 16:03 /user/flume/rm_logs/apache_access_combined/FlumeData.1412607185371 -rw-r--r-- 3 root root 684562 2014-10-06 16:13 /user/flume/rm_logs/apache_access_combined/FlumeData.1412607794366 -rw-r--r-- 3 root root 846410 2014-10-06 16:23 /user/flume/rm_logs/apache_access_combined/FlumeData.1412608398806 -rw-r--r-- 3 root root 576884 2014-10-06 16:33 /user/flume/rm_logs/apache_access_combined/FlumeData.1412608999875 -rw-r--r-- 3 root root 601540 2014-10-06 16:43 /user/flume/rm_logs/apache_access_combined/FlumeData.1412609607071 -rw-r--r-- 3 root root 559014 2014-10-06 16:53 /user/flume/rm_logs/apache_access_combined/FlumeData.1412610215067 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 19. Hive Storage Handlers - Access NoSQL Databases •MongoDB Hadoop connector allows MongoDB to be accessed via Hive tables T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) CREATE TABLE tweet_data( interactionId string, username string, content string, author_followers int) ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe' STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' WITH SERDEPROPERTIES ( 'mongo.columns.mapping'='{"interactionId":"interactionId", "username":"interaction.interaction.author.username", "content":"interaction.interaction.content", "author_followers_count":"interaction.twitter.user.followers_ count"}' ) TBLPROPERTIES ( 'mongo.uri'='mongodb://cdh51-node1:27017/ datasiftmongodb.rm_tweets' ) E : info@rittmanmead.com W : www.rittmanmead.com
  • 20. Hive Extensibility through UDFs and UDAFs •Extend Hive by adding new computation and aggregation capabilities •UDFs (row-based), UDAFs (aggregation) and UDTFs (table functions) T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) add jar target/JsonSplit-1.0-SNAPSHOT.jar; create temporary function json_split as 'com.pythian.hive.udf.JsonSplitUDF'; create table json_example (json string); load data local inpath 'split_example.json' into table json_example; SELECT ex.* FROM json_example LATERAL VIEW explode(json_split(json_example.json)) ex; E : info@rittmanmead.com W : www.rittmanmead.com public class JsonSplitUDF extends GenericUDF { private StringObjectInspector stringInspector; @Override public Object evaluate(DeferredObject[] arguments) throws HiveException { try { String jsonString = this.stringInspector. getPrimitiveJavaObject(arguments[0].get()); ObjectMapper om = new ObjectMapper(); ArrayList<Object> root = (ArrayList<Object>) om.readValue(jsonString, ArrayList.class); ArrayList<Object[]> json = new ArrayList<Object[]> (root.size()); for (int i=0; i<root.size(); i++){ json.add(new Object[]{i, om.writeValueAsString(root.get(i))}); } return json;}}
  • 21. Hive Extensibility through Streaming •TRANSFORM function streams query columns through arbitrary script •Use Python, Java etc to transform Hive data when UDFs etc not sufficient T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) add FILE weekday_mapper.py; INSERT OVERWRITE TABLE u_data_new SELECT TRANSFORM (userid, movieid, rating, unixtime) USING 'python weekday_mapper.py' AS (userid, movieid, rating, weekday) FROM u_data; E : info@rittmanmead.com W : www.rittmanmead.com import sys import datetime for line in sys.stdin: line = line.strip() userid, movieid, rating, unixtime = line.split('t') weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday() print 't'.join([userid, movieid, rating, str(weekday)])
  • 22. Distributing SerDe JAR Files for Hive across Cluster •Hive SerDe and UDF functionality requires additional JARs to be made available to Hive •Following steps must be performed across ALL Hadoop nodes: ‣Add JAR reference to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive.env.sh export HIVE_AUX_JARS_PATH=/usr/lib/hive/lib/hive-contrib-0.12.0-cdh5.0.1.jar:$ (echo $HIVE_AUX_JARS_PATH… [root@bdanode1 hadoop]# ls /usr/lib/hadoop/hive-* /usr/lib/hadoop/hive-contrib-0.12.0-cdh5.0.1.jar T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com ‣Add JAR file to /usr/lib/hadoop ‣Restart YARN / MR1 TaskTrackers across cluster
  • 23. Hive Data Processing Example : Find Top Referers •Return the top 5 website URLs linking to the Rittman Mead website •Exclude links from our own website select referer, count(*) as cnt from apachelog where substr(referer,1,28) <> '"http://www.rittmanmead.com/' group by referer order by cnt desc limit 5 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 24. How Hive Turns HiveQL into MapReduce + Hadoop Tasks •Two step process; first step filters and groups the data, second sorts and returns top 5 1 2 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 25. SQL Considerations : Using Hive vs. Regular Oracle SQL •Not all join types are available in Hive - joins must be equality joins •No sequences, no primary keys on tables •Generally need to stage Oracle or other external data into Hive before joining to it •Hive latency - not good for small microbatch-type work ‣But other alternatives exist - Spark, Impala etc •Hive is INSERT / APPEND only - no updates, deletes etc ‣But HBase may be suitable for CRUD-type loading •Don’t assume that HiveQL == Oracle SQL ‣Test assumptions before committing to platform T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com vs.
  • 26. Apache Pig : Set-Based Dataflow Language •Alternative to Hive, defines data manipulation as dataflow steps (like an execution plan) •Start with one or more data sources, add steps to apply filters, group, project columns •Generates MapReduce to execute data flow, similar to Hive; extensible through UDFs a = load '/user/oracle/pig_demo/marriott_wifi.txt'; b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word; c = group b by word; d = foreach c generate COUNT(b), group; store d into '/user/oracle/pig_demo/pig_wordcount'; [oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/pig_demo/pig_wordcount Found 2 items -rw-r--r-- 1 oracle oracle 0 2014-10-11 11:48 /user/oracle/pig_demo/pig_wordcount/_SUCCESS -rw-r--r-- 1 oracle oracle 1965 2014-10-11 11:48 /user/oracle/pig_demo/pig_wordcount/part-r-00000 [oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/pig_demo/pig_wordcount/part-r-00000 2 . 1 I 6 a ... T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com 2 1 3
  • 27. Apache Pig Characteristics vs. Hive •Ability to load data into a defined schema, or use schema-less (access fields by position) •Fields can contain nested fields (tuples) •Grouping records on a key doesn’t aggregate them, it creates a nested set of rows in column •Uses “lazy execution” - only evaluates data flow once final output has been requests •Makes Pig an excellent language for interactive data exploration T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com vs.
  • 28. Oracle’s Big Data Products •Oracle Big Data Appliance - Engineered System for Big Data Acquisition and Processing ‣Cloudera Distribution of Hadoop ‣Cloudera Manager ‣Open-source R ‣Oracle NoSQL Database ‣Oracle Enterprise Linux + Oracle JVM ‣New - Oracle Big Data SQL •Oracle Big Data Connectors ‣Oracle Loader for Hadoop (Hadoop > Oracle RBDMS) ‣Oracle Direct Connector for HDFS (HFDS > Oracle RBDMS) ‣Oracle R Advanced Analytics for Hadoop ‣Oracle Data Integrator 12c T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 29. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Oracle Loader for Hadoop •Oracle technology for accessing Hadoop data, and loading it into an Oracle database •Pushes data transformation, “heavy lifting” to the Hadoop cluster, using MapReduce •Direct-path loads into Oracle Database, partitioned and non-partitioned •Online and offline loads •Key technology for fast load of Hadoop results into Oracle DB
  • 30. Oracle Direct Connector for HDFS •Enables HDFS as a data-source for Oracle Database external tables •Effectively provides Oracle SQL access over HDFS •Supports data query, or import into Oracle DB •Treat HDFS-stored files in the same way as regular files ‣But with HDFS’s low-cost ‣… and fault-tolerance T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 31. Oracle R Advanced Analytics for Hadoop •Add-in to R that extends capability to Hadoop •Gives R the ability to create Map and Reduce functions •Extends R data frames to include Hive tables ‣Automatically run R functions on Hadoop by using Hive tables as source T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 32. Just Released - Oracle Big Data SQL •Part of Oracle Big Data 4.0 (BDA-only) ‣Also requires Oracle Database 12c, Oracle Exadata Database Machine •Extends Oracle Data Dictionary to cover Hive •Extends Oracle SQL and SmartScan to Hadoop •Extends Oracle Security Model over Hadoop ‣Fine-grained access control ‣Data redaction, data masking T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) Exadata Storage Servers E : info@rittmanmead.com W : www.rittmanmead.com Exadata Database Server Hadoop Cluster Oracle Big Data SQL SQL Queries SmartScan SmartScan
  • 33. Bringing it All Together : Oracle Data Integrator 12c •ODI provides an excellent framework for running Hadoop ETL jobs ‣ELT approach pushes transformations down to Hadoop - leveraging power of cluster •Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation ‣Whilst still preserving RDBMS push-down ‣Extensible to cover Pig, Spark etc •Process orchestration •Data quality / error handling •Metadata and model-driven T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 34. ODI and Big Data Integration Example •In this example, we’ll show an end-to-end ETL process on Hadoop using ODI12c & BDA •Scenario: load webserver log data into Hadoop, process enhance and aggregate, then load final summary table into Oracle Database 12c ‣Process using Hadoop framework ‣Leverage Big Data Connectors ‣Metadata-based ETL development using ODI12c ‣Real-world example T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 35. BigDataLite Demonstration VM •Demo / Training VM downloadable from OTN •Contains Cloudera Hadoop + Oracle Big Data Connectors + Big Data SQL •Similar to setup on Oracle BDA •Contains OBIEE enabling technologies: ‣Apache Hive (SQL access over Hadoop) ‣Apache HDFS (file storage) ‣Oracle Direct Connector for HDFS ‣Oracle R Advanced Analytics for Hadoop ‣Oracle Big Data SQL •Great way to get started with Hadoop ‣Requires 8GB RAM, modern laptop etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 36. Cloudera Distribution including Hadoop (CDH) •Like Linux, you can set up your Hadoop system manually, or use a distribution •Key Hadoop distributions include Cloudera CDH, Hortonworks HDP, MapR etc •Cloudera CDH is the distribution Oracle use on Big Data Appliance ‣Provides HDFS and Hadoop framework for BDA ‣Includes Pig, Hive, Sqoop, Oozie, HBase ‣Cloudera Impala for real-time SQL access ‣Cloudera Manager & Hue T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 37. Cloudera Manager and Hue •Web-based tools provided with Cloudera CDH •Cloudera Manager used for cluster admin, maintenance (like Enterprise Manager ‣Commercial tool developed by Cloudera ‣Not enabled by default in BigDataLite VM •Hue is a developer / analyst tool for working with Pig, Hive, Sqoop, HDFS etc ‣Open source project included in CDH T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 38. ETL & Data Flow through BDA System •Five-step process to load, transform, aggregate and filter incoming log data •Leverage ODI’s capabilities where possible •Make use of Hadoop power + scalability Flume Agent T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) posts (Hive Table) IKM Hive Control Append (Hive table join & load into target hive table) Sqoop extract categories_sql_ extract (Hive Table) E : info@rittmanmead.com W : www.rittmanmead.com hive_raw_apache_ access_log (Hive Table) Flume Agent Apache HTTP Server Log Files (HDFS) Flume Messaging TCP Port 4545 (example) IKM File to Hive 1 using RegEx SerDe log_entries_ and post_detail (Hive Table) IKM Hive Control Append (Hive table join & load into target hive table) hive_raw_apache_ access_log (Hive Table) 2 3 Geocoding IP>Country list (Hive Table) IKM Hive Transform (Hive streaming through Python script) 4 5 hive_raw_apache_ access_log (Hive Table) IKM File / Hive to Oracle (bulk unload to Oracle DB)
  • 39. ETL Considerations : Using Hive vs. Regular Oracle SQL •Not all join types are available in Hive - joins must be equality joins •No sequences, no primary keys on tables •Generally need to stage Oracle or other external data into Hive before joining to it •Hive latency - not good for small microbatch-type work ‣But other alternatives exist - Spark, Impala etc •Hive is INSERT / APPEND only - no updates, deletes etc ‣But HBase may be suitable for CRUD-type loading •Don’t assume that HiveQL == Oracle SQL ‣Test assumptions before committing to platform T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com vs.
  • 40. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Five-Step ETL Process 1. Take the incoming log files (via Flume) and load into a structured Hive table 2. Enhance data from that table to include details on authors, posts from other Hive tables 3. Join to some additional ref. data held in an Oracle database, to add author details 4. Geocode the log data, so that we have the country for each calling IP address 5. Output the data in summary form to an Oracle database
  • 41. Using Flume to Transport Log Files to BDA •Apache Flume is the standard way to transport log files from source through to target •Initial use-case was webserver log files, but can transport any file from A>B •Does not do data transformation, but can send to multiple targets / target types •Mechanisms and checks to ensure successful transport of entries •Has a concept of “agents”, “sinks” and “channels” •Agents collect and forward log data •Sinks store it in final destination •Channels store log data en-route •Simple configuration through INI files •Handled outside of ODI12c T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 42. GoldenGate for Continuous Streaming to Hadoop •Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop •Leverages GoldenGate & HDFS / Hive Java APIs •Sample Implementations on MOS Doc.ID 1586210.1 (HDFS) and 1586188.1 (Hive) •Likely to be formal part of GoldenGate in future release - but usable now •Can also integrate with Flume for delivery to HDFS - see MOS Doc.ID 1926867.1 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 43. Load Incoming Log Files into Hive Table •First step in process is to load the incoming log files into a Hive table ‣Also need to parse the log entries to extract request, date, IP address etc columns ‣Hive table can then easily be used in downstream transformations •Use IKM File to Hive (LOAD DATA) KM ‣Source can be local files or HDFS ‣Either load file into Hive HDFS area, or leave as external Hive table ‣Ability to use SerDe to parse file data T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com 1
  • 44. Using IKM File to Hive to Load Web Log File Data into Hive •Create mapping to load file source (single column for weblog entries) into Hive table •Target Hive table should have column for incoming log row, and parsed columns T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 45. Specifying a SerDe to Parse Incoming Hive Data •SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats •Distributed as JAR file, gives Hive ability to parse semi-structured formats •We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns •Enabled through OVERRIDE_ROW_FORMAT IKM File to Hive (LOAD DATA) KM option T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 46. Adding Social Media Datasources to the Hadoop Dataset •The log activity from the Rittman Mead website tells us what happened, but not “why” •Common customer requirement now is to get a “360 degree view” of their activity ‣Understand what’s being said about them ‣External drivers for interest, activity ‣Understand more about customer intent, opinions •One example is to add details of social media mentions, likes, tweets and retweets etc to the transactional dataset ‣Correlate twitter activity with sales increases, drops ‣Measure impact of social media strategy ‣Gather and include textual, sentiment, contextual data from surveys, media etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 47. Example : Supplement Webserver Log Activity with Twitter Data •Datasift provide access to the Twitter “firehose” along with Facebook data, Tumblr etc •Developer-friendly APIs and ability to define search terms, keywords etc •Pull (historical data) or Push (real-time) delivery using many formats / end-points ‣Most commonly-used consumption format is JSON, loaded into Redis, MongoDB etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 48. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com What is MongoDB? •Open-source document-store NoSQL database •Flexible data model, each document (record) can have its own JSON schema •Highly-scalable across multiple nodes (shards) •MongoDB databases made up of collections of documents ‣Add new attributes to a document just by using it ‣Single table (collection) design, no joins etc ‣Very useful for holding JSON output from web apps - for example, twitter data from Datasift
  • 49. T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com Hive and MongoDB •MongoDB Hadoop connector provides a storage handler for Hive tables •Rather than store its data in HDFS, the Hive table uses MongoDB for storage instead •Define in SerDe properties the Collection elements you want to access, using dot notation •https://github.com/mongodb/mongo-hadoop CREATE TABLE tweet_data( interactionId string, username string, content string, author_followers int) ROW FORMAT SERDE 'com.mongodb.hadoop.hive.BSONSerDe' STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' WITH SERDEPROPERTIES ( 'mongo.columns.mapping'='{"interactionId":"interactionId", "username":"interaction.interaction.author.username", "content":"interaction.interaction.content", "author_followers_count":"interaction.twitter.user.followers_count"}' ) TBLPROPERTIES ( 'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' )
  • 50. Demo MongoDB and the Incoming Twitter Dataset T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 51. Adding MongoDB Datasets into the ODI Repository •Define Hive table outside of ODI, using MongoDB storage handler •Select the document elements of interest, project into Hive columns •Add Hive source to Topology if needed, then use Hive RKM to bring in column metadata T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 52. Join to Additional Hive Tables, Transform using HiveQL •IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, aggregation etc. •INSERT only, no DELETE, UPDATE etc •Not all ODI12c mapping operators are supported, but basic functionality works OK •Use this KM to join to other Hive tables, adding more details on post, title etc •Perform a DISTINCT on the join output, load into a summary Hive table
  • 53. Joining Hive Tables •Only equi-joins supported •Must use ANSI syntax •More complex joins may not produce valid HiveQL (subqueries etc)
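To make this concrete, here is a minimal sketch of the style of HiveQL the IKM generates - an INSERT...SELECT over an ANSI-syntax equi-join with a DISTINCT - wrapped in PyHive so it can be tried outside ODI. The table and column names are hypothetical illustrations, and PyHive connectivity to HiveServer2 is assumed.
# Minimal sketch: the kind of HiveQL IKM Hive to Hive Control Append produces,
# submitted via PyHive. Table/column names below are hypothetical.
from pyhive import hive

conn = hive.connect(host='cdh51-node1', port=10000)
cur = conn.cursor()
cur.execute("""
    INSERT INTO TABLE tweet_posts_summary
    SELECT DISTINCT t.username, t.content, p.title, p.author
    FROM   tweet_data t
    JOIN   posts p ON t.post_id = p.post_id
""")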
  • 54. Filtering, Aggregating and Transforming Within Hive •Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping operators can be added to mapping to manipulate data •Generates HiveQL functions, clauses etc T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 55. Bring in Reference Data from Oracle Database •In this third step, additional reference data from the Oracle Database needs to be added •In theory, we should be able to add Oracle-sourced datastores to the mapping and join as usual •But … Oracle / JDBC-generic LKMs don't work with Hive
  • 56. Options for Importing Oracle / RDBMS Data into Hadoop •Could export RDBMS data to file, and load using IKM File to Hive •Oracle Big Data Connectors only export to Oracle, not import to Hadoop •Best option is to use Apache Sqoop, and the new IKM SQL to Hive-HBase-File knowledge module •Hadoop-native, automatically runs in parallel •Uses native JDBC drivers, or OraOop (for example) •Bi-directional in-and-out of Hadoop to RDBMS •Run from the OS command-line
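For comparison with what the knowledge modules generate, a minimal sketch of the equivalent hand-written Sqoop import, driven from Python. The JDBC URL, credentials and table names are hypothetical.
# Minimal sketch: a Sqoop import of an Oracle reference table into Hive.
# Connection details and table names are hypothetical.
import subprocess

subprocess.check_call([
    'sqoop', 'import',
    '--connect', 'jdbc:oracle:thin:@ora-db:1521/pdb1',
    '--username', 'blog_refdata', '--password', 'welcome1',
    '--table', 'POST_CATEGORIES',     # Oracle source table
    '--hive-import',                  # create/load an equivalent Hive table
    '--hive-table', 'post_categories',
    '--num-mappers', '4'              # parallel map tasks
])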
  • 57. Loading RDBMS Data into Hive using Sqoop •First step is to stage Oracle data into equivalent Hive table •Use special LKM SQL Multi-Connect Global load knowledge module for Oracle source ‣Passes responsibility for load (extract) to following IKM •Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 58. Join Oracle-Sourced Hive Table to Existing Hive Table •Oracle-sourced reference data in Hive can then be joined to existing Hive table as normal •Filters, aggregation operators etc can be added to mapping if required •Use IKM Hive Control Append as integration KM T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 59. New Option - Using Oracle Big Data SQL •Oracle Big Data SQL provides ability for Exadata to reference Hive tables •Use feature to create join in Oracle, bringing across Hive data through ORACLE_HIVE table T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 60. Using Hive Streaming and Python for Geocoding Data •Another requirement we have is to “geocode” the webserver log entries •Allows us to aggregate page views by country •Based on the fact that IP ranges can usually be attributed to specific countries •Not functionality normally found in Hive etc, but can be done with add-on APIs
  • 61. How GeoIP Geocoding Works •Uses the free GeoIP geocoding API and database from MaxMind •Convert the IP address to an integer •Find which integer range our IP address sits within •But Hive can’t use BETWEEN in a join…
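To make the “IP ranges as integers” idea concrete, a small self-contained sketch of the lookup: convert the address to an integer, then find the range it falls within. The ranges below are invented for illustration, not real MaxMind data - the pygeoip library used on the next slides does this lookup for real.
# Minimal sketch: IP-to-integer conversion and range lookup (illustrative data)
import bisect
import socket
import struct

def ip_to_int(ip):
    # '86.188.0.1' -> big-endian unsigned integer
    return struct.unpack('!I', socket.inet_aton(ip))[0]

# (range_start, country) pairs, sorted by range_start - made-up examples
ranges = [(ip_to_int('81.0.0.0'),   'Czech Republic'),
          (ip_to_int('86.128.0.0'), 'United Kingdom'),
          (ip_to_int('88.0.0.0'),   'Spain')]
starts = [r[0] for r in ranges]

def country_for(ip):
    idx = bisect.bisect_right(starts, ip_to_int(ip)) - 1
    return ranges[idx][1] if idx >= 0 else 'Unknown'

print(country_for('86.188.10.20'))   # -> United Kingdom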
  • 62. Solution : IKM Hive Transform •IKM Hive Transform can pass the output of a Hive SELECT statement through a perl, python, shell etc script to transform content •Uses Hive TRANSFORM … USING … AS functionality
hive> add file file:///tmp/add_countries.py;
Added resource: file:///tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
    > using 'add_countries.py'
    > as (hostname,request_date,post_id,title,author,category,country)
    > from access_per_post_categories;
  • 63. Creating the Python Script for Hive Streaming •Solution requires the Python GeoIP API to be installed on all Hadoop nodes, along with the geocoding database
wget --no-check-certificate https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip install pygeoip
•Python script then parses incoming stdin lines using tab-separation of fields, and outputs the same fields (plus an extra field for the country)
#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
for line in sys.stdin:
  line = line.rstrip()
  hostname,request_date,post_id,title,author,category = line.split('\t')
  country = gi.country_name_by_addr(hostname)
  print hostname+'\t'+request_date+'\t'+post_id+'\t'+title+'\t'+author+'\t'+country+'\t'+category
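Before registering the script with Hive, it can be tested locally by piping a tab-separated sample row through it, the same way Hive TRANSFORM will. A minimal sketch follows; the sample values are hypothetical.
# Minimal sketch: test add_countries.py locally with one tab-separated row
import subprocess

sample = '89.154.1.1\t2014-12-07\t4871\tSome Post Title\tmark\tbig-data\n'
out = subprocess.run(['python', '/tmp/add_countries.py'],
                     input=sample.encode(), stdout=subprocess.PIPE)
print(out.stdout.decode())   # same row back, with a country field appended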
  • 64. Setting up the Mapping •Map the source Hive table to the target, which includes an extra “country” column •Copy the script + GeoIP.dat file to every node’s /tmp directory •Ensure all Python APIs and libraries are installed on each Hadoop node
  • 65. Configuring IKM Hive Transform •TRANSFORM_SCRIPT_NAME specifies the name of the script, and the path to it •TRANSFORM_SCRIPT has issues with parsing the script contents; leave it blank and the KM will use the existing script file instead •Optional ability to specify sort and distribution columns (can be compound) •Leave other options at default
  • 66. Executing the Mapping •KM automatically registers the script with Hive (which caches it on all nodes) •The generated HiveQL then runs the contents of the first Hive table through the script, outputting results to the target table
  • 67. Bulk Unload Summary Data to Oracle Database •Final requirement is to unload final Hive table contents to Oracle Database •Several use-cases for this: •Use Hadoop / BDA for ETL offloading •Use analysis capabilities of BDA, but then output results to RDBMS data mart or DW •Permit use of more advanced SQL query tools •Share results with other applications •Can use Sqoop for this, or use Oracle Big Data Connectors •Fast bulk unload, or transparent Oracle access to Hive
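A minimal sketch of the Sqoop route - exporting the summary Hive table’s underlying HDFS directory into an Oracle table. The connection details, target table and HDFS path are hypothetical, and the Hive table is assumed to use the default ^A field delimiter.
# Minimal sketch: Sqoop export of a Hive table's HDFS files to Oracle
import subprocess

subprocess.check_call([
    'sqoop', 'export',
    '--connect', 'jdbc:oracle:thin:@ora-db:1521/pdb1',
    '--username', 'blog_dw', '--password', 'welcome1',
    '--table', 'POSTS_SUMMARY',                            # Oracle target table
    '--export-dir', '/user/hive/warehouse/posts_summary',  # Hive table HDFS dir
    '--input-fields-terminated-by', '\\001'                # Hive default delimiter
])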
  • 68. IKM File/Hive to Oracle (OLH/ODCH) •KM for accessing HDFS/Hive data from Oracle •Either sets up ODCH connectivity, or bulk-unloads via OLH •Map from HDFS or Hive source to Oracle tables (via Oracle technology in Topology) T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
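For reference, a minimal sketch of the Oracle Loader for Hadoop job that the OLH option of this KM wraps, submitted from Python. The OLH install path and the job configuration file (which would name the target table, input directory, output mode etc) are hypothetical.
# Minimal sketch: submitting an Oracle Loader for Hadoop job directly.
# Paths and the configuration file name are hypothetical.
import subprocess

subprocess.check_call([
    'hadoop', 'jar', '/opt/oracle/olh/jlib/oraloader.jar',
    'oracle.hadoop.loader.OraLoader',
    '-conf', '/home/oracle/olh_posts_summary.xml',    # target table, input dir etc
    '-libjars', '/opt/oracle/olh/jlib/oraloader.jar'
])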
  • 69. Configuring the KM Physical Settings •For the access table in Physical view, change LKM to LKM SQL Multi-Connect •Delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 70. Create Package to Sequence ETL Steps •Define package (or load plan) within ODI12c to orchestrate the process •Call package / load plan execution from command-line, web service call, or schedule T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
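As an illustration of the command-line route, a minimal sketch of invoking the scenario generated from the package via the ODI standalone agent’s startscen.sh; the agent path, scenario name, version and context here are hypothetical.
# Minimal sketch: run the generated ODI scenario from an external scheduler
import subprocess

subprocess.check_call([
    '/u01/odi/agent/bin/startscen.sh',
    'LOAD_WEBSERVER_LOGS',   # scenario generated from the package / load plan
    '001',                   # scenario version
    'GLOBAL'                 # ODI context
])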
  • 71. Execute Overall Package •Each step executed in sequence •End-to-end ETL process, using ODI12c’s metadata-driven development process, data quality handling and heterogeneous connectivity, but Hadoop-native processing
  • 72. Coming Soon… Going beyond Hive, Pig & MapReduce & Oracle’s New Big Data Products T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 73. Hadoop 2.0 : YARN, Spark and Tez •Hadoop 2.0 breaks the link between Hadoop and MapReduce •Separates out resource management from job scheduling •Introduces a new component - YARN ‣“Yet Another Resource Negotiator” ‣Backwards-compatible with Hadoop 1.0 •Makes Hadoop and YARN more of an “OS” •Makes it possible to run other processing types on Hadoop ‣For example, Apache Spark, Apache Tez •Now used in CDH5, Hadoop 2.0 etc
  • 74. Apache Tez •Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc •Models processing as an entire data flow graph (DAG), rather than separate job steps ‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems ‣Dataflow steps pass data between them as streams, rather than writing/reading from disk •Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez •The in-memory / Hive v2 route favoured by Hortonworks (diagram: Pig/Hive on MapReduce vs Pig/Hive on Tez)
  • 75. Apache Spark •Another DAG execution engine running on YARN •More mature than Tez, with a richer API and more vendor support •Uses the concept of an RDD (Resilient Distributed Dataset) ‣RDDs are like tables or Pig relations, but can be cached in-memory ‣Great for in-memory transformations, or iterative/cyclic processes •Spark jobs comprise a DAG of tasks operating on RDDs •Access through Scala, Python or Java APIs •Related projects include ‣Spark SQL ‣Spark Streaming
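As a quick illustration of the Python API, a minimal sketch of the same first steps shown in Scala on the next two slides, assuming the pyspark shell where a SparkContext is already available as sc.
# Minimal sketch: log filtering via PySpark (Python API equivalent)
logfile = sc.textFile('logs/access_log').cache()    # RDD, cached in memory
print(logfile.count())                               # total requests

biapps11g = logfile.filter(lambda line: '/biapps11g/' in line)
print(biapps11g.count())                             # requests for /biapps11g/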
  • 76. Apache Spark Example : Simple Log Analysis •Load logfile into RDD, do row count
scala> val logfile = sc.textFile("logs/access_log")
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) called with curMem=234759, maxMem=309225062
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
logfile: org.apache.spark.rdd.RDD[String] = MappedRDD[31] at textFile at <console>:15
scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished: count at <console>:18, took 0.192536694 s
res7: Long = 154563
•Load logfile into RDD and cache it, create another RDD from it filtered on /biapps11g/
scala> val logfile = sc.textFile("logs/access_log").cache
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403
  • 77. Apache Spark Example : Simple Log Analysis •Import a log parsing library, then use it to generate a list of URIs creating 404 errors
scala> import com.alvinalexander.accesslogparser._
val p = new AccessLogParser
def getStatusCode(line: Option[AccessLogRecord]) = {
  line match {
    case Some(l) => l.httpStatusCode
    case None => "0"
  }
}
def getRequest(rawAccessLogString: String): Option[String] = {
  val accessLogRecordOption = p.parseRecord(rawAccessLogString)
  accessLogRecordOption match {
    case Some(rec) => Some(rec.request)
    case None => None
  }
}
def extractUriFromRequest(requestField: String) = requestField.split(" ")(1)
log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).count
val recs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_))
val distinctRecs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404")
  .map(getRequest(_))
  .collect { case Some(requestField) => requestField }
  .map(extractUriFromRequest(_))
  .distinct
distinctRecs.count
distinctRecs.foreach(println)
  • 78. Coming Soon : Oracle Big Data Discovery •Combining of Endeca Server search, analysis and visualisation capabilities with Apache Spark data munging and transformation ‣Analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine ‣Create a catalog of the data on your Hadoop cluster, then search that catalog using Endeca Server ‣Create recommendations of other datasets, based on what you’re looking at now ‣Visualize your datasets, discover new insights T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 79. Coming Soon : Oracle Data Enrichment Cloud Service •Cloud-based service for loading, enriching, cleansing and supplementing Hadoop data •Part of the Oracle Data Integration product family •Used up-stream from Big Data Discovery •Aims to solve the “data quality problem” for Hadoop T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com
  • 80. Conclusions •Hadoop, and the Oracle Big Data Appliance, is an excellent platform for data capture, analysis and processing •Hadoop tools such as Hive, Sqoop, MapReduce and Pig provide means to process and analyse data in parallel, using languages + approach familiar to Oracle developers •ODI12c provides several benefits when working with ETL and data loading on Hadoop ‣Metadata-driven design; data quality handling; KMs to handle technical complexity •Oracle Data Integrator Adapter for Hadoop provides several KMs for Hadoop sources •In this presentation, we’ve seen an end-to-end example of big data ETL using ODI ‣The power of Hadoop and BDA, with the ETL orchestration of ODI12c
  • 81. Thank You for Attending! •Thank you for attending this presentation - more information can be found at http://www.rittmanmead.com •Contact us at info@rittmanmead.com or mark.rittman@rittmanmead.com •Look out for our book, “Oracle Business Intelligence Developers Guide”, out now! •Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)
  • 82. Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors Mark Rittman, CTO, Rittman Mead UKOUG Tech’14 Super Sunday, December 2014 T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India) E : info@rittmanmead.com W : www.rittmanmead.com