Slide 2: Introduction
Jan Pieter Posthuma
Technical Lead Microsoft BI and Big Data consultant
Inter Access, a local consultancy firm in the Netherlands
Architect role on multiple projects
Analysis Services, Reporting Services, PerformancePoint Services, Big Data, HDInsight, Cloud BI
http://twitter.com/jppp
http://linkedin.com/jpposthuma
jan.pieter.posthuma@interaccess.nl
Slide 3: Expectations
What to cover
– Simple ETL, with simple sources
– A different way to achieve the same result
What not to cover
– Big Data
– Best practices
– Deep Hadoop internals
Slide 5: Hadoop
Hadoop is a collection of software for building a data-intensive distributed cluster running on commodity hardware.
Widely accepted by database vendors as a solution for unstructured data
Microsoft partners with Hortonworks and delivers their Hortonworks Data Platform as Microsoft HDInsight
Available on-premises and as an Azure service
Hortonworks Data Platform (HDP) is 100% open source!
Slide 6: Hadoop
[Architecture diagram: big data sources (crawlers, bots, devices, sensors – raw, unstructured) feed HDInsight on Windows Azure / Windows Server and StreamInsight (alerts, notifications, data- and compute-intensive applications); source systems (ERP, CRM, LOB apps) feed enterprise ETL with SSIS, DQS and MDS; data is integrated/enriched (including the Azure Market Place) and summarized and loaded – historical data beyond the active window via FastLoad – into SQL Server, SQL Server FTDW data marts and SQL Server Parallel Data Warehouse; SQL Server Reporting Services and SQL Server Analysis Services deliver business insights, interactive reports and performance scorecards.]
CREATE EXTERNAL TABLE Customer
WITH
(LOCATION = 'hdfs://10.13.12.14:5000/user/Hadoop/Customer'
, FORMAT_OPTIONS (FIELDS_TERMINATOR = ','))
AS
SELECT * FROM DimCustomer
Slide 7: Hadoop
HDFS – distributed, fault-tolerant file system
MapReduce – framework for writing/executing distributed, fault-tolerant algorithms
Hive & Pig – high-level query languages: SQL-like (Hive) and data-flow oriented (Pig)
Sqoop/PolyBase – packages for moving data between HDFS and relational DB systems
+ Others…
[Stack diagram: HDFS at the base; Map/Reduce on top of it; Hive & Pig on top of Map/Reduce; Sqoop/PolyBase bridging to RDBMS; alongside Avro (serialization), HBase and Zookeeper; ETL tools and BI reporting connect from outside.]
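To give a flavour of the Hive layer, here is a minimal HiveQL sketch (the table and file location are hypothetical, not from this deck): it maps a table onto comma-separated files in HDFS and then queries it with plain SQL, which Hive compiles into MapReduce jobs.

-- Hypothetical example: expose CSV files in HDFS as a Hive table
CREATE EXTERNAL TABLE customer (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/customer';

-- Queried like any SQL table; Hive generates the MapReduce job
SELECT count(*) FROM customer;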
Slide 9: HDFS
[Diagram: a NameNode (with a BackupNode keeping namespace backups) coordinates DataNodes via heartbeat, balancing, replication, etc.; DataNodes write to local disk.]
HDFS was designed with the expectation that failures (both hardware and software) would occur frequently.
Slide 10: Map/Reduce
Programming framework (library and runtime) for analyzing data sets stored in HDFS
The MR framework provides all the "glue" and coordinates the execution of the Map and Reduce jobs on the cluster.
– Fault tolerant
– Scalable
Map function:
var map = function (key, value, context) {};
Reduce function:
var reduce = function (key, values, context) {};
Slide 15: Hive and Pig
Query: find the sourceIP address that generated the most adRevenue, along with its average pageRank.
Rankings
(
    pageURL STRING,
    pageRank INT,
    avgDuration INT
);
UserVisits
(
    sourceIP STRING,
    destURL STRING,
    visitDate DATE,
    adRevenue FLOAT,
    ... // fields omitted
);
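In Hive, the whole task is a handful of lines. The following HiveQL is an illustrative sketch (not from the deck), assuming the two tables above are registered in Hive:

SELECT sourceIP, totalRevenue, avgPageRank
FROM (
    SELECT UV.sourceIP,
           AVG(R.pageRank)   AS avgPageRank,
           SUM(UV.adRevenue) AS totalRevenue
    FROM Rankings R JOIN UserVisits UV ON (R.pageURL = UV.destURL)
    GROUP BY UV.sourceIP
) T
ORDER BY totalRevenue DESC
LIMIT 1;

The equivalent hand-written MapReduce implementation, by contrast, runs to several pages of Java: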
package edu.brown.cs.mapreduce.benchmarks;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.fs.*;
import edu.brown.cs.mapreduce.BenchmarkBase;
public class Benchmark3 extends Configured implements Tool {
public static String getTypeString(int type) {
if (type == 1) {
return ("UserVisits");
} else if (type == 2) {
return ("Rankings");
}
return ("INVALID");
}
/* (non-Javadoc)
* @see org.apache.hadoop.util.Tool#run(java.lang.String[])
*/
public int run(String[] args) throws Exception {
BenchmarkBase base = new BenchmarkBase(this.getConf(), this.getClass(), args);
Date startTime = new Date();
System.out.println("Job started: " + startTime);
// Phase #1
// -------------------------------------------
JobConf p1_job = base.getJobConf();
p1_job.setJobName(p1_job.getJobName() + ".Phase1");
Path p1_output = new Path(base.getOutputPath().toString() + "/phase1");
FileOutputFormat.setOutputPath(p1_job, p1_output);
//
// Make sure we have our properties
//
String required[] = { BenchmarkBase.PROPERTY_START_DATE,
BenchmarkBase.PROPERTY_STOP_DATE };
for (String req : required) {
if (!base.getOptions().containsKey(req)) {
System.err.println("ERROR: The property '" + req + "' is not set");
System.exit(1);
}
} // FOR
p1_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class :
KeyValueTextInputFormat.class);
if (base.getSequenceFile()) p1_job.setOutputFormat(SequenceFileOutputFormat.class);
p1_job.setOutputKeyClass(Text.class);
p1_job.setOutputValueClass(Text.class);
p1_job.setMapperClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableMap.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextMap.class);
p1_job.setReducerClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TupleWritableReduce.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase1.TextReduce.class);
p1_job.setCompressMapOutput(base.getCompress());
// Phase #2
// -------------------------------------------
JobConf p2_job = base.getJobConf();
p2_job.setJobName(p2_job.getJobName() + ".Phase2");
p2_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class :
KeyValueTextInputFormat.class);
if (base.getSequenceFile()) p2_job.setOutputFormat(SequenceFileOutputFormat.class);
p2_job.setOutputKeyClass(Text.class);
p2_job.setOutputValueClass(Text.class);
p2_job.setMapperClass(IdentityMapper.class);
p2_job.setReducerClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TupleWritableReduce.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase2.TextReduce.class);
p2_job.setCompressMapOutput(base.getCompress());
// Phase #3
// -------------------------------------------
JobConf p3_job = base.getJobConf();
p3_job.setJobName(p3_job.getJobName() + ".Phase3");
p3_job.setNumReduceTasks(1);
p3_job.setInputFormat(base.getSequenceFile() ? SequenceFileInputFormat.class :
KeyValueTextInputFormat.class);
p3_job.setOutputKeyClass(Text.class);
p3_job.setOutputValueClass(Text.class);
//p3_job.setMapperClass(Phase3Map.class);
p3_job.setMapperClass(IdentityMapper.class);
p3_job.setReducerClass(base.getTupleData() ?
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TupleWritableReduce.class :
edu.brown.cs.mapreduce.benchmarks.benchmark3.phase3.TextReduce.class);
//
// Execute #1
//
base.runJob(p1_job);
//
// Execute #2
//
Path p2_output = new Path(base.getOutputPath().toString() + "/phase2");
FileOutputFormat.setOutputPath(p2_job, p2_output);
FileInputFormat.setInputPaths(p2_job, p1_output);
base.runJob(p2_job);
//
// Execute #3
//
Path p3_output = new Path(base.getOutputPath().toString() + "/phase3");
FileOutputFormat.setOutputPath(p3_job, p3_output);
FileInputFormat.setInputPaths(p3_job, p2_output);
base.runJob(p3_job);
// There does need to be a combine
if (base.getCombine()) base.runCombine();
return 0;
}
}
Slide 16: Hive and Pig
Principle is the same: easy data retrieval
Both use MapReduce
Different founders: Facebook (Hive) and Yahoo! (Pig)
Different languages: SQL-like (Hive) and more procedural (Pig)
Both can store data in tables, which are stored as HDFS file(s)
Extra language options to use the benefits of Hadoop (see the sketch below)
– PARTITION BY statement
– Map/Reduce statement
'Of the 150k jobs Facebook runs daily, only 500 are MapReduce jobs. The rest are HiveQL.'
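A minimal HiveQL sketch of both extensions (table, column and script names are hypothetical; TRANSFORM is Hive's hook for running rows through a custom map/reduce-style script):

-- Partitioned table: each visitDate value maps to an HDFS subdirectory
CREATE TABLE visits (sourceIP STRING, adRevenue FLOAT)
PARTITIONED BY (visitDate STRING);

-- Stream rows through a user-supplied script, MapReduce-style
SELECT TRANSFORM (sourceIP, adRevenue)
    USING 'python aggregate.py'   -- hypothetical script
    AS (sourceIP, totalRevenue)
FROM visits;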
Slide 17: Hive
Query 1: SELECT count_big(*) FROM lineitem
Query 2: SELECT max(l_quantity) FROM lineitem
         WHERE l_orderkey > 1000 AND l_orderkey < 100000
         GROUP BY l_linestatus

Runtimes (secs.):
        Query 1   Query 2
Hive    1318      1397
PDW     252       279
Slide 18: Demo
Use the same data file as the previous demo
But now we directly 'query' the file
Slide 20: PolyBase
PDW v2 introduces external tables to represent HDFS data
PDW queries can now span HDFS and PDW data
Hadoop cluster is not part of the appliance
[Diagram: unstructured data from social apps, sensor & RFID, mobile apps and web apps lands in HDFS; structured data lives in relational databases (RDBMS); Sqoop/PolyBase bridges the two, and the enhanced PDW query engine exposes both through T-SQL.]
Slide 21: PolyBase
[Diagram: a PDW cluster of SQL Server compute nodes next to a Hadoop cluster of DataNodes.]
This is PDW!
Slide 22: PDW + Hadoop
1. Retrieve data from HDFS with a PDW query
   – Seamlessly join structured and semi-structured data
2. Import data from HDFS to PDW
   – Parallelized CREATE TABLE AS SELECT (CTAS)
   – External tables as the source
   – PDW table, either replicated or distributed, as destination
3. Export data from PDW to HDFS
   – Parallelized CREATE EXTERNAL TABLE AS SELECT (CETAS)
   – External table as the destination; creates a set of HDFS files
SELECT Username FROM ClickStream c, User u
WHERE c.UserID = u.ID AND c.URL = 'www.bing.com';

CREATE TABLE ClickStreamInPDW WITH (DISTRIBUTION = HASH(URL))
AS SELECT URL, EventDate, UserID FROM ClickStream;

CREATE EXTERNAL TABLE ClickStream2 (URL, EventDate, UserID)
WITH (LOCATION = 'hdfs://MyHadoop:5000/joe', FORMAT_OPTIONS (...))
AS SELECT URL, EventDate, UserID FROM ClickStreamInPDW;
Slide 23: Recap
Hadoop is the next big thing for DWH/BI
Not a replacement, but a new dimension
Many ways to integrate its data
What's next?
– PolyBase combined with (custom) Map/Reduce?
– HDInsight appliance?
– PolyBase for SQL Server vNext?
Slide 24: References
Microsoft Big Data (HDInsight): http://www.microsoft.com/bigdata
Microsoft HDInsight on Azure (3-month free trial): http://www.windowsazure.com
Hortonworks Data Platform sandbox (VMware): http://hortonworks.com/download/
Slide 26: Coming up…
Speaker             Title                                                      Room
Alberto Ferrari     DAX Query Engine Internals                                 Theatre
Wesley Backelant    An introduction to the wonderful world of OData            Exhibition B
Bob Duffy           Windows Azure For SQL folk                                 Suite 3
Dejan Sarka         Excel 2013 Analytics                                       Suite 1
Mladen Prajdić      From SQL Traces to Extended Events. The next big switch.   Suite 2
Sandip Pani         New Analytic Functions in SQL Server 2012                  Suite 4
#SQLBITS