BI, Hive or Big Data Analytics?

BI, Hive or Big Data Analytics?

© 2012 Datameer, Inc. All rights reserved.

View the Recording of these Slides!
You can view the full recording of this
on-demand webinar with slides at:
http://info.datameer.com/Slideshare-BI-HiveBig-Data-Analytics.html
!


About our Speaker!
Todd Nash!
!

Todd is a founding Principal at CBIG
Consulting, a professional services ﬁrm that
helps clients leverage their data assets to
produce timely, effective business strategies
and tactical decisions. Todd leads CBIG’s
eastern region consulting practice in the
development, implementation, and execution of
business intelligence and Big Data
methodologies, cloud-based analytics
strategies, and complex data warehousing
solutions.!
!
Todd graduated from Clemson University with a
Bachelor of Science degree in Management
Information Systems.!

About our Speaker!
Eduardo Rosas!

!
Eduardo Rosas is Vice President of Services at
Datameer and brings over 12 years of software
implementation experience to the table.!
!
In this role, Eduardo is focused on delivering
repeatable, high quality level of services and
support to help clients achieve their goals. !
!
Prior to Datameer, Eduardo spent 11 years at
Trintech where he focused on managing a team of
Technical Consultants and implementing global
Java web based solutions. Eduardo is originally
from San Jose, CA and graduated from Santa
Clara University.!
!


Agenda

•  Problem
Statement
–
Business
&
Technical

•  POC
Technical
Solu;on
–
High-‐level
and
Detailed

•  Results

•  Lessons
Learned

Copyright
©
2013
CBIG
Consul;ng

5

PROBLEM STATEMENT

Copyright
©
2013
CBIG
Consul;ng

6

Business
Problem
Statement

A
Real
Estate
.com
business
makes
money
in
two
ways:

1.  Property
Owners
adver;se
proper;es

2.  Ancillary
businesses
adver;se
services

This
site
needs
the
analy;cs
to
show
customers
the
return
on
their
investment

SEARCH

IMPRESSIONS

CLICK-‐THRU

LEAD

Breadth:

•  Searches
to
Impressions
to
Click
Thru
to
Leads

•  Website
op;miza;on

•  Customer
op;miza;on
&
upgrades

•  Market
op;miza;on

Depth:

•  Can
the
search
criteria
be
op;mized?

•  Conversion
of
impressions
based
on
reﬁnement
of
search?

•  Which
product
mix
of
impressions
get
the
greatest
click
thru

•  What
is
the
impact
of
ameni;es
to
leads?

•  What
addi;onal
features
get
used
to
convert
to
leads?

Copyright
©
2013
CBIG
Consul;ng

7

Source

Source

• 
• 
• 
• 
• 

Web

Ac7vity

Master

Data

Search
&

Impression

ODS

Lookup

Data

Data
Movement

Source

Data
Movement

Source

Service

Search

Data

Movement

Technical
Problem
Statement

Search
&

Impression

EDW

Search

Cube

Sales
Cube

Marke7ng

Cube

Search & Impressions volume too large to build cube and provide deep analytics
This has a negative impact on all reporting and performance of the entire system
The business is unable to determine the value of all the data; has requests to add more
Evaluating options to increase environment or look for alternatives
POC to evaluate how Hadoop, Amazon cloud and Datameer could support challenge
Copyright
©
2013
CBIG
Consul;ng

8

Technical
Problem
Statement

Source

Source

• 
• 
• 
• 
• 

Web

Ac7vity

Master

Data

Search
&

Impression

ODS

Lookup

Data

Data
Movement

Source

Data
Movement

Source

Service

Search

Data

Movement

Search

EDW

Sales
Cube

Marke7ng

Cube

Search & Impressions volume too large to build cube and provide deep analytics
This has a negative impact on all reporting and performance of the entire system
The business is unable to determine the value of all the data; has requests to add more
Evaluating options to increase environment or look for alternatives
POC to evaluate how Hadoop, Amazon cloud and Datameer could support challenge
Copyright
©
2013
CBIG
Consul;ng

9

Problem
Statement
–
Success
Criteria

Objec7ve:

To
prove
that
the
Hadoop
architecture
is
an
excellent
op;on
for
the
business
to

interact
with
large
data
and
ﬁnd
dataset
and
rela;onships
that
require
deeper

analy;cs.

Original
Scope
&
Goals:

•  Bring
in
one
years
worth
of
data
from
6
tables,
into
the
Amazon
Cloud
Hadoop

environment.

•  IT
resources
will
be
able
to
extract
the
data
from
these
tables
and
load
them

into
.CSV
ﬁles.

•  The
success
criteria
for
this
stream
of
work
will
be:

ü  Amazon
Hadoop
cloud
environment
&
account
is
setup.

ü  Search
Analy;cs
data
loaded
into
the
Amazon
Hadoop
cloud

ü  Business
is
able
to
execute
and
perform
analy;cs
on
Search
Analy;cs
data

that
is
stored
in
Hadoop
with
acceptable
performance.

ü  Gain
analy;cal
insights
with
new
solu;on

Copyright
©
2013
CBIG
Consul;ng

10

POC TECHNICAL SOLUTION

Copyright
©
2013
CBIG
Consul;ng

11

POC
Technical
Solu;on
–
High
Level

Web
Ac7vity

History

Lookup
Data

AWS

S3

Datameer

(Data

Discovery)

Web
Portal

(Widget
Based

UI)

AWS

EMR

(Hadoop)

Amazon
Web
Services
(Cloud)

Copyright
©
2013
CBIG
Consul;ng

12

POC
Technical
Solu;on
-‐
Detailed

Amazon
Cloud

AllLeads

WebClicks

WebClicks

Web

Impressions

WebLead

Data
Movement

AllLeads

Web

Impressions

WebLead

WebSearch

WebSearch

WebVisit

WebVisit

Generic
Ac;vity

LR
Apts
IMPS

Other
Leads

EmailLeads

Data
Movement

EmailLeads

Phone
Leads

Generic
Ac;vity

LR
Apts
IMPS

Other
Leads

Phone
Leads

Site

SubSite

Site

Lead
Type

PageType

Lead
Type

Event
Type

PhoneType

Event
Type

Email
Type

Contaniner

Type

Aﬄiate

Product
ID

Email
Type

Contaniner

Type

Aﬄiate

Property
List

SearchType

S3

Hadoop

SubSite

PageType

PhoneType

Product
ID

Property
List

SearchType

Data
Workbooks

AllLeads

WebClicks

Web
Impressions

WebLeads

WebSearch

WebVisits

Use
Case
Workbooks

Use
Case1

Use
C
ase
2

Addi7onal
Data
Workbooks

Addi7onal

Use
Cases

RESULTS

Copyright
©
2013
CBIG
Consul;ng

14

POC
Results

Success
Criteria

Hadoop,
Amazon,
Datameer

environment
setup

Able
to
load
1
years
worth
of

data
–
nearly
1.3
TB

Business
able
to
execute
and

perform
analy;cs

Users
provided
acceptable

performance

Gain
new
insights

Results

Environment
setup
within
the
1st
couple
of
days

Loaded
signiﬁcantly
more
data
than
planned
for

more
robust
analy;cs

Business
leveraged
Datameer
to
execute
use
cases;

executed
~20
addi;onal
without
IT
help

Queries
executed
to
comple;on.
Some
took
seconds,

some
took
minutes
and
some
required
overnight.

1st
;me
able
to
run
these
analy;cs.

Found
pajerns

and
rela;onships
contrary
to
assump;ons.

Will
be

upda;ng
service
oﬀerings
&
marke;ng
plans
because

of
POC

Copyright
©
2013
CBIG
Consul;ng

15

LESSONS LEARNED

Copyright
©
2013
CBIG
Consul;ng

16

Lessons
Learned

GETTING
DATA
TO
HADOOP

Hadoop
is
ﬁle
structure

Finding
the
right
delimiter

Integra;ng
data

Requires
ETL

Data
cleansing
can
be
big

Several
itera;ons
required

CLOUD

Cloud
ﬂexible

Easy
setup
and
scaling

Performance
&
sizing

Sizing
the
cloud
is
challenging

Cost
for
performance

TBs
with
support
becomes
costly

HADOOP

Hadoop
is
batch

Answers
one
thing
at
a
;me

Analy;cs

Move
to
database
w/
tools

PEOPLE

Remember
change
mgmt

Educa;on
new
methods
&
tools

Copyright
©
2013
CBIG
Consul;ng

17

So what about open source
tools like hive?


Hive…!

! 

Goal of hive!
• 

! 

Eases the complexity of writing
MapReduce jobs by providing the
technical user a set of tools that are
more familiar with via sql!

Who can use hive?!
• 

SQL Users can pick up hql basics fairly
quickly!
! 

Prerequisites!
• 
• 
• 

Must have data in hadoop!
The data must be CLEAN!
Schema must be applied to the
data by creating a hive table!


What is hive really good at?!
! 

Hive is good in environments where we have clean prepared
data that doesn’t change often already in hadoop!

!
! 

Resembles a language that many IT folks are already familiar
with.!

!
! 

Hive can help a user trying to identify a reporting trend!

!
! 

User deﬁned ﬁelds (UDFs) can be used to reuse functions!


Some troubles!
<< - Start of Hive script ->>
--Create an TEMP Housing Table

CREATE EXTERNAL TABLE MY_TABLE(
num_ods
string,
num_bus_id int,
um_ctry_cd
int,
prod_id
string,
rng_svc_cd string,
rng6
string,
bin string,
bin_bus_id_enr
int,
bin_ctry_cd int,
cd_fmt_a_2
string,
cd_enr string,
rsn_us_ind string,
x_bus_id
int,
flg_enr
string,
my_dt string,
user_id
string,
mthd_cd_enr
string,
tran_seq_id string,
cd_enr2
string,
us_amt
string,
moto_cd
string,
fee_curr_cd
int,
fee_desc_num
string,
fee_sgn_amt
string,
us_fee_sgn_amt
string,
mkt_spec
string,
catg_cd
int,
city_enr
string,
ctry_cd_enr
int,
dba_id
int,
nm_dscrptr string,
geo_id
int,
geo_phone_num
string,
tier_cd
string,
msa
string,
nrmlzd_id int,
pstl_cd
string,
b_st_cd_enr string,
b_store_id
string,
b_vrfcn_val string,
ntwrk_id
int,
site string,

! 
! 
! 
! 
! 

entry_mode_cd
string,
term_cpbty_cd
string,
sub_typ_cd string,
dt string,
id_num_enr int,
prod_num
int,
prod_ppd_sub_typ_cd
string,
prod_typ_cd_enr string,
prod_typ_ext_enr
string,
promo_cd
string,
promo_typ
string,
rwds_pgm_id_enr string,
tran_cd string,
tran_gmt_dt
string,
tran_gmt_tm
string,
tran_id string,
unfrzn_acct_num_bus_id_enr
int,
unfrzn_arn_bin_bus_id_enr
int,
usage_cd_enr
string,
Other_amt
string,
curr_cd
int,
dt
string,
)COMMENT "THIS IS MY TEMP TABLE";

--INSERT DATA INTO MY_TABLE

INSERT OVERWRITE MY_TABLE
select * ,
SUM(us_tran_amt) AS SALES_VOL,
SUM(US_FEE_SGN_AMT) AS US_FEE_SGN_AMT,
COUNT(*) AS TRAN_COUNT,
MIN(ACTIVE_DT) AS FIRST_ACTIVE_DT,
MAX(SEARCH_DT) AS LAST_SEARCH_DT,
MAX(customer_biz_id) AS customer_biz_id,
MAX(PGM_ID_ENR) AS PGM_ID_ENR,
MAX(CUST_PROD_ID) AS CUST_PROD_ID ,
MAX(POD_ID_NUM_ENR) AS POD_ID_NUM_ENR,
MAX(PROD_TYPE)AS PROD_TYPE,
MAX(SUB_TYPE) AS SUB_TYPE,
1 as ID
from MY_TABLE
WHERE dt like '2012%'
GROUP BY
customer_biz_id,
PGM_ID_ENR,
CUST_PROD_ID,

eci_moto_cd,
catg_cd,
city_enr,
ctry_cd_enr,
pstl_cd,
pod_id,
prod_num,
SUB_TYPE;

--CREATE TEMP LOOKUP TABLE

CREATE EXTERNAL TABLE TEMP_LOOKUP(
acct_num
bigint,
acct_sta_cd
string,
acct_zip_cd
string,
rwrd_pgm_id
string,
pgm_ref_cd

string,
acct_prod_id
string,
bus_id
int,
bin
int,
status
string,
pgm_eff_dt
string,
dt
string,
)COMMENT "THIS IS TEMP LOOKUP TABLE";

--INSERT DATA INTO IT

INSERT OVERWRITE MY_LOOKUP
SELECT *, 1 as cmf_ind
FROM LOOKUP
WHERE DT = '201211';

--Do a Full Outer Join

SELECT * FROM MY_TABLE mt
FULL OUTER JOIN
MY_LOOKUP ml
ON mt.member_id = ml.member_id;

No way to get data in hadoop!
No data validation / may throw data away!
Security !
Sharing code via teams is a challenge!
No visualization!


… but it’s free right?!
! 

! 

"Time to create Hive":  
 
Any machine-generated data (or anything semi/unstructured) must ﬁrst be parsed by writing !!
!MapReduce or Pig/Python programs. Time-to-market disadvantage. 
Table deﬁnition is a manual effort (though this can be made easier by 3rd party tools). 
!
"Time to maintain Hive": 
 
Hive data models (tables) are most likely static, shared objects maintained and controlled by a few
people who own the schema !
Hive is also more of a black box for new employees coming in (so employee churn creates more
maintenance effort). !

!
! 

Cost to implement Hive: 
 
This is mostly down to the human capital (expensive developers), and don't forget the prerequisite
cost of implementing the data ingestion stage of the pipeline (populating the warehouse by writing
MapReduce programs or other programs parsing/loading the data). !


Business decsion!
!
! 

Do I train my engineers on a language or
eliminate the need from this by taking the problem
directly to the business user.!

!


So what would my hive resource need
to know?!
! 

!

! 

Hive QL (different dialect than ANSI standard SQL) 
!
MapReduce TUNING parameters. (to name a few)!
•  Data block size!
•  Number of mappers/reducers!
•  Compression at map out level; result compression; what codec to use!
•  io.sort.factor !!
Access to hive is mainly done via Command line interface!


How does Datameer do it differently
!


Questions and Answers!


Online Resources

§ 
§ 

Try Datameer: www.datameer.com!
Follow us on Twitter @datameer!

!

!

BI, Hive or Big Data Analytics?

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Andere mochten auch

Andere mochten auch (8)

Ähnlich wie BI, Hive or Big Data Analytics?

Ähnlich wie BI, Hive or Big Data Analytics? (20)

Mehr von Datameer

Mehr von Datameer (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

BI, Hive or Big Data Analytics?