1. BI, Hive or Big Data Analytics?
© 2012 Datameer, Inc. All rights reserved.
2. View the Recording of these Slides
You can view the full recording of this on-demand webinar with slides at:
http://info.datameer.com/Slideshare-BI-HiveBig-Data-Analytics.html
3. About our Speaker
Todd Nash
Todd is a founding Principal at CBIG Consulting, a professional services firm that helps clients leverage their data assets to produce timely, effective business strategies and tactical decisions. Todd leads CBIG's eastern region consulting practice in the development, implementation, and execution of business intelligence and Big Data methodologies, cloud-based analytics strategies, and complex data warehousing solutions.
Todd graduated from Clemson University with a Bachelor of Science degree in Management Information Systems.
4. About our Speaker
Eduardo Rosas
Eduardo Rosas is Vice President of Services at Datameer and brings over 12 years of software implementation experience to the table.
In this role, Eduardo is focused on delivering a repeatable, high-quality level of service and support to help clients achieve their goals.
Prior to Datameer, Eduardo spent 11 years at Trintech, where he focused on managing a team of Technical Consultants and implementing global Java web-based solutions. Eduardo is originally from San Jose, CA and graduated from Santa Clara University.
5. Agenda
• Problem Statement – Business & Technical
• POC Technical Solution – High-level and Detailed
• Results
• Lessons Learned
Copyright © 2013 CBIG Consulting
7. Business Problem Statement
A Real Estate .com business makes money in two ways:
1. Property owners advertise properties
2. Ancillary businesses advertise services
This site needs the analytics to show customers the return on their investment.
SEARCH → IMPRESSIONS → CLICK-THRU → LEAD
Breadth:
• Searches to Impressions to Click Thru to Leads
• Website optimization
• Customer optimization & upgrades
• Market optimization
Depth:
• Can the search criteria be optimized?
• Conversion of impressions based on refinement of search?
• Which product mix of impressions gets the greatest click thru?
• What is the impact of amenities on leads?
• What additional features get used to convert to leads?
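
The breadth question above (searches to impressions to click-thrus to leads) comes down to stage-to-stage ratios. A minimal Python sketch of that calculation follows; the `funnel_ratios` helper and the sample counts are hypothetical and are not taken from the POC data.

```python
from typing import Dict, List, Tuple

# Funnel stages in the order named on the slide.
STAGES: List[str] = ["search", "impression", "click_thru", "lead"]


def funnel_ratios(counts: Dict[str, int]) -> List[Tuple[str, str, float]]:
    """Return (from_stage, to_stage, ratio) for each adjacent pair of stages."""
    ratios = []
    for prev, curr in zip(STAGES, STAGES[1:]):
        prev_count = counts.get(prev, 0)
        # Guard against dividing by zero when a stage has no events.
        ratio = counts.get(curr, 0) / prev_count if prev_count else 0.0
        ratios.append((prev, curr, ratio))
    return ratios


if __name__ == "__main__":
    # Hypothetical counts, made up purely for illustration.
    sample = {"search": 1000, "impression": 5000, "click_thru": 250, "lead": 25}
    for prev, curr, ratio in funnel_ratios(sample):
        print(f"{prev} -> {curr}: {ratio:.2f}")
```

One search usually produces several impressions, so the first ratio can exceed 1; the later ratios are the conversion rates a customer would read as return on investment.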
8. Technical Problem Statement
[Diagram labels: Source; Web Activity; Master Data; Lookup Data; Service; Search; Data Movement; Search & Impression ODS; Search & Impression EDW; Search Cube; Sales Cube; Marketing Cube]
• Search & Impressions volume is too large to build the cube and provide deep analytics
• This has a negative impact on all reporting and on the performance of the entire system
• The business is unable to determine the value of all the data, and has requests to add more
• Evaluating options to expand the environment or look for alternatives
• POC to evaluate how Hadoop, the Amazon cloud, and Datameer could support the challenge
9. Technical Problem Statement
[Diagram labels: Source; Web Activity; Master Data; Lookup Data; Service; Search; Data Movement; Search & Impression ODS; Search EDW; Sales Cube; Marketing Cube]
10. Problem Statement – Success Criteria
Objective: To prove that the Hadoop architecture is an excellent option for the business to interact with large data and find datasets and relationships that require deeper analytics.
Original Scope & Goals:
• Bring in one year's worth of data from 6 tables into the Amazon Cloud Hadoop environment.
• IT resources will be able to extract the data from these tables and load it into .CSV files.
• The success criteria for this stream of work will be:
✓ Amazon Hadoop cloud environment & account are set up.
✓ Search Analytics data is loaded into the Amazon Hadoop cloud.
✓ Business is able to execute and perform analytics on Search Analytics data that is stored in Hadoop with acceptable performance.
✓ Gain analytical insights with the new solution.
12. POC Technical Solution – High Level
[Diagram labels: Web Activity History; Lookup Data; AWS S3; AWS EMR (Hadoop); Datameer (Data Discovery); Web Portal (Widget Based UI); Amazon Web Services (Cloud)]
13. POC Technical Solution - Detailed
[Diagram labels: Amazon Cloud; Data Movement; S3; Hadoop]
Tables (shown on both the source side and the S3/Hadoop side): AllLeads, WebClicks, Web Impressions, WebLead, WebSearch, WebVisit, Generic Activity, LR Apts IMPS, Other Leads, EmailLeads, Phone Leads, Site, SubSite, Lead Type, PageType, Event Type, PhoneType, Email Type, Container Type, Affiliate, Product ID, Property List, SearchType
Data Workbooks: AllLeads, WebClicks, Web Impressions, WebLeads, WebSearch, WebVisits
Use Case Workbooks: Use Case 1, Use Case 2
Additional Data Workbooks
Additional Use Cases
15. POC Results
• Success criterion: Hadoop, Amazon, Datameer environment set up
  Result: Environment set up within the first couple of days
• Success criterion: Able to load 1 year's worth of data – nearly 1.3 TB
  Result: Loaded significantly more data than planned, for more robust analytics
• Success criterion: Business able to execute and perform analytics
  Result: Business leveraged Datameer to execute use cases; executed ~20 additional ones without IT help
• Success criterion: Users provided acceptable performance
  Result: Queries executed to completion. Some took seconds, some took minutes, and some required an overnight run.
• Success criterion: Gain new insights
  Result: First time able to run these analytics. Found patterns and relationships contrary to assumptions. Will be updating service offerings & marketing plans because of the POC.
17. Lessons Learned
GETTING DATA TO HADOOP
• Hadoop is a file structure: finding the right delimiter
• Integrating data: requires ETL
• Data cleansing can be big: several iterations required
CLOUD
• Cloud is flexible: easy setup and scaling
• Performance & sizing: sizing the cloud is challenging
• Cost for performance: TBs with support become costly
HADOOP
• Hadoop is batch: answers one thing at a time
• Analytics: move to a database w/ tools
PEOPLE
• Remember change mgmt: education on new methods & tools
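
The "finding the right delimiter" lesson can be sketched in Python. The approach shown here (scan the extracted rows and pick a separator that never occurs inside a field) is one possible technique, not necessarily what the POC team did; the candidate list, helper names, and sample rows are all hypothetical.

```python
import csv
import io
from typing import Iterable, List, Optional, Sequence

# Candidate delimiters, tried in order. The list itself is an assumption;
# "\x01" (Ctrl-A) is included because it is Hive's default field separator.
CANDIDATES: Sequence[str] = (",", "\t", "|", "\x01")


def pick_delimiter(rows: Iterable[Sequence[str]],
                   candidates: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate that appears in no field value, or None."""
    remaining = list(candidates)
    for row in rows:
        for field in row:
            remaining = [d for d in remaining if d not in field]
        if not remaining:
            return None
    return remaining[0]


def to_delimited(rows: List[Sequence[str]], delimiter: str) -> str:
    """Serialize rows with the chosen delimiter and no quoting.

    Raises csv.Error if a field contains the delimiter or a newline,
    which is exactly the case pick_delimiter is meant to rule out.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_NONE,
                        quotechar=None, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical listing rows: free text contains commas and pipes.
    rows = [["101", "2 bed, 1 bath", "pool|gym"], ["102", "studio", "none"]]
    delim = pick_delimiter(rows)
    print(repr(delim))
    print(to_delimited(rows, delim))
```

Writing without quotes is deliberate: plain delimited text tables in Hive split on the separator and do not interpret CSV quoting, so a delimiter that can occur inside a field silently shifts columns.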
18. So what about open source tools like Hive?
19. Hive…
Goal of Hive
• Eases the complexity of writing MapReduce jobs by giving the technical user a set of tools that are more familiar, via SQL
Who can use Hive?
• SQL users can pick up HQL basics fairly quickly
Prerequisites
• Must have data in Hadoop
• The data must be CLEAN
• A schema must be applied to the data by creating a Hive table
20. What is Hive really good at?
Hive is good in environments where clean, prepared data that doesn't change often is already in Hadoop.
It resembles a language that many IT folks are already familiar with.
Hive can help a user trying to identify a reporting trend.
User-defined functions (UDFs) can be used to reuse logic across queries.
21. Some troubles
<< - Start of Hive script ->>
--Create a TEMP housing table
CREATE EXTERNAL TABLE MY_TABLE(
  num_ods string,
  num_bus_id int,
  um_ctry_cd int,
  prod_id string,
  rng_svc_cd string,
  rng6 string,
  bin string,
  bin_bus_id_enr int,
  bin_ctry_cd int,
  cd_fmt_a_2 string,
  cd_enr string,
  rsn_us_ind string,
  x_bus_id int,
  flg_enr string,
  my_dt string,
  user_id string,
  mthd_cd_enr string,
  tran_seq_id string,
  cd_enr2 string,
  us_amt string,
  moto_cd string,
  fee_curr_cd int,
  fee_desc_num string,
  fee_sgn_amt string,
  us_fee_sgn_amt string,
  mkt_spec string,
  catg_cd int,
  city_enr string,
  ctry_cd_enr int,
  dba_id int,
  nm_dscrptr string,
  geo_id int,
  geo_phone_num string,
  tier_cd string,
  msa string,
  nrmlzd_id int,
  pstl_cd string,
  b_st_cd_enr string,
  b_store_id string,
  b_vrfcn_val string,
  ntwrk_id int,
  site string,
  entry_mode_cd string,
  term_cpbty_cd string,
  sub_typ_cd string,
  dt string,
  id_num_enr int,
  prod_num int,
  prod_ppd_sub_typ_cd string,
  prod_typ_cd_enr string,
  prod_typ_ext_enr string,
  promo_cd string,
  promo_typ string,
  rwds_pgm_id_enr string,
  tran_cd string,
  tran_gmt_dt string,
  tran_gmt_tm string,
  tran_id string,
  unfrzn_acct_num_bus_id_enr int,
  unfrzn_arn_bin_bus_id_enr int,
  usage_cd_enr string,
  Other_amt string,
  curr_cd int
) COMMENT "THIS IS MY TEMP TABLE";
--INSERT DATA INTO MY_TABLE
INSERT OVERWRITE TABLE MY_TABLE
SELECT
  SUM(us_tran_amt) AS SALES_VOL,
  SUM(US_FEE_SGN_AMT) AS US_FEE_SGN_AMT,
  COUNT(*) AS TRAN_COUNT,
  MIN(ACTIVE_DT) AS FIRST_ACTIVE_DT,
  MAX(SEARCH_DT) AS LAST_SEARCH_DT,
  MAX(customer_biz_id) AS customer_biz_id,
  MAX(PGM_ID_ENR) AS PGM_ID_ENR,
  MAX(CUST_PROD_ID) AS CUST_PROD_ID,
  MAX(POD_ID_NUM_ENR) AS POD_ID_NUM_ENR,
  MAX(PROD_TYPE) AS PROD_TYPE,
  MAX(SUB_TYPE) AS SUB_TYPE,
  1 AS ID
FROM MY_TABLE
WHERE dt LIKE '2012%'
GROUP BY
  customer_biz_id,
  PGM_ID_ENR,
  CUST_PROD_ID,
  eci_moto_cd,
  catg_cd,
  city_enr,
  ctry_cd_enr,
  pstl_cd,
  pod_id,
  prod_num,
  SUB_TYPE;
--CREATE TEMP LOOKUP TABLE
CREATE EXTERNAL TABLE MY_LOOKUP(
  acct_num bigint,
  acct_sta_cd string,
  acct_zip_cd string,
  rwrd_pgm_id string,
  pgm_ref_cd string,
  acct_prod_id string,
  bus_id int,
  bin int,
  status string,
  pgm_eff_dt string,
  dt string
) COMMENT "THIS IS TEMP LOOKUP TABLE";
--INSERT DATA INTO IT
INSERT OVERWRITE TABLE MY_LOOKUP
SELECT *, 1 AS cmf_ind
FROM LOOKUP
WHERE DT = '201211';
--Do a Full Outer Join
SELECT * FROM MY_TABLE mt
FULL OUTER JOIN
MY_LOOKUP ml
ON mt.member_id = ml.member_id;
No way to get data into Hadoop
No data validation / may throw data away
Security
Sharing code across teams is a challenge
No visualization
22. … but it's free, right?
"Time to create Hive":
Any machine-generated data (or anything semi/unstructured) must first be parsed by writing MapReduce or Pig/Python programs. Time-to-market disadvantage.
Table definition is a manual effort (though this can be made easier by 3rd party tools).
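
To illustrate why table definition is a manual, error-prone step, here is a minimal Python sketch that renders a Hive `CREATE EXTERNAL TABLE` statement from a column list. The helper `hive_external_table_ddl`, the table name, the columns, and the S3 path are all hypothetical; this is not a Datameer or Hive tool, only an example of the kind of small generator a team might write.

```python
from typing import List, Tuple


def hive_external_table_ddl(table: str,
                            columns: List[Tuple[str, str]],
                            location: str,
                            delimiter: str = ",") -> str:
    """Render a CREATE EXTERNAL TABLE statement for delimited text files."""
    if not columns:
        raise ValueError("at least one column is required")
    seen = set()
    for name, _ in columns:
        key = name.lower()  # Hive treats column names case-insensitively
        if key in seen:
            raise ValueError(f"duplicate column: {name}")
        seen.add(key)
    # Joining with ",\n" guarantees there is no trailing comma before ")".
    col_lines = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{col_lines}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


if __name__ == "__main__":
    # Hypothetical table, columns, and bucket, for illustration only.
    print(hive_external_table_ddl(
        "web_search",
        [("search_id", "bigint"), ("search_dt", "string")],
        "s3://example-bucket/web_search/",
    ))
```

Generating the statement catches duplicate column names and stray trailing commas before the DDL ever reaches Hive, two mistakes that are easy to make when a wide table is typed by hand.
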
"Time to maintain Hive":
Hive data models (tables) are most likely static, shared objects maintained and controlled by a few people who own the schema.
Hive is also more of a black box for new employees coming in (so employee churn creates more maintenance effort).
Cost to implement Hive:
This is mostly down to the human capital (expensive developers), and don't forget the prerequisite cost of implementing the data ingestion stage of the pipeline (populating the warehouse by writing MapReduce programs or other programs parsing/loading the data).
23. Business decision
Do I train my engineers on a language, or eliminate the need for this by taking the problem directly to the business user?
24. So what would my Hive resource need to know?
Hive QL (a different dialect than ANSI standard SQL)
MapReduce tuning parameters (to name a few):
• Data block size
• Number of mappers/reducers
• Compression at the map output level; result compression; which codec to use
• io.sort.factor
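
As a rough illustration of what those tuning parameters look like in practice, the Python sketch below renders a handful of settings as Hive `SET` statements. The property names are the Hadoop 1.x-era names (later releases rename several of them), and every value is an illustrative assumption rather than a recommendation; the `as_hive_set_statements` helper is hypothetical.

```python
from typing import Dict

# Hadoop 1.x-era property names; newer releases rename several of them.
# All values are illustrative assumptions, not tuning recommendations.
# The number of mappers is not set directly: it follows from the input
# split size, so only the reducer count appears here.
TUNING: Dict[str, str] = {
    "dfs.block.size": str(128 * 1024 * 1024),  # data block size in bytes
    "mapred.reduce.tasks": "32",               # number of reducers
    "mapred.compress.map.output": "true",      # compress map output
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
    "hive.exec.compress.output": "true",       # compress query results
    "mapred.output.compression.codec":
        "org.apache.hadoop.io.compress.GzipCodec",
    "io.sort.factor": "50",                    # streams merged at once in a sort
}


def as_hive_set_statements(settings: Dict[str, str]) -> str:
    """Render settings as Hive session-level SET statements, one per line."""
    return "\n".join(f"SET {key}={value};" for key, value in settings.items())


if __name__ == "__main__":
    print(as_hive_set_statements(TUNING))
```

Keeping the settings in one place makes it easier to review what a script changes, which matters when the person tuning the job is not the person who wrote the query.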
Access to Hive is mainly done via the command line interface