Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
using storage formats: Parquet, ORC, RCFile and Avro
Compression: Snappy, zlib and default compression (gzip)
4. Goal of Benchmark
• The goals of this benchmark are:
• To provide a comprehensive overview and testing of interactive SQL on
Hadoop
• To measure response time and performance on each platform across
different storage formats
• To measure compression (data size) across different dimensions
• To gain a better understanding of the performance gain we may potentially
see with queries on each of these platforms.
• Avro is widely used across LinkedIn, and testing it on the newer platform (Hive
13) with other tools (Tez, Presto, etc.) will give us a good understanding of the
performance gain we may potentially see.
5. System, Storage formats and compression
Systems chosen:
Hive - version 13.1
Hive13 on Tez + Yarn - Tez version: 0.4.1
Presto - versions 0.74, 0.79 and 0.80
Storage Formats and compression:
ORC + zlib compression
RCFile + snappy
Parquet + snappy
Avro + Avro native compression - Deflate level 9
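The format/compression pairings above might be declared in Hive DDL along these lines. This is a hedged sketch with a hypothetical table and column name; the exact properties (and whether `STORED AS AVRO` shorthand is available in Hive 13) depend on the Hive build in use.

```sql
-- Sketch (hypothetical table/column names) of how each pairing could be set up.
-- ORC + zlib:
CREATE TABLE pageviewevent_orc (trackingcode STRING)
STORED AS ORC TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- Parquet + Snappy:
CREATE TABLE pageviewevent_parquet (trackingcode STRING)
STORED AS PARQUET TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

-- RCFile + Snappy (session-level output compression):
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
CREATE TABLE pageviewevent_rc (trackingcode STRING) STORED AS RCFILE;

-- Avro + native Deflate level 9:
SET avro.output.codec=deflate;
SET avro.mapred.deflate.level=9;
```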
6. System, Storage formats and compression
• For Presto, the dataset was created in RCFile.
• RCFile was the most recommended format for Presto at the time of
this evaluation.
• During the evaluation, Presto had issues working with Avro and
Parquet: queries either did not run or were not well optimized.
• With the release of Presto v0.80, we also tested the ORC file format.
• We flattened certain data (pageviewevent) to support the benchmark on
Presto.
• Currently Presto supports only the Map complex datatype; Structs and
similar types must be accessed with the json_extract function.
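To illustrate the last point, a struct column that Hive would address with dot notation arrives in Presto as a JSON string and has to be unpacked with the JSON functions. A minimal sketch, assuming a flattened table and a JSON-encoded requestheader column as described above:

```sql
-- Sketch: extracting a struct field in Presto via json_extract_scalar
-- instead of Hive's requestheader.pagekey dot notation.
SELECT
  json_extract_scalar(requestheader, '$.pagekey') AS pagekey,
  count(*) AS pv
FROM pageviewevent_flat
WHERE datepartition = '2014-07-15'
GROUP BY json_extract_scalar(requestheader, '$.pagekey');
```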
7. About Hive 13
• This is the next version of Hive at LinkedIn.
• Hive is used heavily at LinkedIn for interactive SQL capability, by
users who are not PigLatin savvy and prefer a SQL solution.
• Hive is generally slow as it runs on MapReduce and competes for
mappers and reducers on the HDFS system with PigLatin and
vanilla MapReduce jobs.
8. About Hive 13-on-Tez
• Tez is a new application framework built on Hadoop YARN that can execute complex directed
acyclic graphs (DAGs) of general data processing tasks. In many ways it can be thought of as a
more flexible and powerful successor to the MapReduce framework, built by Hortonworks.
• It generalizes map and reduce tasks by exposing interfaces for generic data processing tasks,
which consist of a triplet of interfaces: input, output and processor. These tasks are the
vertices in the execution graph. Edges (i.e., data connections between tasks) are first-class
citizens in Tez and, together with the input/output interfaces, greatly increase the flexibility of
how data is transferred between tasks.
• Tez also greatly extends the possible ways in which individual tasks can be linked together;
in fact, any arbitrary DAG can be executed directly in Tez. In Tez parlance, a MapReduce job is
basically a simple DAG consisting of a single map vertex and a single reduce vertex connected
by a "bipartite" edge (i.e., the edge connects every map task to every reduce task). Map inputs
and reduce outputs are HDFS inputs and outputs respectively. The map output class locally
sorts and partitions the data by a certain key, while the reduce input class merge-sorts its
data on the same key.
• Tez also provides what is basically a MapReduce compatibility layer that lets one run MR jobs
on top of the new execution layer by implementing Map/Reduce concepts on the new
execution framework.
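From Hive's point of view, switching an existing session to the Tez execution layer is a one-line change (as noted in the conclusion). The tuning property below is an example of the kind of per-query parameter we had to adjust; the value shown is illustrative, not a recommendation.

```sql
-- Switch the session's execution engine from MapReduce to Tez:
SET hive.execution.engine=tez;
-- Example tuning knob (illustrative value) of the kind adjusted per query:
SET hive.tez.container.size=4096;

-- The same HiveQL then runs unchanged on Tez, e.g. Query 1:
SELECT trackingcode, count(1)
FROM pageviewevent
WHERE datepartition = '2014-07-15'
GROUP BY trackingcode
LIMIT 100;
```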
9. About Presto
• Presto is an open source distributed SQL query engine for running
interactive analytic queries against data sources of all sizes ranging from
gigabytes to petabytes.
• Presto was designed and written from the ground up for interactive
analytics and approaches the speed of commercial data warehouses
while scaling to the data size of organizations like LinkedIn and
Facebook.
10. About Dataset
• The input dataset was carefully chosen to cover not only the performance
perspective of benchmarking, but also to gain better insight into each of the systems.
It gives a good understanding of the query patterns they support, the functions, ease
of use, etc.
• Different dimension tables, facts and Aggregates.
• Data ranges anywhere from 20k rows to 80+ billion rows.
• Hive supports complex datatypes like Struct, Array, Union and Map. The data we
chose has nested structures, key-value pairs and binary data.
• We flattened the data for use in Presto, as the 0.74 version of Presto supports only
the Array and Map datatypes. The underlying data is stored as JSON, so we had to use
JSON functions to extract and refer to the data.
• One of the datasets is a flat table with 600+ columns, chosen specifically to test the
columnar functionality of the Parquet, RCFile and ORC file formats.
11. Evaluation Criteria
• We chose 15 queries for our testing and benchmarking. These SQL
statements are among the queries users commonly run in the DWH at LinkedIn.
• The queries test the following functionality:
• date and time manipulations
• nested SQL, wildcard searches
• filter predicates, partition pruning, full table scans and joins (3-way, 2-way, etc.)
• exists, in, not exists, not in
• aggregate functions like sum, max, count(distinct), count(1)
• extracting keys from Map and Struct datatypes
12. Query 1 – simple groupby and count
select
trackingcode,
count(1)
from pageviewevent
where
datepartition='2014-07-15'
group by
trackingcode
limit 100;
13. Query 2 - case expression with filter predicates
SELECT
datepartition,
SUM(CASE when requestheader.pagekey in ('pulse-saved-articles','pulse-settings','pulse-pbar','pulse-slice-internal','pulse-share-hub','pulse-special-jobs-economy','pulse-browse','pulse-slice-connections') then 1
when requestheader.pagekey in ('pulse-slice','pulse-top-news') and (trackingcode NOT LIKE 'eml-tod%' OR trackingcode IS NULL) then 1 else 0 end) AS TODAY_PV
FROM
pageviewevent
where
datepartition = '2014-07-15'
and header.memberid > 0
group by datepartition;
14. Query 3 – check count(distinct) with wildcard search
SELECT
a.datepartition,
d.country_sk,
COUNT(1) AS total_count,
count(distinct a.header.memberid) as unique_count
FROM pageviewevent a
INNER JOIN dim_tracking_code b ON
a.trackingcode=b.tracking_code
INNER JOIN dim_page_key c ON
a.requestheader.pagekey=c.page_key AND c.is_aggregate =
1
left outer join dim_member_cntry_lcl d on a.header.memberId
= d.member_sk
WHERE a.datepartition = '2014-07-18'
AND (
LOWER(a.trackingcode) LIKE 'eml_bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt_bt1%'
OR LOWER(a.trackingcode) LIKE 'eml-bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt-bt1%'
)
GROUP BY a.datepartition, country_sk ;
15. Query 4 – Joins, filter predicates with count(distinct)
SELECT
datepartition,
coalesce(c.country_sk,-9),
COUNT(DISTINCT a.header.memberid)
FROM pageviewevent a
inner join dim_page_key b on a.requestheader.pagekey =
b.page_key
and b.page_key_group_sk = 39
and b.is_aggregate = 1
left outer join
dim_member_cntry_lcl c on a.header.memberid=
c.member_sk
where a.datepartition = '2014-07-19'
and a.header.memberid > 0
group by
datepartition, coalesce(c.country_sk,-9);
16. Query 5 – test map datatype with filter predicates
select
substr(datepartition,1,10) as date_data,
campaigntypeint,
header.memberid,
channelid,
`format` as ad_format,
publisherid, campaignid,
advertiserid,
creativeid,
parameters['organicActivityId'] as activityid,
parameters['activityType'] as socialflag,
'0' as feedposition,
sum(case when statusint in (1,4) and channelid in (2,1) then 1 when statusint in (1,4) and channelid in (2000, 3000) and parameters['sequence'] = 0 then 1 else 0 end) as imp,
sum(case when statusint = 1 and channelid in (2,1) then 1 when statusint = 1 and channelid in (2000, 3000) and parameters['sequence'] = 0 then 1 else 0 end) as imp_sas,
sum(case when channelid in (2000, 3000) and parameters['sequence'] > 0 then 1 else 0 end) as view_other,
sum(case when statusint = 1 then cost else 0.0 end) as rev_imp
from adimpressionevent
where
datepartition = '2014-07-20' and campaignTypeInt = 14
group by
substr(datepartition,1,10),
campaigntypeint, header.memberid, channelid, `format`, publisherid, campaignid,
advertiserid, creativeid,
parameters['organicActivityId'],
parameters['activityType']
limit 1000;
17. Query 6 – 2 table join with count(distinct)
select
count(distinct member_sk)
from dim_position p
join dim_company c
on c.company_sk=p.std_company_sk
and c.active='Y'
and c.company_type_sk=4
where
end_date is null
and is_primary ='Y';
18. Query 7 – 600+ column table test
select om.current_company as Company,
om.industry as Industry,
om.company_size as Company_Size,
om.current_title as Job_Title,
om.member_sk as Member_SK,
om.first_name as First_Name,
om.last_name as Last_Name,
om.email_address as Email,
om.connections as Connections ,
om.country as Country,
om.region as Region,
om.cropped_picture_id as Profile_Picture,
om.pref_locale as Pref_Locale,
om.headline as Headline
from om_segment om
where
om.ACTIVE_FLAG = 1
and om.country_sk in (162,78,75,2,57)
and om.connections > 99
and om.pageview_l30d > 0
and
(
( om.headline like '%linkedin%')
or (om.current_title like '%linkedin%')
or (
(om.headline like '%social media%' or om.headline like
'%social consultant%' or om.headline like '%social recruit%' or
om.headline like '%employer brand%')
and
(om.headline like '%train%' or om.headline like '%consult%' or
om.headline like '%advis%' or om.headline like '%recruit%')
)
or (
(om.current_title like '%social media%' or om.current_title like
'%social consultant%' or om.current_title like '%social recruit%' or
om.current_title like '%employer brand%')
and
(om.current_title like '%train%' or om.current_title like
'%consult%' or om.current_title like
'%advis%' or om.current_title like '%recruit%'
)
)
) ;
19. Query 8 – 3 table joins with uniques
select distinct f.member_sk
FROM
dim_education e join
dim_member_flat f on (e.member_sk = f.member_sk)
join
dim_school s on (e.school_sk = s.school_sk)
WHERE f.active_flag = 'Y' and (
( e.country_sk = 167 ) OR
( s.country_sk = 167 ) ) limit 1000;
20. Query 9 – wide table test (600+ columns - test columnar)
select member_sk
from om_segment
where
(lss_decision_maker_flag like 'DM' or
lss_decision_maker_flag like 'IC')
and (lss_company_tier like 'Enterprise' or
lss_company_tier like 'SMB' or lss_company_tier like
'SRM')
and (lss_customer_status like 'Prospect' or
lss_customer_status like 'Customer')
and (lss_subscriber_status like 'Online Gen Subscriber' or
lss_subscriber_status like 'Not a Subscriber')
and country_sk in
(14,194,174,95,154,167,227,37,102,78,162,163,70,193,21,132,59,101,2,242) limit 1000;
21. Query 10 – using sub-queries joins – push down
select
p.member_sk
from dim_position p
inner join (
select
position_sk,
std_title_2_sk,
member_sk
from
dim_position_std_title_2) pt
on p.position_sk = pt.position_sk
and p.member_sk = pt.member_sk
inner join (
select std_title_2_sk
from
dim_std_title_2 where std_title_2_id in
(17801,20923,11001,21845,8206,8136,22224,5204,13257,5642,8,16565,792,12949,13758)) t
on pt.std_title_2_sk = t.std_title_2_sk
inner join (
select company_sk
from dim_company
where company_size_sk > 2) c
on p.std_company_sk = c.company_sk
where p.end_date is null
and p.is_primary = 'Y' limit 1000;
22. Query 11 – test unionall
select
distinct member_sk
from (
select member_sk
from dim_education
where school_sk in (
9873, 10065, 10388, 9872, 7916, 10241, 10242, 9900,
10377, 10719, 10637, 8534, 8535, 9906)
union all
select member_sk
from dim_position
where final_company_sk in
(74701,74702,12831,159378,62771,67754,75480,79641,73975,87156,1895741,147775)
or company_sk in
(74701,74702,12831,159378,62771,67754,75480,79641,73975,87156,1895741,147775)
) x
limit 1000;
23. Query 12 – 3 table joins
create table u_smallem.retirement_members as
select distinct sds.member_sk
from u_smallem.v_retirement_dm sds inner join
dim_member_flat mem on mem.member_sk=sds.member_sk and
active_flag='Y' inner join
dim_position pos on sds.member_sk=pos.member_sk
where
(pos.final_seniority_2_sk in (6,7,9,10) OR
pos.user_supplied_title like '%senior consultant%')
UNION
select distinct current_date, mem.member_sk, 739, 4
from dim_position pos inner join
dim_member_flat mem on mem.member_sk=pos.member_sk and
active_flag='Y'
where
pos.final_company_sk in
(12254,24672,12694,16583,21410,38641,145164,32346,20918,35083,96824,49506,159381,48201,45860,215432,53484,327842,63747,78721,139406,778800)
and (final_std_title_2_sk in (select std_title_2_sk as final_st_title_2_sk
from dim_std_title_2 where occupation_id=235)
or pos.user_supplied_title like '%benefit consultant%');
24. Query 13 – time based calculations
select distinct member_sk from (
select
member_sk,
start_date,
end_date, cast(from_unixtime(unix_timestamp()-24*3600*90,'yyyyMM') as int) d1,
cast(year(from_unixtime(unix_timestamp())) as int)*100 d2,
source_created_ts
from dim_position ) x
where
start_date >= d1 or end_date >= d1
or ((start_date = d2 or end_date = d2)
and source_created_ts >= unix_timestamp()-24*3600*90)
limit 1000;
25. Query 14 – many small table joins
create table u_smallem.vs_rti_ad_order
as
select
o.ad_order_sk,
sum (r.ad_impressions) as impressions,
sum (r.ad_clicks) as clicks
from agg_daily_ad_revenue r
inner join dim_ad a on r.ad_sk = a.ad_sk
inner join dim_ad_order o on r.ad_order_sk = o.ad_order_sk
inner join dim_advertiser v on v.advertiser_sk = o.advertiser_sk
where r.datepartition >= '2014-07-01' and r.datepartition <= '2014-07-31'
and r.ad_creative_size_sk in (6,8,17,29)
and v.adv_saleschannel_name like 'Field%'
and o.lars_sales_channel_name like 'Advertising Field'
and r.ad_site_sk = 1
and r.ad_zone_sk <> 1175
and o.proposal_bind_id is not null
and
(coalesce(a.lars_product_type, 'n/a') not like 'Click Tracker' or coalesce(a.lars_product_type, 'n/a') not like 'inMail'
or coalesce(a.lars_target_type, 'n/a') not like 'Partner Message' or coalesce(a.lars_target_type, 'n/a') not like 'Polls'
)
group by o.ad_order_sk
having sum(r.ad_impressions) > 9999;
drop table if exists u_smallem.vs_final;
create table u_smallem.vs_final
as
select distinct i.member_sk from (
select member_sk, f.ad_order_sk, count(1) as impr from fact_detail_ad_impressions f join u_smallem.vs_rti_ad_order u on
f.ad_order_sk = u.ad_order_sk
where date_sk >= '2014-07-01' and date_sk <= '2014-07-07'
and ad_creative_size_sk in (6,17)
--and ad_order_sk in (select distinct ad_order_sk from u_smallem.vs_rti_ad_order)
group by member_sk, f.ad_order_sk
having count(1) > 10) i join om_segment o on i.member_sk = o.member_sk
where i.member_sk > 0 and o.pageview_l30d < 3000;
26. Query 15 – check not exists
drop table if exists u_smallem.tmp_SDS_AU;
create table u_smallem.tmp_SDS_AU
AS select distinct member_sk
from fact_bzops_follower f1
where
company_id = 2584270
and status='A'
and not exists (
select 1
from fact_bzops_follower f2
where
company_id = 3600
and status='A'
and f2.member_sk = f1.member_sk) ;
27. Query1 – Test concurrent users (Presto only)
• This exercise was performed for Presto only.
• Concurrency is measured by the number of users
running the query in parallel.
• For simplicity's sake, we chose the same query,
run by 1, 2, 4, 8 and 12 users at the same
time.
• Queries 3 and 4 failed with multiple
concurrent users, which clearly indicates that
more memory is required on the system.
• Multiple big-table joins would fail on the
system when run concurrently.
Query3:
SELECT
datepartition,
coalesce(c.country_sk,-9),
COUNT(DISTINCT a.memberid)
FROM pageviewevent_flat a
inner join dim_page_key b on a.pagekey = b.page_key and b.page_key_group_sk = 39 and b.is_aggregate = 1
left outer join dim_member_cntry_lcl c on a.memberid= c.member_sk
where a.datepartition = '2014-07-11'
and a.memberid > 0
group by datepartition, coalesce(c.country_sk,-9);
28. Query1 – Test linear growth (7 day window)
Query:
select trackingcode, count(1) from pageviewevent_flat
where
datepartition >= '2014-07-15' and datepartition <= '2014-07-16'
group by trackingcode limit 100;
• This exercise was performed on Presto and
Hive-on-Tez.
• We chose Query 1 for this test.
• Query 1 was run with 1-, 2-, 4- and 7-day ranges.
29. Query3 – Test linear growth (7 day window)
Query:
SELECT
a.datepartition,
d.country_sk,
COUNT(1) AS total_count,
count(distinct a.memberid) as unique_count
FROM pageviewevent_flat a
INNER JOIN dim_page_key c
ON a.pagekey=c.page_key AND c.is_aggregate = 1
left outer join dim_member_cntry_lcl d
on a.memberId = d.member_sk
WHERE a.datepartition >= '2014-07-18'
and a.datepartition <= '2014-07-19'
AND (
LOWER(a.trackingcode) LIKE 'eml_bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt_bt1%'
OR LOWER(a.trackingcode) LIKE 'eml-bt1%'
OR LOWER(a.trackingcode) LIKE 'emlt-bt1%'
)
GROUP BY a.datepartition, country_sk ;
• This exercise was performed on Presto and
Hive-on-Tez.
• We chose Query 3 for this test.
• Query 3 was run with 1-, 2-, 4- and 7-day ranges.
31. Conclusion
• Hive-on-Tez
• Pros:
• Environments that are running on Hive only can benefit from having Hive-on-Tez.
• Hive-on-Tez offers considerable improvement in query performance and offers an
alternative to MapReduce.
• In many cases we saw queries speed up at least 3x-8x compared to Hive.
• Switching to Hive-on-Tez is extremely simple (set hive.execution.engine=tez).
• Cons:
• For this POC, we had to tweak many Hive configuration properties to get optimal
performance for queries running on Tez. We felt this to be a drawback, as we had to tune
parameters specific to certain queries. This may be a hindrance for ad-hoc queries.
• There were a couple of queries that ran indefinitely and had to be terminated.
32. Conclusion
• Presto
• Pros:
• Presto proved to be fast and is a very good solution for ad-hoc analysis and faster table scans.
• Presto was 3x to 10x faster than Hive on MapReduce in many of the queries.
• The SQL/query federation is a compelling feature for joining MySQL or Teradata tables
to Hive tables using Presto. This is similar to the Aster Data SQL-H feature.
• Cons:
• Requires a separate installation.
• Memory was a big issue with Presto. The concurrency test we did with multiple users
clearly indicates that memory was insufficient. Also, joining two big tables requires a lot of
memory, and running such joins on Presto clearly indicates this is not going to work, as it
does not support distributed hash joins.
• DDLs are not supported.