6. The Problem
• How many logins today?
• How many individual users this week?
• Total income today?
• How many paying users this month?
• …
7. The Problem: Facts
• How many X during time period of Y
• Fact Table

user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090
8. The Problem: Facts
• How many logins today?
• How many individual users this week?
• Total income today?
• How many paying users this month?
• …
9. The Problem: Facts
• How many logins today?

user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090

select count(*) from fact where event='login' and date(timestamp)='2013-12-06';
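The query above can be tried against the four sample rows. What follows is a minimal sketch using Python's built-in sqlite3 module, not the system described in the talk. It assumes the timestamp column holds Unix epoch seconds, so it uses SQLite's `date(timestamp, 'unixepoch')` instead of a bare `date(timestamp)`, and it names the key column `uid` as the query does. The day 2013-11-06 is the UTC date the sample timestamps fall on; the helper names are hypothetical.

```python
import sqlite3

# Sample rows from the fact table; amount is NULL for login events.
ROWS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]


def make_fact_db():
    """Build an in-memory database holding the sample fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", ROWS)
    return conn


def logins_on(conn, day):
    """Count login events whose UTC calendar day equals `day` (YYYY-MM-DD)."""
    sql = (
        "SELECT COUNT(*) FROM fact "
        "WHERE event = 'login' AND date(timestamp, 'unixepoch') = ?"
    )
    return conn.execute(sql, (day,)).fetchone()[0]


if __name__ == "__main__":
    print(logins_on(make_fact_db(), "2013-11-06"))
```

Wrapping the column in `date(...)` is easy to read but forces a conversion per row; a range predicate on the raw timestamp is the usual alternative.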
10. The Problem: Facts
• How many individual users this week?

user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090

select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';
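The two `?` placeholders stand for the start and end of the week. A hedged sketch of how they might be filled in, using Python's sqlite3 module with bound (unquoted) parameters: the week is assumed to start on Monday at 00:00 UTC and the window is half-open, [start, end). `week_bounds` and `distinct_login_users` are hypothetical helper names.

```python
import sqlite3
from datetime import date, datetime, timedelta, timezone

# Sample rows from the fact table; amount is NULL for login events.
ROWS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]


def make_fact_db():
    """Build an in-memory database holding the sample fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", ROWS)
    return conn


def week_bounds(day):
    """Return (start, end) epoch seconds of the Monday-based UTC week of `day`."""
    monday = day - timedelta(days=day.weekday())
    start = datetime(monday.year, monday.month, monday.day, tzinfo=timezone.utc)
    end = start + timedelta(days=7)
    return int(start.timestamp()), int(end.timestamp())


def distinct_login_users(conn, start, end):
    """Count distinct users with a login event in the window [start, end)."""
    sql = (
        "SELECT COUNT(DISTINCT uid) FROM fact "
        "WHERE event = 'login' AND timestamp >= ? AND timestamp < ?"
    )
    return conn.execute(sql, (start, end)).fetchone()[0]


if __name__ == "__main__":
    start, end = week_bounds(date(2013, 11, 6))
    print(distinct_login_users(make_fact_db(), start, end))
```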
11. The Problem: Facts
• Total income today?

user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090

select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';
12. The Problem: Facts
• How many paying users this month?

user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090

select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
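The two queries on pay events (total income and number of paying users) differ only in the aggregate, so they can share one scan. A minimal sketch with Python's sqlite3 module, assuming epoch-second timestamps, a UTC calendar month as the window, and `COALESCE` so that an empty window yields 0 instead of NULL. `month_bounds` and `pay_stats` are hypothetical names.

```python
import sqlite3
from datetime import datetime, timezone

# Sample rows from the fact table; amount is NULL for login events.
ROWS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]


def make_fact_db():
    """Build an in-memory database holding the sample fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", ROWS)
    return conn


def month_bounds(year, month):
    """Return (start, end) epoch seconds of a UTC calendar month, end exclusive."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    end = datetime(year + month // 12, month % 12 + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(end.timestamp())


def pay_stats(conn, start, end):
    """Return (total income, distinct paying users) for the window [start, end)."""
    sql = (
        "SELECT COALESCE(SUM(amount), 0), COUNT(DISTINCT uid) FROM fact "
        "WHERE event = 'pay' AND timestamp >= ? AND timestamp < ?"
    )
    return conn.execute(sql, (start, end)).fetchone()


if __name__ == "__main__":
    print(pay_stats(make_fact_db(), *month_bounds(2013, 11)))
```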
13. The Problem: Dimensions
• How many logins today from China?
• How many individual users of each server this week?
• Total income today from new users?
• How many paying users this month from AdWords?
• …
14. The Problem: Dimensions
• The user X’s property Y is of value Z
• Dimension Table

user id   reg_time  language  refer
user_001  20100612  en        adwords
user_002  20110927  cn        facebook
user_003  20121010  fr        admob
user_004  20130522  it        tapjoy
…
15. Fact & Dimension
• Aggregation on Join

Fact table:
user id   event  amount  timestamp
user_001  login  -       1383729081
user_002  login  -       1383729082
user_001  pay    4.99    1383729084
user_003  login  -       1383729090

Dimension table:
user id   reg_time  language  refer
user_001  20100612  en        adwords
user_002  20110927  cn        facebook
user_003  20121010  fr        admob
user_004  20130522  it        tapjoy
…
16. Fact & Dimension
• How many logins today from China?
• How many individual users of each server this week?
• Total income today from new users?
• How many paying users this month from AdWords?
• …
17. Fact & Dimension
SELECT COUNT DISTINCT (on uid)
JOIN (1 fact, n dimension, on uid)
WHERE (filter by value of dimensions/facts)
GROUP BY (value of dimension)
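The template above (count distinct, join on uid, filter, group by a dimension value) can be instantiated on the sample tables. A hedged sketch using Python's sqlite3 module: the table name `dim`, the column name `uid` for "user id", and the helper names are assumptions made for illustration, and only one dimension table is joined.

```python
import sqlite3

FACT_ROWS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]
DIM_ROWS = [
    ("user_001", "20100612", "en", "adwords"),
    ("user_002", "20110927", "cn", "facebook"),
    ("user_003", "20121010", "fr", "admob"),
    ("user_004", "20130522", "it", "tapjoy"),
]


def make_db():
    """Build an in-memory database with the sample fact and dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.execute(
        "CREATE TABLE dim (uid TEXT, reg_time TEXT, language TEXT, refer TEXT)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", FACT_ROWS)
    conn.executemany("INSERT INTO dim VALUES (?, ?, ?, ?)", DIM_ROWS)
    return conn


def login_users_by_language(conn, start, end):
    """Distinct login users per language in the window [start, end)."""
    sql = """
        SELECT d.language, COUNT(DISTINCT f.uid)
        FROM fact AS f JOIN dim AS d ON f.uid = d.uid
        WHERE f.event = 'login' AND f.timestamp >= ? AND f.timestamp < ?
        GROUP BY d.language
        ORDER BY d.language
    """
    return conn.execute(sql, (start, end)).fetchall()


if __name__ == "__main__":
    # 1383696000 is 2013-11-06 00:00:00 UTC; the window covers that one day.
    print(login_users_by_language(make_db(), 1383696000, 1383782400))
```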
18. Fact & Dimension
• SQL
• -> Syntax tree
• -> Logical Plan
• -> Physical Plan

[Figure: plan tree with an agg node on top of two Join nodes, fed by three filter nodes over scan: Dimension, scan: Dimension and scan: Fact]
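A plan tree of this shape can be sketched as composable operators. The following is an illustrative pull-style version in plain Python, not Drill's or Xingcloud's operators: every function name is hypothetical, the join is a simple in-memory hash join, and only one dimension scan is joined instead of the two in the figure.

```python
# Sample tables as lists of dicts ("uid" stands for the "user id" column).
FACT = [
    {"uid": "user_001", "event": "login", "timestamp": 1383729081},
    {"uid": "user_002", "event": "login", "timestamp": 1383729082},
    {"uid": "user_001", "event": "pay", "timestamp": 1383729084},
    {"uid": "user_003", "event": "login", "timestamp": 1383729090},
]
DIM = [
    {"uid": "user_001", "language": "en"},
    {"uid": "user_002", "language": "cn"},
    {"uid": "user_003", "language": "fr"},
    {"uid": "user_004", "language": "it"},
]


def scan(table):
    """Leaf operator: yield every row of an in-memory table."""
    yield from table


def filter_rows(rows, predicate):
    """Pass through only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def hash_join(left, right, key):
    """Build a hash table on the right input, then probe it with the left."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    for row in left:
        for match in index.get(row[key], []):
            yield {**row, **match}


def count_distinct(rows, group_key, value_key):
    """Root operator: count distinct values of value_key per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], set()).add(row[value_key])
    return {group: len(values) for group, values in groups.items()}


def run_plan():
    """agg <- join <- (filter <- scan fact, filter <- scan dimension)."""
    logins = filter_rows(scan(FACT), lambda r: r["event"] == "login")
    users = filter_rows(scan(DIM), lambda r: r["language"] in ("en", "cn"))
    return count_distinct(hash_join(logins, users, "uid"), "language", "uid")


if __name__ == "__main__":
    print(run_plan())
```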
32. about Space Efficiency
• Compact data representation
  • Java object overhead: high
• JVM friendly(GC)
  • Simpler object graph
  • Less tenured space, less full GC
33. about Time Efficiency
• Cache friendly
  • data access Locality
• Superscalar: pipeline friendly
  • the inner loop problem
• SIMD friendly
  • opportunity to operate on a vector of values
• JVM friendly(JNI)
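The idea of operating on a vector of values can be illustrated with a column-oriented layout. This is only a sketch of the data layout in Python's standard `array` module: Python itself will not show the cache, pipeline or SIMD effects the slide refers to, and the column and function names are hypothetical.

```python
from array import array

# Column-oriented layout: one contiguous, typed buffer per field
# instead of one object per row.
amount = array("d", [0.0, 0.0, 4.99, 0.0])  # 8-byte floats
is_pay = array("b", [0, 0, 1, 0])           # 1-byte flags


def sum_branching(values, mask):
    """Inner loop with a data-dependent branch per value."""
    total = 0.0
    for value, flag in zip(values, mask):
        if flag:
            total += value
    return total


def sum_branch_free(values, mask):
    """Same result without a branch: multiply each value by its 0/1 flag."""
    return sum(value * flag for value, flag in zip(values, mask))


if __name__ == "__main__":
    print(sum_branching(amount, is_pay), sum_branch_free(amount, is_pay))
    print(len(amount) * amount.itemsize, "bytes for the amount column")
```

The branch-free form is the kind of loop that a compiled engine can pipeline or vectorize, because every iteration does the same work on adjacent memory.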
38. Review the Considerations
• Cache friendly
• Superscalar: pipeline friendly
• SIMD friendly
• Compact data representation
• JVM friendly(GC)
• JVM friendly(JNI)

[Figure: columns name:VarCh…, price.coupon:boole…, price.basic:flo…, with sample value 4.99]
43. Adhoc batch query
Fact
user id  event  time
user_13  login  2013-07-26
user_13  login  2013-07-26
user_76  pay    2013-07-27

Dimension
user id  nation
user_13  cn
user_76  en

DAU
nation  2013-07-26  2013-07-27
en      576         491
cn      361         945
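The DAU table is a count of distinct users per nation and day, obtained by joining fact rows to the dimension on user id. A minimal sketch in plain Python, assuming that any event (login or pay) counts as activity; `dau_by_nation` is a hypothetical name. Only the three sample rows are used, so the counts are far smaller than the slide's DAU figures, which come from a full data set.

```python
from collections import defaultdict

# Sample rows: (user id, event, day) and user id -> nation.
FACT = [
    ("user_13", "login", "2013-07-26"),
    ("user_13", "login", "2013-07-26"),
    ("user_76", "pay", "2013-07-27"),
]
DIM = {"user_13": "cn", "user_76": "en"}


def dau_by_nation(fact, dim):
    """Return {(nation, day): number of distinct active users}."""
    active = defaultdict(set)
    for uid, _event, day in fact:
        # Join on user id, then group by (nation, day).
        active[(dim[uid], day)].add(uid)
    return {key: len(users) for key, users in active.items()}


if __name__ == "__main__":
    for (nation, day), users in sorted(dau_by_nation(FACT, DIM).items()):
        print(nation, day, users)
```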
58. Jobs vs Predictions
• Offline job
  • becomes predictions of what data users may be interested in
  • by merging more queries together
  • daily predictions & hourly predictions
61. Utilising Multi-core
• Now:
  • Push data from Leaf
  • Data driven upwards
  • Pooled execution

[Figure: plan tree with agg on top of Join; one input is filter nation='en' over scan: Dimension, the other is filter date='2013-07-26' over scan: Fact]
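Push-style execution with a pool can be sketched as follows. This is an assumption-laden illustration, not the engine described in the talk: the two leaves (scan plus filter) run as tasks on a `ThreadPoolExecutor`, and their rows are then pushed upwards into a hash join and an aggregate. The leaves materialize their output as lists, whereas a real engine would more likely stream batches. All class and function names are hypothetical; the fact column is called `date` to match the filter in the figure.

```python
from concurrent.futures import ThreadPoolExecutor

FACT = [
    {"uid": "user_13", "event": "login", "date": "2013-07-26"},
    {"uid": "user_13", "event": "login", "date": "2013-07-26"},
    {"uid": "user_76", "event": "pay", "date": "2013-07-27"},
]
DIM = [
    {"uid": "user_13", "nation": "cn"},
    {"uid": "user_76", "nation": "en"},
]


class CountDistinctUsers:
    """Root operator: rows are pushed in, distinct uids are counted."""

    def __init__(self):
        self.uids = set()

    def push(self, row):
        self.uids.add(row["uid"])

    def result(self):
        return len(self.uids)


class HashJoin:
    """Receives the build side first, then forwards matches to its parent."""

    def __init__(self, parent, key):
        self.parent = parent
        self.key = key
        self.index = {}

    def build(self, rows):
        for row in rows:
            self.index.setdefault(row[self.key], []).append(row)

    def push(self, row):
        for match in self.index.get(row[self.key], []):
            self.parent.push({**match, **row})


def leaf(table, predicate):
    """Scan and filter fused into one task that runs on a pool thread."""
    return [row for row in table if predicate(row)]


def run(fact, dim, nation, day, workers=2):
    """Run both leaves in the pool, then drive their rows up the plan."""
    agg = CountDistinctUsers()
    join = HashJoin(agg, "uid")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        dim_task = pool.submit(leaf, dim, lambda r: r["nation"] == nation)
        fact_task = pool.submit(leaf, fact, lambda r: r["date"] == day)
        join.build(dim_task.result())
        for row in fact_task.result():
            join.push(row)
    return agg.result()


if __name__ == "__main__":
    print(run(FACT, DIM, "cn", "2013-07-26"))
```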
62. Adhoc batch query
• Benefits
  • Reduce repeated identical Scans
  • Merge similar Scans
  • Merge intermediate operators
  • Unified process for adhoc & batch process
  • Multi-core process of single Plan
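Merging scans means that several queries over the same table are answered in one pass instead of one pass each. A minimal sketch in plain Python under that reading; the accumulator classes and `shared_scan` are hypothetical names, and the sample rows are the ones from the fact table earlier in the deck.

```python
FACT = [
    {"uid": "user_001", "event": "login", "amount": None},
    {"uid": "user_002", "event": "login", "amount": None},
    {"uid": "user_001", "event": "pay", "amount": 4.99},
    {"uid": "user_003", "event": "login", "amount": None},
]


class Count:
    def __init__(self):
        self.n = 0

    def add(self, row):
        self.n += 1

    def value(self):
        return self.n


class DistinctUsers:
    def __init__(self):
        self.seen = set()

    def add(self, row):
        self.seen.add(row["uid"])

    def value(self):
        return len(self.seen)


class SumAmount:
    def __init__(self):
        self.total = 0.0

    def add(self, row):
        self.total += row["amount"] or 0.0

    def value(self):
        return self.total


def shared_scan(table, queries):
    """queries maps a name to (predicate, accumulator); the table is read once."""
    for row in table:
        for predicate, accumulator in queries.values():
            if predicate(row):
                accumulator.add(row)
    return {name: acc.value() for name, (_pred, acc) in queries.items()}


def run_all(table):
    """Answer four of the deck's questions with a single scan."""
    def is_login(row):
        return row["event"] == "login"

    def is_pay(row):
        return row["event"] == "pay"

    return shared_scan(table, {
        "logins": (is_login, Count()),
        "login_users": (is_login, DistinctUsers()),
        "income": (is_pay, SumAmount()),
        "paying_users": (is_pay, DistinctUsers()),
    })


if __name__ == "__main__":
    print(run_all(FACT))
```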
64. About Xingcloud
• Now
  • http://a.xingcloud.com
  • 2 billion inserts/updates daily
  • 200k+ aggregation data/day, 6k sec in total
  • query response time: <1 sec - 100 sec, 10 sec on avg.
• Future
  • Plan Merge
  • Unified process for batch, adhoc & stream process, SQL oriented
  • SQL(t): Plan with time window
65. About Drill
• Now
  • on Parquet/ORCFile on HDFS
  • Distributed Join
  • Write interface of storage engines
• Future
  • 1.0 M2: December 2013
  • 1.0 GA: Early 2014
  • more detail on https://issues.apache.org/jira/browse/DRILL