Analyzing Real-World Data with Apache Drill 
Tomer Shiran 
VP Product Management, MapR Technologies 
Co-Founder, PMC Member and Committer, Apache Drill 
November 20, 2014 
® © 2014 MapR Technologies 1 
Data is doubling in size every two years
IDC estimates that in 2020 there will be 44 zettabytes of data in the world.
[Chart: data in the world: 1.8 zettabytes (2011), 4.4 zettabytes (2013), 44 zettabytes (2020)]
Source: IDC Digital Universe
Unstructured data will account for more than 80% of the data collected by organizations.
[Chart: total data stored, 1980 to 2020, split into structured and unstructured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
NoSchema Datastores are Capturing this Data
              RELATIONAL DATABASES                     “NOSCHEMA” DATASTORES
Volume        MBs-GBs                                  TBs-PBs
Structure     Structured                               Structured, semi-structured and unstructured
Development   Planned (release cycle = months-years)   Iterative (release cycle = days-weeks)
Database      Fixed schema; DBA controls structure     Dynamic schema (schema-free); application controls structure
SQL in the Big Data World
WANT
• SQL
• BI (Tableau, MicroStrategy, etc.)
• Low latency
• Scalability
DON’T WANT
• Create and maintain schemas on:
– HDFS (Parquet, JSON, etc.)
– HBase
– MongoDB
• Transform or copy data
We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores.
APACHE DRILL
• Schema-free scale-out query engine for Hadoop and NoSQL
• Point-and-query vs. schema-first
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
40+ contributors
150+ years of experience building databases and distributed systems
Evolution Towards Self-Service Data Exploration
                                Data Modeling and Transformation   Data Visualization
Traditional BI w/ RDBMS         IT-driven                          IT-driven
Self-Service BI w/ RDBMS        IT-driven                          Self-service
SQL-on-Hadoop                   IT-driven                          Self-service
Self-Service Data Exploration   Not needed                         Self-service
Zero-day analytics
Drill’s Data Model is Flexible
[Diagram: data sources placed along two axes of flexibility, fixed schema to schema-less and flat to complex: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
RDBMS/SQL-on-Hadoop table:
Name      Gender  Age
Michael   M       6
Jennifer  F       3
Apache Drill table:
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
Drill Supports Schema Discovery On-The-Fly
Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
• Fixed schema
• Leverage schema in centralized repository (Hive Metastore)
Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
• Fixed schema, evolving schema or schema-less
• Leverage schema in centralized repository or self-describing data
Native JSON
JSON query with Drill:
SELECT po_document.AllowPartialShipment
FROM j_purchaseorder;
JSON query with Oracle:
SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER)
FROM j_purchaseorder;
Relational databases cannot provide true schema-free JSON support.
Architecture
High Level Architecture 
• Cluster of commodity servers 
– Daemon (drillbit) on each node 
• No dependency on other execution engines (MapReduce, Spark, Tez) 
– Better performance and manageability 
• ZooKeeper maintains ephemeral cluster membership information 
– drillbit uses ZooKeeper to find other drillbits in the cluster 
– Client uses ZooKeeper to find drillbits 
• Data processing unit is columnar record batches 
– Enables schema flexibility with negligible performance impact
Drill Maximizes Data Locality
[Diagram: a ZooKeeper ensemble and a cluster in which each node runs a drillbit next to a DataNode/RegionServer/mongod]
Data Source         Best Practice
HDFS or MapR-FS     drillbit on each DataNode
HBase or MapR-DB    drillbit on each RegionServer
MongoDB             drillbit on each mongod node (when using replicas, run it on the replica node)
SELECT* Query Execution
[Diagram: client (JDBC, ODBC, REST), ZooKeeper ensemble, and drillbits]
1. Find drillbits (once per session)
2. Submit query to drillbit
3. Create logical and physical execution plans
4. Farm out execution of fragments to cluster (completely distributed execution)
5. Return results to client
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
Core Modules within drillbit
• RPC Endpoint
• SQL Parser
• Optimizer
• Logical Plan
• Physical Plan
• Execution
• Distributed Cache
• Storage Plugins: DFS, Hive, HBase, MongoDB
Example: Analyzing Real-World Data 
Demo Plan 
1. Run Drill 
2. Configure DFS and MongoDB storage plugins 
3. Explore the data 
– Basics 
– Complex data 
– Views
Run Drill
Run Drill in Embedded Mode (sqlline)
$ tar xf apache-drill-0.7.0.tar.gz
$ cd apache-drill-0.7.0
$ bin/sqlline -u jdbc:drill:zk=local
You can now access the Web UI: http://localhost:8047
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json` LIMIT 1;
+---------------+---------------------------------+--------------+------+------------------------+
| yelping_since | votes                           | review_count | name | user_id                |
+---------------+---------------------------------+--------------+------+------------------------+
| 2012-02       | {"funny":1,"useful":5,"cool":0} | 6            | Lee  | qtrmBGNqCvupHMHL_bKFgQ |
+---------------+---------------------------------+--------------+------+------------------------+
• drillbit (Drill daemon) starts automatically in embedded mode
• No ZooKeeper in embedded mode (hence zk=local)
• Can’t use BI clients (JDBC/ODBC) in embedded mode
Or Run Drill in Distributed Mode…
• Make sure ZooKeeper (zkServer) is running:
$ zkServer start
• Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
• Start drillbit:
$ bin/drillbit.sh start
• Access the Web UI: http://localhost:8047
• Connect a client to the cluster (e.g., sqlline):
$ bin/sqlline -u jdbc:drill:zk=localhost:2181
• Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
• If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
• Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
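The connection strings on the two "Run Drill" slides follow one pattern, which the Python sketch below spells out. It is an illustration only and not part of Drill: the helper names `drill_jdbc_url` and `zookeeper_reachable` are made up here, the cluster name `mycluster` is a placeholder, and the comma-separated form for several ZooKeeper nodes is an assumption that does not appear on the slides. The second helper is a rough stand-in for the telnet check.

```python
import socket
from typing import Optional, Sequence


def drill_jdbc_url(zk_nodes: Sequence[str] = (), cluster: Optional[str] = None) -> str:
    """Build a Drill JDBC URL in the shape shown on the slides.

    With no ZooKeeper nodes the result is the embedded-mode URL (zk=local).
    """
    if not zk_nodes:
        return "jdbc:drill:zk=local"
    # Assumption: several ZooKeeper nodes are joined with commas.
    url = "jdbc:drill:zk=" + ",".join(zk_nodes)
    if cluster is not None:
        # Path form taken from the slide: /drill/<clustername>
        url += "/drill/" + cluster
    return url


def zookeeper_reachable(host: str = "localhost", port: int = 2181,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (like telnet)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(drill_jdbc_url())                                   # embedded mode
    print(drill_jdbc_url(["localhost:2181"]))                 # distributed mode
    print(drill_jdbc_url(["localhost:2181"], "mycluster"))    # hypothetical cluster name
```

A successful TCP connection only shows that something is listening on the port; it does not prove that the listener is ZooKeeper or that a drillbit has registered with it.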
Configure Storage Plugins
Enable MongoDB Storage Plugin
Define Workspaces in the DFS Storage Plugin 
Explore the Data: Basics
Inventory: DFS Files
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
Inventory: MongoDB Collections
$ mongo
MongoDB shell version: 2.6.5
> show databases;
admin  (empty)
local  0.078GB
yelp   0.453GB
> use yelp
> db.users.findOne()
{
  "_id" : ObjectId("54566cdf3237149de181a92a"),
  "yelping_since" : "2012-02",
  "votes" : {
    "funny" : 1,
    "useful" : 5,
    "cool" : 0
  },
  "review_count" : 6,
  "name" : "Lee",
  "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
  "friends" : [ ]
}
Let’s Go!
> SELECT *
  FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json`
  WHERE stars = 1
  LIMIT 1;
+------------+------------+------------+------------+------------+------------+------------+-------------+
| votes | user_id | review_id | stars | date | text | type | business_id |
+------------+------------+------------+------------+------------+------------+------------+-------------+
| {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
+------------+------------+------------+------------+------------+------------+------------+-------------+
Using Storage Plugins and Workspaces
> SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
[Callouts on the table name: dfs = storage plugin, root = workspace, quoted path = path relative to workspace]
> SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
> SELECT * FROM mongo.yelp.users LIMIT 1;
> USE mongo.yelp;
> SELECT * FROM users LIMIT 1;
Storage Plugin   Workspace   Table
dfs              Path        Path relative to workspace
mongo            Database    Collection
hive             Database    Table
hbase            Namespace   Table
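The three-part naming scheme above (storage plugin, workspace, table) can be sketched as a small Python helper. This is a simplification for illustration, not how Drill parses SQL: `TableRef` and `parse_table_ref` are hypothetical names, and the sketch assumes the plugin and workspace names contain no dots or backticks.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    plugin: str      # e.g. dfs, mongo, hive, hbase
    workspace: str   # path, database, or namespace, depending on the plugin
    table: str       # relative path, collection, or table


def parse_table_ref(ref: str) -> TableRef:
    """Split 'plugin.workspace.table' on the first two dots only.

    The table part may itself contain dots (for example a file name), so it
    is kept whole and only its surrounding backticks are removed.
    """
    plugin, workspace, table = ref.split(".", 2)
    return TableRef(plugin, workspace, table.strip("`"))


if __name__ == "__main__":
    print(parse_table_ref("dfs.demo.`yelp/review.json`"))
    print(parse_table_ref("mongo.yelp.users"))
```

After `USE mongo.yelp;` the first two parts are implied, which is why the last query on the slide can refer to `users` alone.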
Most Common User Names (MongoDB)
> SELECT name, count(*) AS users
  FROM mongo.yelp.users
  GROUP BY name
  ORDER BY users DESC
  LIMIT 10;
+----------+-------+
| name     | users |
+----------+-------+
| David    | 2453  |
| John     | 2378  |
| Michael  | 2322  |
| Chris    | 2202  |
| Mike     | 2037  |
| Jennifer | 1867  |
| Jessica  | 1463  |
| Jason    | 1457  |
| Michelle | 1439  |
| Brian    | 1436  |
+----------+-------+
Cities with the Most Businesses
> SELECT state, city, count(*) AS businesses
  FROM dfs.demo.`/yelp/business.json`
  GROUP BY state, city
  ORDER BY businesses DESC
  LIMIT 10;
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+
Explore the Data: Complex Data
business.json (1)
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
business.json (2)
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false},
  }
}
Which Places Are Open Right Now (22:00)?
> SELECT name, b.hours
  FROM dfs.demo.`yelp/business.json` b
  WHERE b.hours.Saturday.`open` < '22:00' AND
        b.hours.Saturday.`close` > '22:00'
  LIMIT 2;
+------------+------------+
| name | hours |
+------------+------------+
| Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
| Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
+------------+------------+
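The filter in this query compares zero-padded 24-hour "HH:MM" strings, which sort in time order when compared as text. One consequence worth seeing in code: a closing time of "00:00" (as in the business.json sample for Saturday) compares as less than '22:00', so such a place would not pass the filter. The Python sketch below mirrors the comparison and shows one possible adjustment; both function names are hypothetical and the adjusted rule is an assumption about what "closes after midnight" should mean.

```python
def is_open_naive(hours, now):
    """Mirror the query's filter: open < now AND close > now, compared as strings."""
    return hours["open"] < now and hours["close"] > now


def is_open(hours, now):
    """Variant that treats close <= open as 'closes on the following day'.

    Unlike the naive version, the opening time itself counts as open.
    """
    opens, closes = hours["open"], hours["close"]
    if closes <= opens:
        # The interval wraps past midnight, e.g. 07:00 until 00:00.
        return now >= opens or now < closes
    return opens <= now < closes


if __name__ == "__main__":
    chang_jiang_saturday = {"close": "22:30", "open": "11:00"}
    mon_ami_gabi_saturday = {"close": "00:00", "open": "07:00"}
    for hours in (chang_jiang_saturday, mon_ami_gabi_saturday):
        print(hours, is_open_naive(hours, "22:00"), is_open(hours, "22:00"))
```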
It’s 10pm in Vegas and I Want Good Hummus!
> SELECT name, stars, b.hours.Friday, categories
  FROM dfs.demo.`yelp/business.json` b
  WHERE b.hours.Friday.`open` < '22:00' AND
        b.hours.Friday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 2;
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| name                          | stars | EXPR$2                           | categories                                                  |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
| Olives                        | 4.0   | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"]                             |
| Marrakech Moroccan Restaurant | 4.0   | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+-------------------------------+-------+----------------------------------+-------------------------------------------------------------+
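As a rough mental model, REPEATED_CONTAINS in the query above acts as a membership test on an array-valued field. The Python sketch below shows that idea with exact matching only; the real function may accept more than exact keywords, and `repeated_contains` here is a hypothetical stand-in, not Drill's implementation. The sample rows reuse names and categories that appear on the slides.

```python
def repeated_contains(values, keyword):
    """True if keyword is one of the elements of the repeated (array) field."""
    return keyword in values


businesses = [
    {"name": "Olives",
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Marrakech Moroccan Restaurant",
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"]},
    {"name": "Pine Cone Restaurant",
     "categories": ["Restaurants"]},
]

# Keep only the rows whose categories array contains the keyword.
mediterranean = [b["name"] for b in businesses
                 if repeated_contains(b["categories"], "Mediterranean")]

if __name__ == "__main__":
    print(mediterranean)
```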
Flatten Repeated Values
> SELECT name, categories
  FROM dfs.demo.`yelp/business.json`
  LIMIT 3;
+----------------------------+------------------------------------------+
| name                       | categories                               |
+----------------------------+------------------------------------------+
| Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
| Pine Cone Restaurant       | ["Restaurants"]                          |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+----------------------------+------------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
  FROM dfs.demo.`yelp/business.json`
  LIMIT 5;
+----------------------------+------------------------+
| name                       | categories             |
+----------------------------+------------------------+
| Eric Goldberg, MD          | Doctors                |
| Eric Goldberg, MD          | Health & Medical       |
| Pine Cone Restaurant       | Restaurants            |
| Deforest Family Restaurant | American (Traditional) |
| Deforest Family Restaurant | Restaurants            |
+----------------------------+------------------------+
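The effect of FLATTEN in the two result sets above can be sketched in Python: each input row is repeated once per array element, with the array replaced by that element. This is a model of the observable result, not of Drill's implementation, and the generator name `flatten` is chosen here for illustration.

```python
def flatten(rows, field):
    """Yield one output row per element of row[field], copying the other fields."""
    for row in rows:
        for element in row[field]:
            yield {**row, field: element}


rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

if __name__ == "__main__":
    for flat_row in flatten(rows, "categories"):
        print(flat_row["name"], "|", flat_row["categories"])
```

In this sketch a row whose array is empty produces no output rows at all.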
Most and Least Common Business Categories
> SELECT category, count(*) AS businesses
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.demo.`yelp/business.json`) c
  GROUP BY category
  ORDER BY businesses DESC;
+--------------+------------+
| category     | businesses |
+--------------+------------+
| Restaurants  | 14303      |
…
| Australian   | 1          |
| Boat Dealers | 1          |
| Firewood     | 1          |
+--------------+------------+
715 rows selected (3.439 seconds)
> SELECT name, categories
  FROM dfs.demo.`yelp/business.json`
  WHERE true and REPEATED_CONTAINS(categories, 'Australian');
+-------------------+-------------------------------------------------------------------------+
| name              | categories                                                              |
+-------------------+-------------------------------------------------------------------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+-------------------+-------------------------------------------------------------------------+
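The first query on this slide flattens the categories and then groups and counts them. The same shape of computation can be sketched in Python with `collections.Counter`; the function name `category_counts` is hypothetical, and the three sample businesses are taken from the earlier slides rather than from the full data set.

```python
from collections import Counter


def category_counts(businesses):
    """Count how many times each category appears across all businesses,
    most common first (the Python analogue of FLATTEN + GROUP BY + ORDER BY DESC)."""
    counts = Counter()
    for business in businesses:
        counts.update(business["categories"])
    return counts.most_common()


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

if __name__ == "__main__":
    for category, n in category_counts(businesses):
        print(category, n)
```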
Explore the Data: Views
Create a View for Name-Gender Mapping
names.csv:
[Screenshot: names.csv, with two fields called out as columns[0] and columns[4]]
> CREATE VIEW dfs.tmp.`names` AS
  SELECT columns[0] AS name, columns[4] AS gender
  FROM dfs.demo.`names.csv`;
> USE dfs.tmp;
> CREATE VIEW names1 AS
  SELECT columns[0] AS name, columns[4] AS gender
  FROM dfs.demo.`names.csv`;
> SELECT * FROM dfs.tmp.names WHERE name = 'John';
+------+--------+
| name | gender |
+------+--------+
| John | Male   |
+------+--------+
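The view above gives names to two positional CSV fields. A Python sketch of the same projection follows. The contents of names.csv are not shown in the transcript, so the sample text below is invented: only the pairing of John with Male comes from the slide, and the three middle fields are placeholders. The function name `names_view` is hypothetical.

```python
import csv
import io


def names_view(csv_text):
    """Project each CSV record to {name, gender} using field positions 0 and 4."""
    for columns in csv.reader(io.StringIO(csv_text)):
        yield {"name": columns[0], "gender": columns[4]}


# Invented sample: fields 1 to 3 are placeholders, not real names.csv data.
SAMPLE_CSV = "John,x,x,x,Male\nJennifer,x,x,x,Female\n"

if __name__ == "__main__":
    for row in names_view(SAMPLE_CSV):
        if row["name"] == "John":
            print(row)
```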
Most Common Names (and their Genders) on Yelp
> SELECT u.name, n.gender, count(*) AS number
  FROM mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name
  GROUP BY u.name, n.gender
  ORDER BY number DESC
  LIMIT 10;
+----------+---------+--------+
| name     | gender  | number |
+----------+---------+--------+
| David    | Male    | 2453   |
| John     | Male    | 2378   |
| Michael  | Male    | 2322   |
| Chris    | Unknown | 2202   |
| Mike     | Male    | 2037   |
| Jennifer | Female  | 1867   |
| Jessica  | Female  | 1463   |
| Jason    | Male    | 1457   |
| Michelle | Female  | 1439   |
| Brian    | Male    | 1436   |
+----------+---------+--------+
Who Rates Higher – Men or Women?
> SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
  FROM mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name
  GROUP BY n.gender;
+---------+--------+-------+
| gender  | users  | stars |
+---------+--------+-------+
| Female  | 103684 | 3.77  |
| Male    | 97430  | 3.696 |
| Unknown | 18409  | 3.727 |
+---------+--------+-------+
Who Writes More – Men or Women?
It takes a 3-way join to find out…
> SELECT n.gender, round(avg(length(r.text))) AS review_length
  FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n
  WHERE u.name = n.name AND
        r.user_id = u.user_id
  GROUP BY n.gender;
+---------+---------------+
| gender  | review_length |
+---------+---------------+
| Male    | 665           |
| Female  | 730           |
| Unknown | 711           |
+---------+---------------+
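The logic of the 3-way join above (reviews from a file, users from MongoDB, genders from the CSV view) can be sketched in plain Python. Everything in the sample data is invented for illustration, including the user ids and review texts, and the function name is hypothetical. The sketch also assumes that each user_id and each name occurs only once in its lookup table; a SQL join would instead produce one row per matching pair.

```python
from collections import defaultdict
from statistics import mean


def avg_review_length_by_gender(reviews, users, names):
    """Join reviews -> users on user_id and users -> names on name,
    then average the review text length per gender."""
    gender_by_name = {n["name"]: n["gender"] for n in names}
    name_by_user = {u["user_id"]: u["name"] for u in users}
    lengths = defaultdict(list)
    for review in reviews:
        name = name_by_user.get(review["user_id"])
        gender = gender_by_name.get(name)
        if gender is not None:  # inner join: drop rows without a match
            lengths[gender].append(len(review["text"]))
    return {gender: round(mean(values)) for gender, values in lengths.items()}


# Invented sample data, not taken from the Yelp data set.
users = [{"user_id": "u1", "name": "John"},
         {"user_id": "u2", "name": "Jennifer"},
         {"user_id": "u3", "name": "Zed"}]
names = [{"name": "John", "gender": "Male"},
         {"name": "Jennifer", "gender": "Female"}]
reviews = [{"user_id": "u1", "text": "Great food"},
           {"user_id": "u1", "text": "Okay"},
           {"user_id": "u2", "text": "Loved it!"},
           {"user_id": "u3", "text": "Meh"},      # name not in the names table
           {"user_id": "u9", "text": "ghost"}]    # user not in the users table

if __name__ == "__main__":
    print(avg_review_length_by_gender(reviews, users, names))
```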
Drill Tweets (@ApacheDrill)
Thank You 
• Learn: incubator.apache.org/drill/ 
• Download: incubator.apache.org/drill/download/ 
• Ask questions: drill-user@incubator.apache.org 
• Contact me: tshiran@apache.org
Thank You 
Tomer Shiran, VP Product Management 
@mapr maprtech 
tshiran@mapr.com 
MapRTechnologies 
maprtech 
mapr-technologies

Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 
Apache drill self service data exploration (113)
Apache drill   self service data exploration (113)Apache drill   self service data exploration (113)
Apache drill self service data exploration (113)
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Postgres.foreign.data.wrappers.2015
Postgres.foreign.data.wrappers.2015Postgres.foreign.data.wrappers.2015
Postgres.foreign.data.wrappers.2015
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Has Traditional MDM Finally Met its Match?
Has Traditional MDM Finally Met its Match?Has Traditional MDM Finally Met its Match?
Has Traditional MDM Finally Met its Match?
 
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
Real-World Machine Learning - Leverage the Features of MapR Converged Data Pl...
 

Kürzlich hochgeladen

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...ThinkInnovation
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...ThinkInnovation
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 

Kürzlich hochgeladen (16)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
Decision Making Under Uncertainty - Is It Better Off Joining a Partnership or...
 
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformation
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are success
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 

Analyzing Real-World Data with Apache Drill

  • 1. Analyzing Real-World Data with Apache Drill. Tomer Shiran, VP Product Management, MapR Technologies; Co-Founder, PMC Member and Committer, Apache Drill. November 20, 2014. © 2014 MapR Technologies
  • 2. Data is doubling in size every two years
  • 3. IDC estimates that in 2020 there will be 44 zettabytes of data in the world, up from 1.8 zettabytes in 2011 and 4.4 zettabytes in 2013. Source: IDC Digital Universe
  • 4. Unstructured data will account for more than 80% of the data collected by organizations. [Chart: total data stored, 1980-2020, structured vs. unstructured data.] Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
  • 5. NoSchema Datastores are Capturing this Data (relational databases vs. “NoSchema” datastores, timeline 1980-2020)
      – Volume: MBs-GBs vs. TBs-PBs
      – Structure: structured vs. structured, semi-structured and unstructured
      – Development: planned (release cycle = months-years) vs. iterative (release cycle = days-weeks)
      – Database: fixed schema, DBA controls structure vs. dynamic schema (schema-free), application controls structure
  • 6. SQL in the Big Data World
      WANT:
      • SQL
      • BI (Tableau, MicroStrategy, etc.)
      • Low latency
      • Scalability
      DON’T WANT:
      • Create and maintain schemas on:
          – HDFS (Parquet, JSON, etc.)
          – HBase
          – MongoDB
      • Transform or copy data
      We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores
  • 7. APACHE DRILL
      • Schema-free scale-out query engine for Hadoop and NoSQL
      • Point-and-query vs. schema-first
      • Low latency
      • Extreme ease of use
      • Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
      40+ contributors; 150+ years of experience building databases and distributed systems
  • 8. Evolution Towards Self-Service Data Exploration
      – Traditional BI w/ RDBMS: data modeling and transformation IT-driven; data visualization IT-driven
      – Self-Service BI w/ RDBMS: data modeling and transformation IT-driven; data visualization self-service
      – SQL-on-Hadoop: data modeling and transformation IT-driven; data visualization self-service
      – Self-Service Data Exploration (zero-day analytics): data modeling and transformation not needed; data visualization self-service
  • 10. Drill’s Data Model is Flexible
      – Flexibility axes: fixed schema to schema-less; flat to complex
      – Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
      – RDBMS/SQL-on-Hadoop table (Name, Gender, Age): (Michael, M, 6), (Jennifer, F, 3)
      – Apache Drill table:
          { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
          { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  • 11. Drill Supports Schema Discovery On-The-Fly
      Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ):
      • Fixed schema
      • Leverage schema in centralized repository (Hive Metastore)
      Schema Discovered On-The-Fly (SCHEMA ON THE FLY):
      • Fixed schema, evolving schema or schema-less
      • Leverage schema in centralized repository or self-describing data
  • 12. Native JSON
      JSON query with Drill:
          SELECT po_document.AllowPartialShipment FROM j_purchaseorder;
      JSON query with Oracle:
          SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER) FROM j_purchaseorder;
      Relational databases cannot provide true schema-free JSON support.
  • 13. Architecture
  • 14. High Level Architecture
      • Cluster of commodity servers
          – Daemon (drillbit) on each node
      • No dependency on other execution engines (MapReduce, Spark, Tez)
          – Better performance and manageability
      • ZooKeeper maintains ephemeral cluster membership information
          – drillbit uses ZooKeeper to find other drillbits in the cluster
          – Client uses ZooKeeper to find drillbits
      • Data processing unit is columnar record batches
          – Enables schema flexibility with negligible performance impact
  • 15. Drill Maximizes Data Locality
      [Diagram: a ZooKeeper ensemble and a set of nodes, each running a drillbit next to a DataNode/RegionServer/mongod.]
      Data Source       | Best Practice
      HDFS or MapR-FS   | drillbit on each DataNode
      HBase or MapR-DB  | drillbit on each RegionServer
      MongoDB           | drillbit on each mongod node (when using replicas, run it on the replica node)
  • 16. SELECT* Query Execution
      Participants: client (JDBC, ODBC, REST), ZooKeeper, drillbits
      1. Find drillbits (once per session)
      2. Submit query to drillbit
      3. Create logical and physical execution plans
      4. Farm out execution of fragments to cluster (completely distributed execution)
      5. Return results to client
      * CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
  • 17. Core Modules within drillbit: RPC Endpoint; SQL Parser; Optimizer; Logical Plan; Physical Plan; Execution; Distributed Cache; Storage Plugins (DFS, Hive, HBase, MongoDB)
  • 18. Example: Analyzing Real-World Data
  • 19. Demo Plan
      1. Run Drill
      2. Configure DFS and MongoDB storage plugins
      3. Explore the data
          – Basics
          – Complex data
          – Views
  • 20. Run Drill
  • 21. Run Drill in Embedded Mode (sqlline)
      $ tar xf apache-drill-0.7.0.tar.gz
      $ cd apache-drill-0.7.0
      $ bin/sqlline -u jdbc:drill:zk=local
      You can now access the Web UI: http://localhost:8047
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json` LIMIT 1;
      +---------------+---------------------------------+--------------+------+------------------------+
      | yelping_since | votes                           | review_count | name | user_id                |
      +---------------+---------------------------------+--------------+------+------------------------+
      | 2012-02       | {"funny":1,"useful":5,"cool":0} | 6            | Lee  | qtrmBGNqCvupHMHL_bKFgQ |
      • drillbit (Drill daemon) starts automatically in embedded mode
      • No ZooKeeper in embedded mode (hence zk=local)
      • Can’t use BI clients (JDBC/ODBC) in embedded mode
  • 22. Or Run Drill in Distributed Mode…
      • Make sure ZooKeeper (zkServer) is running: $ zkServer start
      • Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
      • Start drillbit: $ bin/drillbit.sh start
      • Access the Web UI: http://localhost:8047
      • Connect a client to the cluster (e.g., sqlline): $ bin/sqlline -u jdbc:drill:zk=localhost:2181
      Notes:
      • Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
      • If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
      • Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
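The connection-string rules from the two "Run Drill" slides can be summarized in a few lines of code. The following is a minimal Python sketch, not part of the original deck: `drill_jdbc_url` and `zookeeper_reachable` are hypothetical helper names, `drillbits1` is only an illustrative cluster name, and joining several ZooKeeper hosts with commas is an assumption based on the usual ZooKeeper connect-string form. The socket probe plays the same role as the `telnet localhost 2181` check.

```python
import socket


def drill_jdbc_url(zk_hosts=None, cluster_name=None, zk_root="drill"):
    """Build a Drill JDBC URL following the patterns shown on the slides.

    No hosts      -> embedded mode (zk=local).
    Hosts only    -> zk=<host:port>[,<host:port>...]  (comma form is assumed).
    Hosts + name  -> zk=<hosts>/drill/<clustername>.
    """
    if not zk_hosts:
        return "jdbc:drill:zk=local"
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster_name:
        url += f"/{zk_root}/{cluster_name}"
    return url


def zookeeper_reachable(host="localhost", port=2181, timeout=2.0):
    """Return True if a TCP connection to ZooKeeper's client port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Embedded, single-cluster, and named-cluster variants.
print(drill_jdbc_url())
print(drill_jdbc_url(["localhost:2181"]))
print(drill_jdbc_url(["localhost:2181"], cluster_name="drillbits1"))
```

The resulting string is what would be passed to `bin/sqlline -u`; calling `zookeeper_reachable()` before connecting gives a quick yes/no answer without needing telnet.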
  • 23. Configure Storage Plugins
  • 24. Enable MongoDB Storage Plugin
  • 25. Define Workspaces in the DFS Storage Plugin
  • 26. Explore the Data: Basics
  • 27. Inventory: DFS Files
      {
        "votes": {"funny": 0, "useful": 2, "cool": 1},
        "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
        "review_id": "15SdjuK7DmYqUAj6rjGowg",
        "stars": 5,
        "date": "2007-05-17",
        "text": "dr. goldberg offers everything ...",
        "type": "review",
        "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
      }
  • 28. Inventory: MongoDB Collections
      $ mongo
      MongoDB shell version: 2.6.5
      > show databases;
      admin  (empty)
      local  0.078GB
      yelp   0.453GB
      > use yelp
      > db.users.findOne()
      {
        "_id" : ObjectId("54566cdf3237149de181a92a"),
        "yelping_since" : "2012-02",
        "votes" : { "funny" : 1, "useful" : 5, "cool" : 0 },
        "review_count" : 6,
        "name" : "Lee",
        "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
        "friends" : [ ]
      }
  • 29. Let’s Go!
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` WHERE stars = 1 LIMIT 1;
      +------------+------------+------------+------------+------------+------------+------------+-------------+
      | votes | user_id | review_id | stars | date | text | type | business_id |
      +------------+------------+------------+------------+------------+------------+------------+-------------+
      | {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
      +------------+------------+------------+------------+------------+------------+------------+-------------+
  • 30. Using Storage Plugins and Workspaces
      A table reference has three parts: storage plugin, workspace, and path relative to the workspace.
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
      > SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
      > SELECT * FROM mongo.yelp.users LIMIT 1;
      > USE mongo.yelp;
      > SELECT * FROM users LIMIT 1;
      Storage Plugin | Workspace | Table
      dfs            | Path      | Path relative to workspace
      mongo          | Database  | Collection
      hive           | Database  | Table
      hbase          | Namespace | Table
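To make the plugin/workspace/table split concrete, here is a small Python sketch of how a `dfs` table reference might map to a file path. It is an illustration only, not Drill's implementation: the `WORKSPACES` mapping and the `resolve` helper are hypothetical, and the location of the `demo` workspace is inferred from the two equivalent queries on the slide rather than stated by the deck.

```python
import posixpath

# Assumed workspace locations. The slides show the same file reached through
# dfs.root with an absolute path and through dfs.demo with a relative one,
# which suggests (but does not prove) this mapping.
WORKSPACES = {
    ("dfs", "root"): "/",
    ("dfs", "demo"): "/Users/tshiran/Development/demo/data",
}


def resolve(plugin, workspace, table):
    """Join the workspace location with the table path.

    A leading slash on the table path is treated as relative to the
    workspace; that is a simplification chosen for this sketch.
    """
    base = WORKSPACES[(plugin, workspace)]
    return posixpath.normpath(posixpath.join(base, table.lstrip("/")))


via_root = resolve("dfs", "root", "/Users/tshiran/Development/demo/data/yelp/review.json")
via_demo = resolve("dfs", "demo", "yelp/review.json")
print(via_root)
print(via_demo)
print(via_root == via_demo)
```

Under these assumptions both references point at the same file, which is why the shorter `dfs.demo` form is used in the rest of the demo.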
  • 31. Most Common User Names (MongoDB)
      > SELECT name, count(*) AS users FROM mongo.yelp.users GROUP BY name ORDER BY users DESC LIMIT 10;
      +----------+-------+
      | name     | users |
      +----------+-------+
      | David    | 2453  |
      | John     | 2378  |
      | Michael  | 2322  |
      | Chris    | 2202  |
      | Mike     | 2037  |
      | Jennifer | 1867  |
      | Jessica  | 1463  |
      | Jason    | 1457  |
      | Michelle | 1439  |
      | Brian    | 1436  |
      +----------+-------+
  • 32. Cities with the Most Businesses
      > SELECT state, city, count(*) AS businesses FROM dfs.demo.`/yelp/business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10;
      +-------+------------+------------+
      | state | city       | businesses |
      +-------+------------+------------+
      | NV    | Las Vegas  | 12021      |
      | AZ    | Phoenix    | 7499       |
      | AZ    | Scottsdale | 3605       |
      | EDH   | Edinburgh  | 2804       |
      | AZ    | Mesa       | 2041       |
      | AZ    | Tempe      | 2025       |
      | NV    | Henderson  | 1914       |
      | AZ    | Chandler   | 1637       |
      | WI    | Madison    | 1630       |
      | AZ    | Glendale   | 1196       |
      +-------+------------+------------+
  • 33. Explore the Data: Complex Data
  • 34. business.json (1)
      {
        "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
        "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
        "hours": {
          "Monday": {"close": "23:00", "open": "07:00"},
          "Tuesday": {"close": "23:00", "open": "07:00"},
          "Friday": {"close": "00:00", "open": "07:00"},
          "Wednesday": {"close": "23:00", "open": "07:00"},
          "Thursday": {"close": "23:00", "open": "07:00"},
          "Sunday": {"close": "23:00", "open": "07:00"},
          "Saturday": {"close": "00:00", "open": "07:00"}
        },
        "open": true,
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
        "city": "Las Vegas",
        "review_count": 4084,
        "name": "Mon Ami Gabi",
        "neighborhoods": ["The Strip"],
        "longitude": -115.172588519464,
  • 35. business.json (2)
        "state": "NV",
        "stars": 4.0,
        "attributes": {
          "Alcohol": "full_bar",
          "Noise Level": "average",
          "Has TV": false,
          "Attire": "casual",
          "Ambience": {
            "romantic": true, "intimate": false, "touristy": false, "hipster": false,
            "classy": true, "trendy": false, "casual": false
          },
          "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false}
        }
      }
  • 36. Which Places Are Open Right Now (22:00)?
      > SELECT name, b.hours FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Saturday.`open` < '22:00' AND b.hours.Saturday.`close` > '22:00' LIMIT 2;
      +------------+------------+
      | name | hours |
      +------------+------------+
      | Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
      | Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
      +------------+------------+
  • 37. It’s 10pm in Vegas and I Want Good Hummus!
      > SELECT name, stars, b.hours.Friday, categories FROM dfs.demo.`yelp/business.json` b WHERE b.hours.Friday.`open` < '22:00' AND b.hours.Friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2;
      +------------+------------+------------+------------+
      | name | stars | EXPR$2 | categories |
      +------------+------------+------------+------------+
      | Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
      | Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
      +------------+------------+------------+------------+
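The predicates in the last two queries compare `HH:MM` strings and test membership in a repeated field. A rough Python emulation, under stated assumptions, shows how they behave: `open_at` and `repeated_contains` are hypothetical helpers (the latter approximates `REPEATED_CONTAINS` as exact list membership), and the three records are abridged from rows and the sample document shown on the slides.

```python
businesses = [
    {"name": "Olives", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"close": "22:30", "open": "11:00"}},
     "categories": ["Mediterranean", "Restaurants"]},
    {"name": "Marrakech Moroccan Restaurant", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"close": "23:00", "open": "17:30"}},
     "categories": ["Mediterranean", "Middle Eastern", "Moroccan", "Restaurants"]},
    {"name": "Mon Ami Gabi", "stars": 4.0, "city": "Las Vegas",
     "hours": {"Friday": {"close": "00:00", "open": "07:00"}},
     "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"]},
]


def repeated_contains(values, keyword):
    """Approximation of REPEATED_CONTAINS: exact membership in the array."""
    return keyword in values


def open_at(hours, day, hhmm):
    """Same string comparison as the query: open < time AND close > time."""
    slot = hours.get(day)
    return slot is not None and slot["open"] < hhmm < slot["close"]


matches = [
    b for b in businesses
    if open_at(b["hours"], "Friday", "22:00")
    and repeated_contains(b["categories"], "Mediterranean")
    and b["city"] == "Las Vegas"
]
# ORDER BY stars DESC LIMIT 2 (Python's sort is stable, so ties keep input order).
matches = sorted(matches, key=lambda b: b["stars"], reverse=True)[:2]
print([b["name"] for b in matches])

# Plain string comparison does not handle closing times past midnight:
# "00:00" > "22:00" is False, so the time predicate alone rejects this record.
print(open_at(businesses[2]["hours"], "Friday", "22:00"))
```

One thing the emulation makes visible: a closing time of `00:00` sorts before `22:00` as a string, so places that close at midnight would be filtered out by this style of predicate even though they are open at 22:00.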
  • 38. Flatten Repeated Values
      > SELECT name, categories FROM dfs.demo.`yelp/business.json` LIMIT 3;
      +----------------------------+------------------------------------------+
      | name                       | categories                               |
      +----------------------------+------------------------------------------+
      | Eric Goldberg, MD          | ["Doctors","Health & Medical"]           |
      | Pine Cone Restaurant       | ["Restaurants"]                          |
      | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
      +----------------------------+------------------------------------------+
      > SELECT name, FLATTEN(categories) AS categories FROM dfs.demo.`yelp/business.json` LIMIT 5;
      +----------------------------+------------------------+
      | name                       | categories             |
      +----------------------------+------------------------+
      | Eric Goldberg, MD          | Doctors                |
      | Eric Goldberg, MD          | Health & Medical       |
      | Pine Cone Restaurant       | Restaurants            |
      | Deforest Family Restaurant | American (Traditional) |
      | Deforest Family Restaurant | Restaurants            |
      +----------------------------+------------------------+
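The effect of `FLATTEN` on the three sample rows can be reproduced in plain Python. This is only an emulation of the row expansion shown on the slide, not Drill code; the `flatten` generator is a hypothetical helper written for this sketch.

```python
rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]


def flatten(records, column):
    """Emit one output row per element of the repeated column,
    copying the other columns unchanged."""
    for record in records:
        for value in record[column]:
            yield {**record, column: value}


flat = list(flatten(rows, "categories"))
for row in flat:
    print(row["name"], "|", row["categories"])
```

Three input rows with 2 + 1 + 2 category values become five output rows, matching the second result table on the slide.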
  • 39. Most and Least Common Business Categories
      > SELECT category, count(*) AS businesses FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.demo.`yelp/business.json`) c GROUP BY category ORDER BY businesses DESC;
      +--------------+------------+
      | category     | businesses |
      +--------------+------------+
      | Restaurants  | 14303      |
      …
      | Australian   | 1          |
      | Boat Dealers | 1          |
      | Firewood     | 1          |
      +--------------+------------+
      715 rows selected (3.439 seconds)
      > SELECT name, categories FROM dfs.demo.`yelp/business.json` WHERE true and REPEATED_CONTAINS(categories, 'Australian');
      +------------+------------+
      | name | categories |
      +------------+------------+
      | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
      +------------+------------+
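The flatten-then-group pattern in the first query has a direct analogue in Python. The sketch below is an emulation over the three sample businesses shown earlier in the deck, not the full Yelp data set, so the counts are illustrative only; `collections.Counter` stands in for `GROUP BY category` plus `ORDER BY businesses DESC`.

```python
from collections import Counter

rows = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

# Inner query: one (name, category) pair per array element, like FLATTEN.
flattened = [(row["name"], category)
             for row in rows
             for category in row["categories"]]

# Outer query: count rows per category, most common first.
category_counts = Counter(category for _, category in flattened)
ranking = category_counts.most_common()

for category, businesses in ranking:
    print(category, businesses)
```

Because each business contributes one row per category after flattening, a business with several categories is counted once under each of them, which is also how the slide's totals should be read.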
  • 40. Explore the Data: Views
  • 41. Create a View for Name-Gender Mapping
      names.csv: columns[0] holds the name and columns[4] holds the gender
      > CREATE VIEW dfs.tmp.`names` AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
      > USE dfs.tmp;
      > CREATE VIEW names1 AS SELECT columns[0] AS name, columns[4] AS gender FROM dfs.demo.`names.csv`;
      > SELECT * FROM dfs.tmp.names WHERE name = 'John';
      +------+--------+
      | name | gender |
      +------+--------+
      | John | Male   |
      +------+--------+
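The view simply gives names to two positions of the header-less CSV's `columns` array. A minimal Python sketch of the same projection follows; the CSV text is a hypothetical stand-in for `names.csv` (the deck does not show the middle columns, so they are filled with `-` placeholders), and `names_view` is an illustrative helper name.

```python
import csv
import io

# Hypothetical sample in the shape the view expects: name in position 0,
# gender in position 4. The "-" fields are placeholders for unknown columns.
NAMES_CSV = (
    "John,-,-,-,Male\n"
    "Jennifer,-,-,-,Female\n"
    "Chris,-,-,-,Unknown\n"
)


def names_view(csv_text):
    """Project columns[0] AS name and columns[4] AS gender, row by row."""
    for columns in csv.reader(io.StringIO(csv_text)):
        yield {"name": columns[0], "gender": columns[4]}


# Equivalent of: SELECT * FROM dfs.tmp.names WHERE name = 'John';
john = [row for row in names_view(NAMES_CSV) if row["name"] == "John"]
print(john)
```

Like the view, the generator stores no data of its own; it re-reads the underlying file contents each time it is queried.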
  • 42. Most Common Names (and their Genders) on Yelp
      > SELECT u.name, n.gender, count(*) AS number FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY u.name, n.gender ORDER BY number DESC LIMIT 10;
      +----------+---------+--------+
      | name     | gender  | number |
      +----------+---------+--------+
      | David    | Male    | 2453   |
      | John     | Male    | 2378   |
      | Michael  | Male    | 2322   |
      | Chris    | Unknown | 2202   |
      | Mike     | Male    | 2037   |
      | Jennifer | Female  | 1867   |
      | Jessica  | Female  | 1463   |
      | Jason    | Male    | 1457   |
      | Michelle | Female  | 1439   |
      | Brian    | Male    | 1436   |
      +----------+---------+--------+
  • 43. Who Rates Higher – Men or Women?
      > SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars FROM mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name GROUP BY n.gender;
      +---------+--------+-------+
      | gender  | users  | stars |
      +---------+--------+-------+
      | Female  | 103684 | 3.77  |
      | Male    | 97430  | 3.696 |
      | Unknown | 18409  | 3.727 |
      +---------+--------+-------+
  • 44. Who Writes More – Men or Women?
      It takes a 3-way join to find out…
      > SELECT n.gender, round(avg(length(r.text))) AS review_length FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n WHERE u.name = n.name AND r.user_id = u.user_id GROUP BY n.gender;
      +---------+---------------+
      | gender  | review_length |
      +---------+---------------+
      | Male    | 665           |
      | Female  | 730           |
      | Unknown | 711           |
      +---------+---------------+
  • 45. Drill Tweets (@ApacheDrill)
  • 46. Thank You
      • Learn: incubator.apache.org/drill/
      • Download: incubator.apache.org/drill/download/
      • Ask questions: drill-user@incubator.apache.org
      • Contact me: tshiran@apache.org
  • 47. Thank You. Tomer Shiran, VP Product Management, tshiran@mapr.com. MapR social handles: @mapr, maprtech, MapRTechnologies, mapr-technologies