© 2014 MapR Technologies
Self Service Data Exploration with Apache Drill
{
  Author: { "name": "Aditya Kishore", "github": "adityakishore", "twitter": "@adiore" }
  Presenter: { "name": "Ted Dunning", "github": "tdunning", "twitter": "@ted_dunning" }
}
Data is doubling in size every two years
An estimated 44 zettabytes of data will exist in the world by 2020
2011: 1.8 zettabytes
2013: 4.4 zettabytes
2020: 44 zettabytes*
* Equivalent to roughly 700 billion 64GB iPhones
Source: IDC Digital Universe
Unstructured data will account for more than 80% of the data collected by organizations
[Chart: total data stored, 1980 to 2020, structured data vs. unstructured data]
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
Evolving distance to data
Existing approaches require a middleman (IT)
Map/Reduce: Business (analysts, developers) -> "Plumbing" development -> Data
Traditional SQL-on-Hadoop: Business (analysts, developers) -> Modeling and transformations -> Data
New SQL-on-Hadoop: Business (analysts, developers) -> Data
SQL in a NoSchema World
WANT
•  SQL
•  BI (Tableau, MicroStrategy, etc.)
•  Low latency
•  Scalability
DON’T WANT
•  Create and maintain schemas on:
–  HDFS (Parquet, JSON, etc.)
–  HBase
–  MongoDB
•  Transform or copy data
APACHE DRILL
• Schema-free scale-out query engine for Hadoop and NoSQL
• Low latency
• Extreme ease of use
• Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
Drill’s Data Model is Flexible
[Chart: data sources placed on two axes of flexibility, fixed schema to schema-less and flat to complex: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
RDBMS/SQL-on-Hadoop table
Name     | Gender | Age
Michael  | M      | 6
Jennifer | F      | 3
Apache Drill table
{
  name: {
    first: Michael,
    last: Smith
  },
  hobbies: [ski, soccer],
  district: Los Altos
}
{
  name: {
    first: Jennifer,
    last: Gates
  },
  hobbies: [sing],
  preschool: CCLC
}
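To make the contrast concrete, here is a minimal Python sketch (not Drill code) of projecting fields from records that do not share a schema. The helpers `get_path` and `project` are hypothetical names introduced for illustration, and returning `None` for a missing field is an assumption chosen to mirror SQL NULL.

```python
# Two records shaped like the "Apache Drill table" example: same top-level
# idea, but different fields (district vs. preschool).
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def get_path(record, path):
    """Follow a dotted path such as "name.first".

    Returns None when any step of the path is missing (hypothetical helper).
    """
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def project(rows, paths):
    """Build one output row per input row containing only the requested paths."""
    return [{path: get_path(row, path) for path in paths} for row in rows]


result = project(records, ["name.first", "district"])
```

No schema is declared up front: a field that exists in only some records simply comes back empty for the others.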
Running Drill takes 10 minutes
DOWNLOAD https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes
EXTRACT
$ tar xf apache-drill-0.7.0.tar.gz
$ cd apache-drill-0.7.0
RUN
$ bin/sqlline -u jdbc:drill:zk=local
QUERY
> SELECT full_name, position_title, salary
  FROM cp.`employee.json`
  LIMIT 1;
+--------------+----------------+---------+
| full_name    | position_title | salary  |
+--------------+----------------+---------+
| Sheri Nowmer | President      | 80000.0 |
+--------------+----------------+---------+
1 row selected (0.417 seconds)
& step by step
In SQL format
Introduce external data sources to Drill
Coordinates: Storage Plugin . Workspace . Table
Currently supported providers:
Provider | Workspace | Table
files    | Path      | Path relative to workspace
mongo    | Database  | Collection
hive     | Database  | Table
hbase    | Namespace | Table
Example:
> SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;
> SELECT * FROM dfs.yelp.`review.json` LIMIT 1;
> USE dfs.yelp;
> SELECT * FROM `review.json` LIMIT 1;
> SELECT * FROM hbase.users LIMIT 1;
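The coordinate scheme on this slide (storage plugin, workspace, table) can be illustrated with a small Python sketch. This is not Drill's parser: `split_coordinates` is a hypothetical helper that handles only the simple dotted names with backtick-quoted parts shown in the examples, and the `default_schema` argument is an assumed stand-in for the effect of a prior `USE` statement.

```python
import re


def split_coordinates(name, default_schema=None):
    """Split a name such as dfs.yelp.`review.json` into
    (storage_plugin, workspace, table). Hypothetical helper, not Drill code."""
    # Backtick-quoted parts may contain dots, so match them as single tokens.
    parts = [p.strip("`") for p in re.findall(r"`[^`]*`|[^.`]+", name)]
    if len(parts) == 1 and default_schema:
        # After "USE dfs.yelp;" a bare table name resolves inside that schema.
        parts = default_schema.split(".") + parts
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        # e.g. hbase.users: only the plugin and the table are given.
        return (parts[0], None, parts[1])
    raise ValueError(f"cannot resolve {name!r}")


full = split_coordinates("dfs.yelp.`review.json`")
relative = split_coordinates("`review.json`", default_schema="dfs.yelp")
```

Both calls resolve to the same table, which is why the fully qualified query and the `USE` plus short-name query on the slide are interchangeable.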
Inventory: DFS Files
{
  "votes": {"funny": 0, "useful": 2, "cool": 1},
  "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
  "review_id": "15SdjuK7DmYqUAj6rjGowg",
  "stars": 5,
  "date": "2007-05-17",
  "text": "dr. goldberg offers everything ...",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
business.json (1)
{
  "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
  "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
  "hours": {
    "Monday": {"close": "23:00", "open": "07:00"},
    "Tuesday": {"close": "23:00", "open": "07:00"},
    "Friday": {"close": "00:00", "open": "07:00"},
    "Wednesday": {"close": "23:00", "open": "07:00"},
    "Thursday": {"close": "23:00", "open": "07:00"},
    "Sunday": {"close": "23:00", "open": "07:00"},
    "Saturday": {"close": "00:00", "open": "07:00"}
  },
  "open": true,
  "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
  "city": "Las Vegas",
  "review_count": 4084,
  "name": "Mon Ami Gabi",
  "neighborhoods": ["The Strip"],
  "longitude": -115.172588519464,
business.json (2)
  "state": "NV",
  "stars": 4.0,
  "attributes": {
    "Alcohol": "full_bar",
    "Noise Level": "average",
    "Has TV": false,
    "Attire": "casual",
    "Ambience": {
      "romantic": true,
      "intimate": false,
      "touristy": false,
      "hipster": false,
      "classy": true,
      "trendy": false,
      "casual": false
    },
    "Good For": {"dessert": false, "latenight": false, "lunch": false,
                 "dinner": true, "breakfast": false, "brunch": false}
  }
}
Use cases
LAS VEGAS
NEW RESTAURANT
NEW RESTAURANT
Customers for opening party
> SELECT name, review_count
  FROM dfs.yelp.`user.json`
  ORDER BY review_count DESC
  LIMIT 50;
+----------+--------------+
| name     | review_count |
+----------+--------------+
| Victor   | 8062         |
| Jennifer | 4244         |
| Anita    | 3829         |
| ......   | ....         |
| Eileen   | 1947         |
| J        | 1946         |
| Matt     | 1942         |
+----------+--------------+
50 rows selected (1.16 seconds)
NEW RESTAURANT
Cities with most businesses
> SELECT state, city, COUNT(*) AS businesses
  FROM dfs.yelp.`business.json`
  GROUP BY state, city
  ORDER BY businesses DESC LIMIT 10;
+-------+------------+------------+
| state | city       | businesses |
+-------+------------+------------+
| NV    | Las Vegas  | 12021      |
| AZ    | Phoenix    | 7499       |
| AZ    | Scottsdale | 3605       |
| EDH   | Edinburgh  | 2804       |
| AZ    | Mesa       | 2041       |
| AZ    | Tempe      | 2025       |
| NV    | Henderson  | 1914       |
| AZ    | Chandler   | 1637       |
| WI    | Madison    | 1630       |
| AZ    | Glendale   | 1196       |
+-------+------------+------------+
Use cases
LAS VEGAS
LAS VEGAS RESTAURANT
LAS VEGAS RESTAURANT
Open restaurants at 22:00
> SELECT name, b.hours
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Saturday.`open` < '22:00' AND
        b.hours.Saturday.`close` > '22:00'
  LIMIT 1;
+------+-------+
| name | hours |
+------+-------+
| Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
+------+-------+
1 row selected (0.013 seconds)
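One caveat about the predicate in this query: comparing "HH:MM" strings works only when a business closes before midnight. The Mon Ami Gabi sample record shown earlier closes at "00:00" on Saturday, so `close > '22:00'` would be false for it even though it is open at 22:00. A minimal Python sketch of a check that covers that case follows; `is_open_at` is a hypothetical helper, and reading "close is not later than open" as "closes after midnight" is an assumption about the data.

```python
def is_open_at(hours, day, when):
    """Return True if `hours` (a dict shaped like the "hours" field in
    business.json) says the business is open at `when` ("HH:MM") on `day`."""
    slot = hours.get(day)
    if slot is None:
        return False  # no entry for that day
    opens, closes = slot["open"], slot["close"]
    if opens < closes:
        # Same-day closing: zero-padded "HH:MM" strings sort chronologically.
        return opens <= when < closes
    # Assumption: close <= open means the business closes after midnight.
    return when >= opens or when < closes


chang_jiang = {
    "Saturday": {"close": "22:30", "open": "11:00"},
    "Sunday": {"close": "21:00", "open": "16:00"},
}
mon_ami_gabi = {"Saturday": {"close": "00:00", "open": "07:00"}}
```

With these two sample records, `is_open_at(mon_ami_gabi, "Saturday", "22:00")` is true, while the plain string comparison used in the SQL predicate would skip that record.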
  
	
  
LAS VEGAS RESTAURANT
Finding hummus at 22:00
> SELECT name, stars, b.hours.Wednesday, categories
  FROM dfs.yelp.`business.json` b
  WHERE b.hours.Wednesday.`open` < '22:00' AND
        b.hours.Wednesday.`close` > '22:00' AND
        REPEATED_CONTAINS(categories, 'Mediterranean') AND
        city = 'Las Vegas'
  ORDER BY stars DESC
  LIMIT 1;
+------+-------+--------+------------+
| name | stars | EXPR$2 | categories |
+------+-------+--------+------------+
| Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
+------+-------+--------+------------+
1 row selected (2.185 seconds)
APACHE DRILL
Unique benefits
• Working with repeated values
Flatten Repeated Values
> SELECT name, categories
  FROM dfs.yelp.`business.json` LIMIT 3;
+----------------------------+--------------------------------------+
| name                       | categories                           |
+----------------------------+--------------------------------------+
| Eric Goldberg, MD          | ["Doctors","Health & Medical"]       |
| Pine Cone Restaurant       | ["Restaurants"]                      |
| Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
+----------------------------+--------------------------------------+
> SELECT name, FLATTEN(categories) AS categories
  FROM dfs.yelp.`business.json` LIMIT 3;
+----------------------+------------------+
| name                 | categories       |
+----------------------+------------------+
| Eric Goldberg, MD    | Doctors          |
| Eric Goldberg, MD    | Health & Medical |
| Pine Cone Restaurant | Restaurants      |
+----------------------+------------------+
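The effect of FLATTEN on these rows can be sketched in a few lines of Python. This is an illustration of the row-per-element behaviour seen in the output above, not Drill's implementation; the function name `flatten` is introduced here for the example.

```python
def flatten(rows, column):
    """Yield one output row per element of the list stored in `column`,
    copying the other columns unchanged."""
    for row in rows:
        for value in row[column]:
            out = dict(row)      # shallow copy so the input row is untouched
            out[column] = value  # replace the list with a single element
            yield out


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
]
flat = list(flatten(businesses, "categories"))
```

The two input rows become three output rows, matching the first three rows of the flattened result on the slide.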
  
Most and Least Common Business Categories
> SELECT category, COUNT(*) AS businesses
  FROM (SELECT name, FLATTEN(categories) AS category
        FROM dfs.yelp.`business.json`)
  GROUP BY category ORDER BY businesses DESC;
+-------------+------------+
| category    | businesses |
+-------------+------------+
| Restaurants | 14303      |
| ........... | .....      |
| Firewood    | 1          |
+-------------+------------+
715 rows selected (3.439 seconds)
> SELECT name, categories FROM dfs.yelp.`business.json`
  WHERE true AND REPEATED_CONTAINS(categories, 'Australian');
+------+------------+
| name | categories |
+------+------------+
| The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
+------+------------+
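The two queries on this slide (count businesses per flattened category, then look up the businesses carrying one category) can be mirrored in plain Python. This is a sketch for intuition only: `category_counts` and `repeated_contains` are names chosen for the example, and the membership test here is a simplified exact match rather than a reproduction of Drill's REPEATED_CONTAINS.

```python
from collections import Counter


def category_counts(businesses):
    """Count businesses per category, most common first
    (the FLATTEN + GROUP BY + ORDER BY ... DESC pattern)."""
    counts = Counter(
        category
        for business in businesses
        for category in business["categories"]
    )
    return counts.most_common()


def repeated_contains(values, wanted):
    """Simplified stand-in: exact-match membership in a repeated field."""
    return wanted in values


businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]
ranking = category_counts(businesses)
restaurants = [b["name"] for b in businesses
               if repeated_contains(b["categories"], "Restaurants")]
```

On the three sample rows from the previous slide, "Restaurants" ranks first with two businesses and every other category appears once.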
  
APACHE DRILL
• Views - Dynamic and Materialized
Create a view combining business and reviews datasets.
> CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
  SELECT b.name, b.stars, r.votes.funny,
         r.votes.useful, r.votes.cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+------------+------------+
| ok | summary |
+------------+------------+
| true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
+------------+------------+
> SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
+------------+
| Total |
+------------+
| 1125458 |
+------------+
Materialized Views AKA Tables
> ALTER SESSION SET `store.format` = 'parquet';
> CREATE TABLE dfs.tmp.BusinessReviewsTbl AS
  SELECT b.name, b.stars, r.votes.funny funny,
         r.votes.useful useful, r.votes.cool cool, r.`date`
  FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
  WHERE r.business_id = b.business_id;
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 1_0      | 176448                    |
| 1_1      | 192439                    |
| 1_2      | 198625                    |
| 1_3      | 200863                    |
| 1_4      | 181420                    |
| 1_5      | 175663                    |
+----------+---------------------------+
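As a quick consistency check, the per-fragment record counts reported by this CREATE TABLE AS statement should add up to the COUNT(*) returned for the BusinessReviews view on the previous slide, since both run the same join. A short Python check using only the numbers shown on the two slides:

```python
# Records written per fragment, as reported in the CTAS output.
fragment_records = {
    "1_0": 176448,
    "1_1": 192439,
    "1_2": 198625,
    "1_3": 200863,
    "1_4": 181420,
    "1_5": 175663,
}

total_written = sum(fragment_records.values())

# COUNT(*) of dfs.tmp.BusinessReviews reported on the view slide.
view_count = 1125458

counts_match = total_written == view_count
```

The six fragments together wrote exactly as many records as the view returns, which is what one would expect when the table materializes the same query.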
  
DRILL ARCHITECTURE
Under the hood
High Level Architecture
•  Cluster of commodity servers
–  Daemon (drillbit) on each node
•  ZooKeeper maintains ephemeral cluster membership information
–  Drillbit uses ZooKeeper to find other drillbits in the cluster
–  Client uses ZooKeeper to find drillbits
•  Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez)
–  Better performance and manageability
•  Data processing unit is columnar record batches
–  Enables schema flexibility with negligible performance impact
Drill Maximizes Data Locality
Data Source      | Best Practice
HDFS or MapR-FS  | drillbit on each DataNode
HBase or MapR-DB | drillbit on each RegionServer
MongoDB          | drillbit on each mongod node (when using replicas, run it on the replica node)
[Diagram: each node runs a drillbit next to its DataNode/RegionServer/mongod; a ZooKeeper ensemble sits alongside the cluster]
Core Modules within drillbit
[Diagram of drillbit modules:]
RPC Endpoint
SQL Parser
Logical Plan
Optimizer
Physical Plan
Execution
Storage Plugins: DFS, HBase, Hive, MongoDB
Distributed Cache
SELECT * Query Execution
[Diagram: Client (JDBC, ODBC, REST), a ZooKeeper ensemble, and several drillbits]
1.  Find drillbits (once per session)
2.  Submit query to drillbit
3.  Create logical and physical execution plans
4.  Farm out execution of fragments to cluster (completely distributed execution)
5.  Return results to client
* CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
Participate
•  Learn: http://drill.apache.org/
•  Download: http://drill.apache.org/download/
•  Ask Questions: user@drill.apache.org
•  Engage on Twitter: @ApacheDrill
Thank You
@mapr maprtech
aditya@mapr.com
Aditya Kishore
MapRTechnologies
maprtech
mapr-technologies
adi@apache.org
Or Run Drill in Distributed Mode…
•  Make sure ZooKeeper (zkServer) is running:
$ zkServer start
–  Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
•  Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
•  Start drillbit:
$ bin/drillbit.sh start
–  Access the Web UI: http://localhost:8047
•  Connect a client to the cluster (e.g., sqlline):
$ bin/sqlline -u jdbc:drill:zk=localhost:2181
–  Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
–  If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
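The connection strings used in this deck follow a small pattern that can be captured in a helper. The Python sketch below only assembles the string; `drill_jdbc_url` is a hypothetical function, the cluster name `mycluster` in the usage is made up, and joining several ZooKeeper hosts with commas is an assumption that goes beyond the single-host examples on the slides.

```python
def drill_jdbc_url(zk_hosts, cluster_name=None, zk_root="drill"):
    """Build a Drill JDBC connection string.

    zk_hosts: the string "local" for the embedded mode shown earlier,
              or a list of "host:port" strings for a ZooKeeper ensemble.
    cluster_name: only needed when several Drill clusters are registered
                  in the same ZooKeeper ensemble.
    """
    if zk_hosts == "local":
        return "jdbc:drill:zk=local"
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster_name:
        # Matches the slide's jdbc:drill:zk=localhost:2181/drill/<clustername>
        url += f"/{zk_root}/{cluster_name}"
    return url


embedded = drill_jdbc_url("local")
single = drill_jdbc_url(["localhost:2181"])
named = drill_jdbc_url(["localhost:2181"], cluster_name="mycluster")
```

The resulting string is what would be passed to `bin/sqlline -u`.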
  
user.json
{
  "yelping_since": "2007-08",
  "votes": {
    "funny": 198,
    "useful": 415,
    "cool": 206
  },
  "review_count": 283,
  "name": "Adele",
  "user_id": "9NJdKpRNwwaL4cvKq0cN6g",
  "friends": ["DrKQzBFAvxhyjLgbPSW2Qw", "ebXx-G5eFqWkfDuk22f81w", "qWLezzHxOXN-GQdInixZzw"],
  "fans": 10,
  "average_stars": 3.6499999999999999,
  "compliments": {
    "funny": 4,
    "hot": 17,
    "cool": 20
  },
  "elite": [2008, 2009, 2010, 2011, 2012, 2013, 2014]
}

 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for CassandraEdward Capriolo
 
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesMapR Technologies
 

Ähnlich wie Introduction to Apache Drill - NYC Apache Drill Meetup (20)

Self service data exploration with apache drill
Self service data exploration with apache drillSelf service data exploration with apache drill
Self service data exploration with apache drill
 
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
Adding Complex Data to Spark Stack-(Neeraja Rentachintala, MapR)
 
Apache drill self service data exploration (113)
Apache drill   self service data exploration (113)Apache drill   self service data exploration (113)
Apache drill self service data exploration (113)
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
 
What and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual GrallWhat and Why and How: Apache Drill ! - Tugdual Grall
What and Why and How: Apache Drill ! - Tugdual Grall
 
Adding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug GrallAdding Complex Data to Spark Stack by Tug Grall
Adding Complex Data to Spark Stack by Tug Grall
 
HUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
HUG Italy meet-up with Tugdual Grall, MapR Technical EvangelistHUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
HUG Italy meet-up with Tugdual Grall, MapR Technical Evangelist
 
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications!
 
NoSQL's biggest lie: SQL never went away - Martin Esmann
NoSQL's biggest lie: SQL never went away - Martin EsmannNoSQL's biggest lie: SQL never went away - Martin Esmann
NoSQL's biggest lie: SQL never went away - Martin Esmann
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Practical JSON in MySQL 5.7 and Beyond
Practical JSON in MySQL 5.7 and BeyondPractical JSON in MySQL 5.7 and Beyond
Practical JSON in MySQL 5.7 and Beyond
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Building and Scaling the Internet of Things with MongoDB at Vivint
Building and Scaling the Internet of Things with MongoDB at Vivint Building and Scaling the Internet of Things with MongoDB at Vivint
Building and Scaling the Internet of Things with MongoDB at Vivint
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
 
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
NYC* 2013 - "Advanced Data Processing: Beyond Queries and Slices"
 
Intravert Server side processing for Cassandra
Intravert Server side processing for CassandraIntravert Server side processing for Cassandra
Intravert Server side processing for Cassandra
 
Drill 1.0
Drill 1.0Drill 1.0
Drill 1.0
 
Apache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL ReferencesApache Drill – Hands-On SQL References
Apache Drill – Hands-On SQL References
 

Último

Steps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic DevelopersSteps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic Developersmichealwillson701
 
BusinessGPT - SECURITY AND GOVERNANCE FOR GENERATIVE AI.pptx
BusinessGPT  - SECURITY AND GOVERNANCE  FOR GENERATIVE AI.pptxBusinessGPT  - SECURITY AND GOVERNANCE  FOR GENERATIVE AI.pptx
BusinessGPT - SECURITY AND GOVERNANCE FOR GENERATIVE AI.pptxAGATSoftware
 
If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...
If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...
If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...Maxim Salnikov
 
Technical improvements. Reasons. Methods. Estimations. CJ
Technical improvements.  Reasons. Methods. Estimations. CJTechnical improvements.  Reasons. Methods. Estimations. CJ
Technical improvements. Reasons. Methods. Estimations. CJpolinaucc
 
Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...
Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...
Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...MyFAA
 
openEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scaleopenEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scaleShane Coughlan
 
MUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow ModelsMUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow ModelsUniversity of Antwerp
 
VuNet software organisation powerpoint deck
VuNet software organisation powerpoint deckVuNet software organisation powerpoint deck
VuNet software organisation powerpoint deckNaval Singh
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
 
Unlocking AI: Navigating Open Source vs. Commercial Frontiers
Unlocking AI:Navigating Open Source vs. Commercial FrontiersUnlocking AI:Navigating Open Source vs. Commercial Frontiers
Unlocking AI: Navigating Open Source vs. Commercial FrontiersRaphaël Semeteys
 
MinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young EntrepreneurMinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young EntrepreneurPriyadarshini T
 
Leveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDevLeveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDevpmgdscunsri
 
BATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data MeshBATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data MeshBATbern
 
Mobile App Development company Houston
Mobile  App  Development  company HoustonMobile  App  Development  company Houston
Mobile App Development company Houstonjennysmithusa549
 
User Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller ResumeUser Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller ResumeKaylee Miller
 
Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...
Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...
Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...Splashtop Inc
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityRandy Shoup
 
Revolutionize Your Field Service Management with FSM Grid
Revolutionize Your Field Service Management with FSM GridRevolutionize Your Field Service Management with FSM Grid
Revolutionize Your Field Service Management with FSM GridMathew Thomas
 
Enterprise Content Managements Solutions
Enterprise Content Managements SolutionsEnterprise Content Managements Solutions
Enterprise Content Managements SolutionsIQBG inc
 

Último (20)

Steps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic DevelopersSteps to Successfully Hire Ionic Developers
Steps to Successfully Hire Ionic Developers
 
BusinessGPT - SECURITY AND GOVERNANCE FOR GENERATIVE AI.pptx
BusinessGPT  - SECURITY AND GOVERNANCE  FOR GENERATIVE AI.pptxBusinessGPT  - SECURITY AND GOVERNANCE  FOR GENERATIVE AI.pptx
BusinessGPT - SECURITY AND GOVERNANCE FOR GENERATIVE AI.pptx
 
If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...
If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...
If your code could speak, what would it tell you? Let GitHub Copilot Chat hel...
 
Technical improvements. Reasons. Methods. Estimations. CJ
Technical improvements.  Reasons. Methods. Estimations. CJTechnical improvements.  Reasons. Methods. Estimations. CJ
Technical improvements. Reasons. Methods. Estimations. CJ
 
Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...
Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...
Take Advantage of Mx Tracking Flight Scheduling Solutions to Streamline Your ...
 
openEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scaleopenEuler Community Overview - a presentation showing the current scale
openEuler Community Overview - a presentation showing the current scale
 
MUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow ModelsMUT4SLX: Extensions for Mutation Testing of Stateflow Models
MUT4SLX: Extensions for Mutation Testing of Stateflow Models
 
VuNet software organisation powerpoint deck
VuNet software organisation powerpoint deckVuNet software organisation powerpoint deck
VuNet software organisation powerpoint deck
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Unlocking AI: Navigating Open Source vs. Commercial Frontiers
Unlocking AI:Navigating Open Source vs. Commercial FrontiersUnlocking AI:Navigating Open Source vs. Commercial Frontiers
Unlocking AI: Navigating Open Source vs. Commercial Frontiers
 
MinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young EntrepreneurMinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
MinionLabs_Mr. Gokul Srinivas_Young Entrepreneur
 
Leveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDevLeveling Up your Branding and Mastering MERN: Fullstack WebDev
Leveling Up your Branding and Mastering MERN: Fullstack WebDev
 
BATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data MeshBATbern52 Swisscom's Journey into Data Mesh
BATbern52 Swisscom's Journey into Data Mesh
 
Mobile App Development company Houston
Mobile  App  Development  company HoustonMobile  App  Development  company Houston
Mobile App Development company Houston
 
User Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller ResumeUser Experience Designer | Kaylee Miller Resume
User Experience Designer | Kaylee Miller Resume
 
20140812 - OBD2 Solution
20140812 - OBD2 Solution20140812 - OBD2 Solution
20140812 - OBD2 Solution
 
Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...
Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...
Splashtop Enterprise Brochure - Remote Computer Access and Remote Support Sof...
 
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of SimplicityLarge Scale Architecture -- The Unreasonable Effectiveness of Simplicity
Large Scale Architecture -- The Unreasonable Effectiveness of Simplicity
 
Revolutionize Your Field Service Management with FSM Grid
Revolutionize Your Field Service Management with FSM GridRevolutionize Your Field Service Management with FSM Grid
Revolutionize Your Field Service Management with FSM Grid
 
Enterprise Content Managements Solutions
Enterprise Content Managements SolutionsEnterprise Content Managements Solutions
Enterprise Content Managements Solutions
 

Introduction to Apache Drill - NYC Apache Drill Meetup

  • 1. © 2014 MapR Technologies. Self Service Data Exploration with Apache Drill { Author: { "name": "Aditya Kishore", "github": "adityakishore", "twitter": "@adiore" }, Presenter: { "name": "Ted Dunning", "github": "tdunning", "twitter": "@ted_dunning" } }
  • 2. Data is doubling in size every two years
  • 3. The world is estimated to hold 44 zettabytes of data in 2020. Chart: 1.8 zettabytes (2011), 4.4 zettabytes (2013), …, 44 zettabytes* (2020). Source: IDC Digital Universe. * Equivalent of roughly 700 billion 64GB iPhones (44 ZB / 64 GB ≈ 6.9 × 10^11)
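The growth figures on slides 2 and 3 can be cross-checked with a few lines of arithmetic. The sketch below assumes steady exponential growth between the quoted data points and decimal units (1 ZB = 10^21 bytes, 1 GB = 10^9 bytes); the helper name `doubling_time_years` is illustrative and not part of the deck.

```python
import math

# Figures quoted on the slide (IDC Digital Universe), in zettabytes.
ZB_2011, ZB_2013, ZB_2020 = 1.8, 4.4, 44.0


def doubling_time_years(start, end, years):
    """Years needed to double, assuming steady exponential growth."""
    return years * math.log(2) / math.log(end / start)


# Implied doubling time between the quoted data points.
dt_2011_2013 = doubling_time_years(ZB_2011, ZB_2013, 2)
dt_2013_2020 = doubling_time_years(ZB_2013, ZB_2020, 7)

# Number of 64 GB devices needed to hold 44 ZB (decimal units assumed).
ZETTABYTE = 10**21
GIGABYTE = 10**9
devices = 44 * ZETTABYTE // (64 * GIGABYTE)

print(f"2011-2013 doubling time: {dt_2011_2013:.2f} years")
print(f"2013-2020 doubling time: {dt_2013_2020:.2f} years")
print(f"64 GB devices for 44 ZB: {devices:,}")
```

Under these assumptions the 2013 to 2020 projection corresponds to a doubling time of a little over two years, which is consistent with the "doubling every two years" headline.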
  • 4. Unstructured data will account for more than 80% of the data collected by organizations. [Chart: total data stored, 1980 to 2020, structured data vs. unstructured data.] Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
  • 5. Evolving distance to data. Existing approaches require a middleman (IT) between the business (analysts, developers) and the data:
      – Map/Reduce: "plumbing" development sits between business and data
      – Traditional SQL-on-Hadoop: modeling and transformations sit between business and data
      – New SQL-on-Hadoop: business (analysts, developers) works on the data directly
  • 6. SQL in a NoSchema World
      WANT
      – SQL
      – BI (Tableau, MicroStrategy, etc.)
      – Low latency
      – Scalability
      DON'T WANT
      – Create and maintain schemas on HDFS (Parquet, JSON, etc.), HBase, MongoDB
      – Transform or copy data
  • 7. APACHE DRILL
      – Schema-free scale-out query engine for Hadoop and NoSQL
      – Low latency
      – Extreme ease of use
      – Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
  • 8. Drill's Data Model is Flexible
      Flexibility along two axes: fixed schema to schema-less, and flat to complex.
      Sources and formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro
      RDBMS/SQL-on-Hadoop table:
          Name     | Gender | Age
          Michael  | M      | 6
          Jennifer | F      | 3
      Apache Drill table:
          {
            name: {
              first: Michael,
              last: Smith
            },
            hobbies: [ski, soccer],
            district: Los Altos
          }
          {
            name: {
              first: Jennifer,
              last: Gates
            },
            hobbies: [sing],
            preschool: CCLC
          }
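To make the "schema-less, complex" corner of slide 8 concrete, here is a small Python sketch that treats the slide's two records as nested dictionaries and projects a few columns from them. It only models the idea that a field missing from one record reads as null rather than breaking the query; `project` is a hypothetical helper written for this illustration, not a Drill API.

```python
# The two records from the slide, written as Python dicts.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def project(record, path):
    """Follow a dotted path such as 'name.first'; return None if a field is absent."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


# Each record has its own shape, yet the same projection applies to both.
columns = ["name.first", "district", "preschool"]
rows = [[project(r, c) for c in columns] for r in records]

for row in rows:
    print(row)
```

No schema was declared up front: the set of columns is decided by the query, and each record contributes whatever fields it happens to have.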
  • 9. Running Drill takes 10 minutes
      DOWNLOAD
          https://cwiki.apache.org/confluence/display/DRILL/Apache+Drill+in+10+Minutes
      EXTRACT
          $ tar xf apache-drill-0.7.0.tar.gz
          $ cd apache-drill-0.7.0
      RUN
          $ bin/sqlline -u jdbc:drill:zk=local
      QUERY (step by step, in SQL format)
          > SELECT full_name, position_title, salary
            FROM cp.`employee.json`
            LIMIT 1;
          +--------------+----------------+---------+
          | full_name    | position_title | salary  |
          +--------------+----------------+---------+
          | Sheri Nowmer | President      | 80000.0 |
          +--------------+----------------+---------+
          1 row selected (0.417 seconds)
  • 10. Introduce external data sources to Drill
      Coordinates: storage plugin, workspace, table. Currently supported providers:
          Provider | Workspace | Table
          files    | Path      | Path relative to workspace
          mongo    | Database  | Collection
          hive     | Database  | Table
          hbase    | Namespace | Table
          ...
      Example:
          > SELECT * FROM dfs.root.`/E:/drill/data/yelp/review.json`;
          > SELECT * FROM dfs.yelp.`review.json` LIMIT 1;
          > USE dfs.yelp;
          > SELECT * FROM `review.json` LIMIT 1;
          > SELECT * FROM hbase.users LIMIT 1;
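The coordinate scheme on slide 10 (storage plugin, workspace, table) can be modelled as a simple name-resolution step. The Python sketch below is an illustration of how the slide's example references decompose, including the effect of `USE dfs.yelp`; `Coordinates`, `split_identifier` and `resolve` are hypothetical names, and Drill's real parser is more involved than this.

```python
from typing import NamedTuple, Optional


class Coordinates(NamedTuple):
    plugin: str               # storage plugin, e.g. dfs or hbase
    workspace: Optional[str]  # path, database or namespace; None if not given
    table: str                # file path, collection or table


def split_identifier(ref):
    """Split on dots, except inside backtick-quoted parts."""
    parts, current, quoted = [], "", False
    for ch in ref:
        if ch == "`":
            quoted = not quoted
        elif ch == "." and not quoted:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts


def resolve(ref, current_schema=None):
    """Resolve a table reference, applying the schema chosen with USE if needed."""
    parts = split_identifier(ref)
    if len(parts) == 1 and current_schema:
        parts = split_identifier(current_schema) + parts
    if len(parts) == 2:
        return Coordinates(parts[0], None, parts[1])
    if len(parts) == 3:
        return Coordinates(parts[0], parts[1], parts[2])
    raise ValueError(f"cannot resolve table reference: {ref!r}")


print(resolve("dfs.yelp.`review.json`"))
print(resolve("`review.json`", current_schema="dfs.yelp"))
print(resolve("hbase.users"))
```

The backticks matter: without them the dot in `review.json` would be read as another level of the coordinate path.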
  • 12. Inventory: DFS Files (review.json)
          {
            "votes": {"funny": 0, "useful": 2, "cool": 1},
            "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
            "review_id": "15SdjuK7DmYqUAj6rjGowg",
            "stars": 5,
            "date": "2007-05-17",
            "text": "dr. goldberg offers everything ...",
            "type": "review",
            "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
          }
  • 13. business.json (1)
          {
            "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
            "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
            "hours": {
              "Monday": {"close": "23:00", "open": "07:00"},
              "Tuesday": {"close": "23:00", "open": "07:00"},
              "Friday": {"close": "00:00", "open": "07:00"},
              "Wednesday": {"close": "23:00", "open": "07:00"},
              "Thursday": {"close": "23:00", "open": "07:00"},
              "Sunday": {"close": "23:00", "open": "07:00"},
              "Saturday": {"close": "00:00", "open": "07:00"}
            },
            "open": true,
            "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
            "city": "Las Vegas",
            "review_count": 4084,
            "name": "Mon Ami Gabi",
            "neighborhoods": ["The Strip"],
            "longitude": -115.172588519464,
  • 14. business.json (2)
            "state": "NV",
            "stars": 4.0,
            "attributes": {
              "Alcohol": "full_bar",
              "Noise Level": "average",
              "Has TV": false,
              "Attire": "casual",
              "Ambience": {
                "romantic": true,
                "intimate": false,
                "touristy": false,
                "hipster": false,
                "classy": true,
                "trendy": false,
                "casual": false
              },
              "Good For": {"dessert": false, "latenight": false, "lunch": false,
                           "dinner": true, "breakfast": false, "brunch": false}
            }
          }
  • 15. Use cases (Las Vegas): NEW RESTAURANT
  • 16. NEW RESTAURANT: Customers for opening party
          > SELECT name, review_count
            FROM dfs.yelp.`user.json`
            ORDER BY review_count DESC
            LIMIT 50;
          +----------+--------------+
          | name     | review_count |
          +----------+--------------+
          | Victor   | 8062         |
          | Jennifer | 4244         |
          | Anita    | 3829         |
          | ......   | ....         |
          | Eileen   | 1947         |
          | J        | 1946         |
          | Matt     | 1942         |
          +----------+--------------+
          50 rows selected (1.16 seconds)
  • 17. NEW RESTAURANT: Cities with most businesses
          > SELECT state, city, COUNT(*) AS businesses
            FROM dfs.yelp.`business.json`
            GROUP BY state, city
            ORDER BY businesses DESC LIMIT 10;
          +-------+------------+------------+
          | state | city       | businesses |
          +-------+------------+------------+
          | NV    | Las Vegas  | 12021      |
          | AZ    | Phoenix    | 7499       |
          | AZ    | Scottsdale | 3605       |
          | EDH   | Edinburgh  | 2804       |
          | AZ    | Mesa       | 2041       |
          | AZ    | Tempe      | 2025       |
          | NV    | Henderson  | 1914       |
          | AZ    | Chandler   | 1637       |
          | WI    | Madison    | 1630       |
          | AZ    | Glendale   | 1196       |
          +-------+------------+------------+
  • 18. Use cases (Las Vegas): LAS VEGAS RESTAURANT
  • 19. LAS VEGAS RESTAURANT: Open restaurants at 22:00
          > SELECT name, b.hours
            FROM dfs.yelp.`business.json` b
            WHERE b.hours.Saturday.`open` < '22:00' AND
                  b.hours.Saturday.`close` > '22:00'
            LIMIT 1;
          +------------+------------+
          |    name    |   hours    |
          +------------+------------+
          | Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
          +------------+------------+
          1 row selected (0.013 seconds)
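The predicate on slide 19 compares opening hours as strings, which works because zero-padded "HH:MM" values sort the same way lexicographically as they do chronologically. The Python sketch below mirrors that predicate and, as an assumption beyond the slide, adds a variant for venues whose closing time wraps past midnight (such as the "00:00" Saturday close in the business.json sample). The function names are illustrative.

```python
def open_at(hours, when):
    """Mirror the slide's predicate: open < when AND close > when (string comparison)."""
    return hours["open"] < when and hours["close"] > when


def open_at_wrapping(hours, when):
    """Variant that also accepts a closing time at or past midnight."""
    opens, closes = hours["open"], hours["close"]
    if closes <= opens:  # the opening interval wraps around midnight
        return when > opens or when < closes
    return opens < when < closes


# Saturday hours taken from the slides.
chang_jiang_sat = {"close": "22:30", "open": "11:00"}
mon_ami_gabi_sat = {"close": "00:00", "open": "07:00"}

print(open_at(chang_jiang_sat, "22:00"))            # matched by the slide's query
print(open_at(mon_ami_gabi_sat, "22:00"))           # missed: '00:00' > '22:00' is false
print(open_at_wrapping(mon_ami_gabi_sat, "22:00"))  # caught by the wrapping variant
```

For a quick exploration the plain string comparison is usually good enough, but it silently drops places that close at or after midnight.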
  • 20. LAS VEGAS RESTAURANT: Finding hummus at 22:00
          > SELECT name, stars, b.hours.Wednesday, categories
            FROM dfs.yelp.`business.json` b
            WHERE b.hours.Wednesday.`open` < '22:00' AND
                  b.hours.Wednesday.`close` > '22:00' AND
                  REPEATED_CONTAINS(categories, 'Mediterranean') AND
                  city = 'Las Vegas'
            ORDER BY stars DESC
            LIMIT 1;
          +------------+------------+------------+------------+
          |    name    |   stars    |   EXPR$2   | categories |
          +------------+------------+------------+------------+
          | Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
          +------------+------------+------------+------------+
          1 row selected (2.185 seconds)
  • 21. APACHE DRILL: Unique benefits
      – Working with repeated values
  • 22. Flatten Repeated Values
          > SELECT name, categories
            FROM dfs.yelp.`business.json` LIMIT 3;
          +------------+------------+
          |    name    | categories |
          +------------+------------+
          | Eric Goldberg, MD | ["Doctors","Health & Medical"] |
          | Pine Cone Restaurant | ["Restaurants"] |
          | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
          +------------+------------+

          > SELECT name, FLATTEN(categories) AS categories
            FROM dfs.yelp.`business.json` LIMIT 3;
          +------------+------------+
          |    name    | categories |
          +------------+------------+
          | Eric Goldberg, MD | Doctors |
          | Eric Goldberg, MD | Health & Medical |
          | Pine Cone Restaurant | Restaurants |
          +------------+------------+
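What FLATTEN does to the rows on slide 22 can be emulated in a few lines of Python: each element of the repeated column becomes its own output row, with the other columns copied alongside. This is a sketch of the semantics only, using the three businesses shown on the slide; the generator `flatten` is an illustrative stand-in, not Drill's implementation.

```python
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]


def flatten(rows, column):
    """Yield one output row per element of the repeated column."""
    for row in rows:
        for element in row[column]:
            # Copy the other columns and replace the array with a single element.
            yield {**row, column: element}


flattened = list(flatten(businesses, "categories"))
first_three = flattened[:3]  # corresponds to LIMIT 3 on the slide

for row in first_three:
    print(row["name"], "|", row["categories"])
```

Three input rows with five categories in total produce five flattened rows, which is why the first business appears twice in the slide's output.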
  • 23. Most and Least Common Business Categories
          > SELECT category, COUNT(*) AS businesses
            FROM (SELECT name, FLATTEN(categories) AS category
                  FROM dfs.yelp.`business.json`)
            GROUP BY category ORDER BY businesses DESC;
          +-------------+------------+
          | category    | businesses |
          +-------------+------------+
          | Restaurants | 14303      |
          | ........... | ....       |
          | Firewood    | 1          |
          +-------------+------------+
          715 rows selected (3.439 seconds)

          > SELECT name, categories FROM dfs.yelp.`business.json`
            WHERE true AND REPEATED_CONTAINS(categories, 'Australian');
          +------------+------------+
          |    name    | categories |
          +------------+------------+
          | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
          +------------+------------+
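The two queries on slide 23 combine ideas that are easy to mimic on a handful of rows: flatten the category arrays and count per category, then filter rows whose array contains a given value. The Python sketch below uses four businesses that appear in the slides as sample data; `repeated_contains` here is a plain exact-match membership test, which is an assumption about the behaviour being illustrated rather than a statement of everything Drill's function supports.

```python
from collections import Counter

businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
    {"name": "The Australian AZ",
     "categories": ["Bars", "Burgers", "Nightlife", "Australian",
                    "Sports Bars", "Restaurants"]},
]

# FLATTEN + GROUP BY category + COUNT(*), ordered by count descending.
category_counts = Counter(c for b in businesses for c in b["categories"])
ranked = category_counts.most_common()


def repeated_contains(values, needle):
    """Exact-match membership test on a repeated (array) column."""
    return needle in values


australian = [b["name"] for b in businesses
              if repeated_contains(b["categories"], "Australian")]

print(ranked[0])
print(australian)
```

On the full dataset the same shape of computation yields the 715 categories reported on the slide, with Restaurants at the top.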
  • 24. APACHE DRILL
      – Views: dynamic and materialized
  • 25. Create a view combining business and reviews datasets.
          > CREATE OR REPLACE VIEW dfs.tmp.BusinessReviews AS
            SELECT b.name, b.stars, r.votes.funny,
                   r.votes.useful, r.votes.cool, r.`date`
            FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
            WHERE r.business_id = b.business_id;
          +------+------------------------------------------------------------------+
          | ok   | summary                                                          |
          +------+------------------------------------------------------------------+
          | true | View 'BusinessReviews' created successfully in 'dfs.tmp' schema |
          +------+------------------------------------------------------------------+

          > SELECT COUNT(*) AS Total FROM dfs.tmp.BusinessReviews;
          +---------+
          | Total   |
          +---------+
          | 1125458 |
          +---------+
  • 26. Materialized Views AKA Tables
          > ALTER SESSION SET `store.format` = 'parquet';

          > CREATE TABLE dfs.tmp.BusinessReviewsTbl AS
            SELECT b.name, b.stars, r.votes.funny funny,
                   r.votes.useful useful, r.votes.cool cool, r.`date`
            FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r
            WHERE r.business_id = b.business_id;
          +----------+---------------------------+
          | Fragment | Number of records written |
          +----------+---------------------------+
          | 1_0      | 176448                    |
          | 1_1      | 192439                    |
          | 1_2      | 198625                    |
          | 1_3      | 200863                    |
          | 1_4      | 181420                    |
          | 1_5      | 175663                    |
          +----------+---------------------------+
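A quick sanity check ties slides 25 and 26 together: the CTAS statement is executed by six parallel fragments, and the records they write should add up to the row count reported for the view over the same join. The Python snippet below only re-adds the numbers printed on the slides.

```python
# "Number of records written" per fragment, as shown on slide 26.
fragment_records = {
    "1_0": 176448,
    "1_1": 192439,
    "1_2": 198625,
    "1_3": 200863,
    "1_4": 181420,
    "1_5": 175663,
}

# COUNT(*) over the BusinessReviews view, as shown on slide 25.
view_row_count = 1125458

total_written = sum(fragment_records.values())
print(total_written, total_written == view_row_count)
```

The totals agree, so the materialized table holds the same number of joined rows as the dynamic view, split across the fragments that wrote it.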
  • 27. ® © 2014 MapR Technologies 27 DRILL ARCHITECTURE Under the hood
  • 28. ® © 2014 MapR Technologies 28 High Level Architecture Cluster of commodity servers –  Daemon (drillbit) on each node ZooKeeper maintains ephemeral cluster membership information –  Drillbit uses ZooKeeper to find other drillbits in the cluster –  Client uses ZooKeeper to find drillbits Built-in, optimistic query execution engine. Doesn’t require a particular storage or execution system (MapReduce, Spark, Tez) –  Better performance and manageability Data processing unit is columnar record batches   –  Enables schema flexibility with negligible performance impact
  • 29. Drill Maximizes Data Locality
      Data Source         | Best Practice
      HDFS or MapR-FS     | drillbit on each DataNode
      HBase or MapR-DB    | drillbit on each RegionServer
      MongoDB             | drillbit on each mongod node (when using replicas, run it on the replica node)
      [Diagram: a drillbit co-located with each DataNode/RegionServer/mongod, alongside a ZooKeeper ensemble]
  • 30. Core Modules within drillbit
      [Diagram: RPC Endpoint; SQL Parser; Optimizer; Logical Plan; Physical Plan; Execution; Distributed Cache; Storage Plugins (DFS, Hive, HBase, MongoDB)]
  • 31. SELECT* Query Execution
      [Diagram: Client (JDBC, ODBC, REST), ZooKeeper ensemble, drillbits]
      1. Find drillbits (once per session)
      2. Submit query to drillbit
      3. Create logical and physical execution plans
      4. Farm out execution of fragments to cluster (completely distributed execution)
      5. Return results to client
      * CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
  • 32. Participate
      • Learn: http://drill.apache.org/
      • Download: http://drill.apache.org/download/
      • Ask Questions: user@drill.apache.org
      • Engage on Twitter: @ApacheDrill
  • 33. Thank You @mapr maprtech aditya@mapr.com Aditya Kishore MapRTechnologies maprtech mapr-technologies adi@apache.org
  • 34. Or Run Drill in Distributed Mode…
      • Make sure ZooKeeper (zkServer) is running:
          $ zkServer start
        • Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
      • Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
      • Start drillbit:
          $ bin/drillbit.sh start
      • Access the Web UI: http://localhost:8047
      • Connect a client to the cluster (e.g., sqlline):
          $ bin/sqlline -u jdbc:drill:zk=localhost:2181
        • Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
        • If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
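The connection-string shape on this slide can be captured in a small Python helper. This is a sketch under stated assumptions: the function name `drill_jdbc_url` and its parameters are hypothetical, the `/drill/<clustername>` suffix follows the slide, and joining several ZooKeeper hosts with commas is an assumption about the quorum syntax rather than something the slide states.

```python
def drill_jdbc_url(zk_hosts, cluster_name=None, zk_root="drill"):
    """Build a Drill JDBC URL that locates drillbits through ZooKeeper.

    zk_hosts     -- list of "host:port" strings for the ZooKeeper ensemble
    cluster_name -- optional Drill cluster name; needed when several
                    clusters are registered in the same ensemble
    zk_root      -- ZooKeeper path prefix ("drill" in the slide's example)
    """
    if not zk_hosts:
        raise ValueError("at least one ZooKeeper host is required")
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster_name:
        url += f"/{zk_root}/{cluster_name}"
    return url


# Single-node setup from the slide.
local_url = drill_jdbc_url(["localhost:2181"])

# Selecting one cluster by name ("mycluster" is a made-up name).
named_url = drill_jdbc_url(["localhost:2181"], cluster_name="mycluster")
```

The resulting string is what would be passed to `bin/sqlline -u`; omitting the cluster name yields the short form used on the slide.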
  • 35. user.json
      {
        "yelping_since": "2007-08",
        "votes": {
          "funny": 198,
          "useful": 415,
          "cool": 206
        },
        "review_count": 283,
        "name": "Adele",
        "user_id": "9NJdKpRNwwaL4cvKq0cN6g",
        "friends": ["DrKQzBFAvxhyjLgbPSW2Qw", "ebXx-G5eFqWkfDuk22f81w", "qWLezzHxOXN-GQdInixZzw"],
        "fans": 10,
        "average_stars": 3.6499999999999999,
        "compliments": {
          "funny": 4,
          "hot": 17,
          "cool": 20
        },
        "elite": [2008, 2009, 2010, 2011, 2012, 2013, 2014]
      }
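A record like this mixes scalars, nested maps, and arrays, which is the kind of structure a schema-free engine has to handle. As a rough illustration, the Python sketch below parses a trimmed copy of the record (the `friends` and `compliments` fields are left out for brevity) and reads a nested map and an array. The variable names are illustrative only.

```python
import json

# Trimmed copy of the user.json record shown on the slide.
raw = """
{
  "yelping_since": "2007-08",
  "votes": {"funny": 198, "useful": 415, "cool": 206},
  "review_count": 283,
  "name": "Adele",
  "user_id": "9NJdKpRNwwaL4cvKq0cN6g",
  "fans": 10,
  "elite": [2008, 2009, 2010, 2011, 2012, 2013, 2014]
}
"""

user = json.loads(raw)

# Nested map access, analogous to votes.useful in a query.
useful_votes = user["votes"]["useful"]

# Aggregate over a nested map and count the elements of an array.
total_votes = sum(user["votes"].values())  # 198 + 415 + 206 = 819
elite_years = len(user["elite"])
```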