Killing ETL with Drill
Charles S. Givre
@cgivre
cgivre@thedataist.com
The Extract-Transform-Load (ETL) process is one of the most time-consuming processes facing anyone who wishes to analyze data. Imagine if you could quickly, easily, and scalably merge and query data without having to spend hours in data prep. Well… you don't have to imagine it. You can with Apache Drill. In this hands-on, interactive presentation Mr. Givre will show you how to unleash the power of Apache Drill and explore your data without any kind of ETL process.


1. Killing ETL with Drill. Charles S. Givre, @cgivre, cgivre@thedataist.com
2. The problems
3. We want SQL and BI support without compromising the flexibility and capabilities of NoSchema datastores.
4. Data is not arranged in an optimal way for ad-hoc analysis
5. Data is not arranged in an optimal way for ad-hoc analysis (diagram: ETL into a Data Warehouse)
6. Analytics teams spend between 50% and 90% of their time preparing their data.
7. 76% of data scientists say this is the least enjoyable part of their job. http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
8. The ETL process consumes the most time and contributes almost no value to the end product.
9. (diagram: ETL into a Data Warehouse)
10. "Any sufficiently advanced technology is indistinguishable from magic" —Arthur C. Clarke
11. You just query the data… no schema
12. Drill is NOT just SQL on Hadoop
13. Drill scales
14. Drill is open source. Download Drill at: drill.apache.org
15. Why should you use Drill?
16. Why should you use Drill? Drill is easy to use
17. Drill is easy to use: Drill uses standard ANSI SQL
18. Drill is FAST!!
19. (benchmark charts) https://www.mapr.com/blog/comparing-sql-functions-and-performance-apache-spark-and-apache-drill
22. Quick Demo. Thank you Jair Aguirre!!
23. Quick Demo: seanlahman.com/baseball-archive/statistics
24. Quick Demo (Pig):
    data = load '/user/cloudera/data/baseball_csv/Teams.csv' using PigStorage(',');
    filtered = filter data by ($0 == '1988');
    tm_hr = foreach filtered generate (chararray) $40 as team, (int) $19 as hrs;
    ordered = order tm_hr by hrs desc;
    dump ordered;
    Execution Time: 1 minute, 38 seconds
25. Quick Demo (Drill):
    SELECT columns[40], cast(columns[19] as int) AS HR
    FROM `baseball_csv/Teams.csv`
    WHERE columns[0] = '1988'
    ORDER BY HR desc;
    Execution Time: 0.232 seconds!!
26. Drill is Versatile
27. NoSQL, No Problem
28. NoSQL, No Problem: https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
29. NoSQL, No Problem:
    SELECT t.address.zipcode AS zip, count(name) AS rests
    FROM `restaurants` t
    GROUP BY t.address.zipcode
    ORDER BY rests DESC
    LIMIT 10;
30. Querying Across Silos
31. Querying Across Silos: Farmers Market Data + Restaurant Data
32. Querying Across Silos:
    SELECT t1.Borough, t1.markets, t2.rests,
           cast(t1.markets AS FLOAT) / cast(t2.rests AS FLOAT) AS ratio
    FROM (
      SELECT Borough, count(`Farmers Markets Name`) AS markets
      FROM `farmers_markets.csv`
      GROUP BY Borough
    ) t1
    JOIN (
      SELECT borough, count(name) AS rests
      FROM mongo.test.`restaurants`
      GROUP BY borough
    ) t2 ON t1.Borough = t2.borough
    ORDER BY ratio DESC;
33. Querying Across Silos: Execution Time: 0.502 seconds
34. To follow along, please download the files at: https://github.com/cgivre/drillworkshop
35. Querying Drill
36. Querying Drill: SELECT DISTINCT management_role FROM cp.`employee.json`;
37. Querying Drill: http://localhost:8047
38. Querying Drill: SELECT * FROM cp.`employee.json` LIMIT 20
40. Querying Drill:
    SELECT <fields>
    FROM <table>
    WHERE <optional logical condition>
41. Querying Drill:
    SELECT name, address, email
    FROM customerData
    WHERE age > 20
42. Querying Drill:
    SELECT name, address, email
    FROM dfs.logs.`/data/customers.csv`
    WHERE age > 20
43. Querying Drill: in FROM dfs.logs.`/data/customers.csv`, the parts are the storage plugin (dfs), the workspace (logs), and the table (`/data/customers.csv`)
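A Drill table reference decomposes into storage plugin, workspace, and backtick-quoted table path. A minimal sketch of that decomposition (the `parse_drill_table` helper is hypothetical, purely for illustration; Drill's real parser handles many more forms, defaults, and quoting rules):

```python
import re

def parse_drill_table(from_expr: str):
    """Split a reference like dfs.logs.`/data/customers.csv` into
    (storage plugin, workspace, table path).  Illustrative only."""
    m = re.match(r"^(\w+)\.(\w+)\.`([^`]+)`$", from_expr)
    if not m:
        raise ValueError(f"unrecognized table reference: {from_expr}")
    plugin, workspace, table = m.groups()
    return plugin, workspace, table

print(parse_drill_table("dfs.logs.`/data/customers.csv`"))
# → ('dfs', 'logs', '/data/customers.csv')
```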
44. Querying Drill: supported storage plugins
    cp      Queries files in the Java classpath
    dfs     File system; can connect to remote filesystems such as Hadoop
    hbase   Connects to HBase
    hive    Integrates Drill with the Apache Hive metastore
    kudu    Provides a connection to Apache Kudu
    mongo   Connects to MongoDB
    RDBMS   Provides a connection to relational databases such as MySQL, Postgres, Oracle and others
    S3      Provides a connection to an S3 cluster
45. Problem: You have multiple log files which you would like to analyze
46. Problem: You have multiple log files which you would like to analyze. In the sample data files, there is a folder called 'logs' which contains the following structure:
47. SELECT *
    FROM dfs.drillworkshop.`logs/`
    LIMIT 10
49. dir<n> accesses the subdirectories
50. dir<n> accesses the subdirectories:
    SELECT *
    FROM dfs.drilldata.`logs/`
    WHERE dir0 = '2013'
51. Directory Functions:
    MAXDIR(), MINDIR()     Limit query to the first or last directory
    IMAXDIR(), IMINDIR()   Limit query to the first or last directory, in case-insensitive order
    Usage: WHERE dir<n> = MAXDIR('<plugin>.<workspace>', '<filename>')
52. In-Class Exercise: Find the total number of items sold by year and the total dollar sales in each year. HINT: Don't forget to CAST() the fields to appropriate data types.
    SELECT dir0 AS data_year,
           SUM( CAST( item_count AS INTEGER ) ) AS total_items,
           SUM( CAST( amount_spent AS FLOAT ) ) AS total_sales
    FROM dfs.drillworkshop.`logs/`
    GROUP BY dir0
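The GROUP BY dir0 in the exercise can be mimicked in plain Python; a sketch over invented sample rows (the real workshop files differ), where dir0 is the first path component under `logs/`:

```python
from collections import defaultdict

# Hypothetical rows keyed by file path under logs/; in Drill, dir0 is
# the first directory component below the queried folder (the year).
rows_by_file = {
    "2013/05/mylog.csv": [{"item_count": "3", "amount_spent": "20.00"}],
    "2013/06/mylog.csv": [{"item_count": "1", "amount_spent": "5.50"}],
    "2014/01/mylog.csv": [{"item_count": "2", "amount_spent": "12.00"}],
}

totals = defaultdict(lambda: {"total_items": 0, "total_sales": 0.0})
for path, rows in rows_by_file.items():
    dir0 = path.split("/")[0]          # GROUP BY dir0
    for row in rows:
        totals[dir0]["total_items"] += int(row["item_count"])      # CAST AS INTEGER
        totals[dir0]["total_sales"] += float(row["amount_spent"])  # CAST AS FLOAT

for year in sorted(totals):
    print(year, totals[year])
```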
53. Let's look at JSON data
54. Let's look at JSON data:
    [
      {
        "name": "Farley, Colette L.",
        "email": "iaculis@atarcu.ca",
        "DOB": "2011-08-14",
        "phone": "1-758-453-3833"
      },
      {
        "name": "Kelley, Cherokee R.",
        "email": "ante.blandit@malesuadafringilla.edu",
        "DOB": "1992-09-01",
        "phone": "1-595-478-7825"
      }
      …
    ]
55. Let's look at JSON data: SELECT * FROM dfs.drillworkshop.`json/customers.json`
58. What about nested data?
59. Please open baltimore_salaries.json in a text editor
60. {
      "meta" : {
        "view" : {
          "id" : "nsfe-bg53",
          "name" : "Baltimore City Employee Salaries FY2015",
          "attribution" : "Mayor's Office",
          "averageRating" : 0,
          "category" : "City Government",
          …
          "format" : { }
        }
      },
      "data" : [
        [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ],
        [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ],
        …
63. One entry of the "data" array:
    "data" : [
      [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ]
64. Drill has a series of functions for nested data
65. Let's look at this data in Drill
66. Let's look at this data in Drill: SELECT * FROM dfs.drillworkshop.`baltimore_salaries.json`
68. Let's look at this data in Drill: SELECT data FROM dfs.drillworkshop.`baltimore_salaries.json`
69. FLATTEN( <json array> ) separates elements in a repeated field into individual records.
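A rough Python analogue of what FLATTEN does, using invented records shaped like the salaries file:

```python
# FLATTEN turns each element of a repeated field into its own record.
# Minimal sketch; the document and field values here are invented.
doc = {"meta": {"view": {"name": "salaries"}},
       "data": [["row1-col0", "row1-col1"],
                ["row2-col0", "row2-col1"]]}

def flatten(records, field):
    """Emit one output record per element of the repeated field."""
    for record in records:
        for element in record[field]:
            yield {"raw_data": element}

for rec in flatten([doc], "data"):
    print(rec)
```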
70. SELECT FLATTEN( data ) AS raw_data
    FROM dfs.drillworkshop.`baltimore_salaries.json`
73. SELECT raw_data[8] AS name …
    FROM (
      SELECT FLATTEN( data ) AS raw_data
      FROM dfs.drillworkshop.`baltimore_salaries.json`
    )
74. SELECT raw_data[8] AS name, raw_data[9] AS job_title
    FROM (
      SELECT FLATTEN( data ) AS raw_data
      FROM dfs.drillworkshop.`baltimore_salaries.json`
    )
75. SELECT raw_data[9] AS job_title,
           AVG( CAST( raw_data[13] AS DOUBLE ) ) AS avg_salary,
           COUNT( DISTINCT raw_data[8] ) AS person_count
    FROM (
      SELECT FLATTEN( data ) AS raw_data
      FROM dfs.drillworkshop.`json/baltimore_salaries.json`
    )
    GROUP BY raw_data[9]
    ORDER BY avg_salary DESC
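For intuition, the aggregate query's logic run over a few invented rows laid out like the Baltimore data array (index 8 = name, index 9 = job title, index 13 = annual salary as a string, hence the CAST):

```python
from collections import defaultdict

# Invented rows in the Baltimore layout; indexes 0-7 are irrelevant here.
raw_rows = [
    [None]*8 + ["Aaron,Patricia G", "Office Services", None, None, None, "55314.00"],
    [None]*8 + ["Aaron,Petra L",    "Attorney",        None, None, None, "74000.00"],
    [None]*8 + ["Smith,Jane",       "Attorney",        None, None, None, "76000.00"],
]

salaries = defaultdict(list)
people = defaultdict(set)
for row in raw_rows:
    salaries[row[9]].append(float(row[13]))   # CAST( raw_data[13] AS DOUBLE )
    people[row[9]].add(row[8])                # COUNT( DISTINCT raw_data[8] )

# ORDER BY avg_salary DESC
for title in sorted(salaries, key=lambda t: -sum(salaries[t]) / len(salaries[t])):
    avg = sum(salaries[title]) / len(salaries[title])
    print(title, round(avg, 2), len(people[title]))
```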
76. Using the JSON file, recreate the earlier query to find the average salary by job title and how many people have each job title.
77. Log Files
78. Log Files
    • Drill does not natively support reading log files… yet
    • If you are NOT using Merlin, included in the GitHub repo are several .jar files. Please take a second and copy them to <drill directory>/jars/3rdparty
79. Log Files:
    070823 21:00:32 1 Connect root@localhost on test1
    070823 21:00:48 1 Query show tables
    070823 21:00:56 1 Query select * from category
    070917 16:29:01 21 Query select * from location
    070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
80. "log": {
      "type": "log",
      "extensions": [ "log" ],
      "fieldNames": [ "date", "time", "pid", "action", "query" ],
      "pattern": "(\\d{6})\\s(\\d{2}:\\d{2}:\\d{2})\\s+(\\d+)\\s(\\w+)\\s+(.+)"
    }
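The pattern in the configuration is an ordinary regular expression whose capture groups line up with fieldNames (shown here with its backslashes restored; the published slide text dropped them). Applying it in Python, assuming roughly this whitespace layout in the real log:

```python
import re

# Capture groups correspond one-to-one to the configured fieldNames.
pattern = re.compile(r"(\d{6})\s(\d{2}:\d{2}:\d{2})\s+(\d+)\s(\w+)\s+(.+)")
field_names = ["date", "time", "pid", "action", "query"]

line = "070823 21:00:32       1 Connect     root@localhost on test1"
match = pattern.match(line)
record = dict(zip(field_names, match.groups()))
print(record)
# → {'date': '070823', 'time': '21:00:32', 'pid': '1',
#    'action': 'Connect', 'query': 'root@localhost on test1'}
```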
81. SELECT *
    FROM dfs.drillworkshop.`log_files/mysql.log`
83. HTTPD Log Files
84. HTTPD Log Files. For "documentation": https://issues.apache.org/jira/browse/DRILL-3423
    195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] "GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
    23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
    23.95.237.180 - - [25/Oct/2015:04:11:27 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
    158.222.5.157 - - [25/Oct/2015:04:24:31 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
    158.222.5.157 - - [25/Oct/2015:04:24:32 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
85. HTTPD Log Files: the storage plugin configuration:
    "httpd": {
      "type": "httpd",
      "logFormat": "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"",
      "timestampFormat": null
    },
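Drill's httpd reader delegates to the logparser library behind that %-style logFormat string, but for intuition the combined log format can be approximated with a hand-written regex to see which fields come out. This regex is an illustration, not what Drill actually uses:

```python
import re

# Rough equivalent of '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"'.
combined = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

line = ('195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] '
        '"GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 '
        '"http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) '
        'Gecko/20100101 Firefox/35.0"')

fields = combined.match(line).groupdict()
print(fields["host"], fields["status"], fields["referer"])
# → 195.154.46.135 200 http://howto.basjes.nl/
```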
86. HTTPD Log Files. For "documentation": https://issues.apache.org/jira/browse/DRILL-3423
    SELECT *
    FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
88. HTTPD Log Files:
    SELECT request_referer, parse_url( request_referer ) AS url_data
    FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
90. Networking Functions
91. Networking Functions:
    • inet_aton( <ip> ): Converts an IPv4 address to an integer
    • inet_ntoa( <int> ): Converts an integer to an IPv4 address
    • is_private( <ip> ): Returns true if the IP is private
    • in_network( <ip>, <cidr> ): Returns true if the IP is in the CIDR block
    • getAddressCount( <cidr> ): Returns the number of IPs in a CIDR block
    • getBroadcastAddress( <cidr> ): Returns the broadcast address of a CIDR block
    • getNetmask( <cidr> ): Returns the netmask of a CIDR block
    • getLowAddress( <cidr> ): Returns the low IP of a CIDR block
    • getHighAddress( <cidr> ): Returns the high IP of a CIDR block
    • parse_user_agent( <ua_string> ): Returns a map of user agent information
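Most of these functions reduce to 32-bit integer arithmetic on IPv4 addresses. A sketch of four of them using Python's standard ipaddress module (the Drill UDFs themselves are Java; these are illustrative equivalents, not the same code):

```python
import ipaddress

def inet_aton(ip: str) -> int:
    """Dotted-quad IPv4 address -> 32-bit big-endian integer."""
    return int(ipaddress.IPv4Address(ip))

def inet_ntoa(n: int) -> str:
    """32-bit integer -> dotted-quad IPv4 address."""
    return str(ipaddress.IPv4Address(n))

def is_private(ip: str) -> bool:
    return ipaddress.IPv4Address(ip).is_private

def in_network(ip: str, cidr: str) -> bool:
    return ipaddress.IPv4Address(ip) in ipaddress.IPv4Network(cidr)

print(inet_aton("192.168.1.1"))              # → 3232235777
print(inet_ntoa(3232235777))                 # → 192.168.1.1
print(is_private("10.0.0.5"))                # → True
print(in_network("10.1.2.3", "10.0.0.0/8"))  # → True
```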
92. PCAP Files
93. SELECT *
    FROM dfs.test.`dns-zone-transfer-ixfr.pcap`
95. Connecting other Data Sources
98. Connecting other Data Sources:
    SELECT teams.name, SUM( batting.HR ) as hr_total
    FROM batting
    INNER JOIN teams ON batting.teamID = teams.teamID
    WHERE batting.yearID = 1988 AND teams.yearID = 1988
    GROUP BY batting.teamID
    ORDER BY hr_total DESC
100. Connecting other Data Sources: same query run directly in MySQL: 0.047 seconds
101. Connecting other Data Sources: MySQL: 0.047 seconds, Drill: 0.366 seconds
    SELECT teams.name, SUM( batting.HR ) as hr_total
    FROM mysql.stats.batting
    INNER JOIN mysql.stats.teams ON batting.teamID = teams.teamID
    WHERE batting.yearID = 1988 AND teams.yearID = 1988
    GROUP BY teams.name
    ORDER BY hr_total DESC
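Conceptually the federated query is an ordinary hash join plus aggregation over rows that may come from any two sources. A toy version over invented rows (team IDs, names, and HR counts here are made up):

```python
from collections import defaultdict

batting = [  # (teamID, yearID, HR) -- invented sample rows
    ("OAK", 1988, 30), ("OAK", 1988, 24), ("NYA", 1988, 25),
]
teams = [    # (teamID, yearID, name) -- invented sample rows
    ("OAK", 1988, "Oakland Athletics"), ("NYA", 1988, "New York Yankees"),
]

# Build the join side as a hash map, then aggregate HR per team name.
names = {(t, y): name for t, y, name in teams}
hr_total = defaultdict(int)
for team_id, year, hr in batting:
    if year == 1988:                            # WHERE yearID = 1988
        hr_total[names[(team_id, year)]] += hr  # INNER JOIN + SUM(HR)

for name, total in sorted(hr_total.items(), key=lambda kv: -kv[1]):
    print(name, total)  # ORDER BY hr_total DESC
```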
102. Conclusion
    • Drill is easy to use
    • Drill scales
    • Drill is open source
    • Drill is versatile
103. Why aren't you using Drill?
104. Thank you! Charles Givre, @cgivre, givre_charles@bah.com, thedataist.com
