© 2015 MapR Technologies
Follow me at @joebluems for link to code
Breach Detection with Apache Drill
Breach Happens!
Finding the Source of Compromise*
[Figure: customer transactions Monday through Friday across millions of merchant locations, followed by a Saturday status column for millions of customers (✔ ✔ ✖ ✔ ✖).]
* The source of the compromise may not be where the fraudsters use the accounts.
Apache Drill
linux> head -5 sample.json
{acct:"0",merchant:"6998",fraud:"0"}
{acct:"0",merchant:"1269",fraud:"0"}
{acct:"0",merchant:"4286",fraud:"0"}
{acct:"0",merchant:"2371",fraud:"0"}
{acct:"0",merchant:"4545",fraud:"0"}
<drill home>/bin/drill-embedded
drill> select * from `dfs`.`sample.json` limit 5;
+-------+-----------+--------+
| acct | merchant | fraud |
+-------+-----------+--------+
| 0 | 6998 | 0 |
| 0 | 1269 | 0 |
| 0 | 4286 | 0 |
| 0 | 2371 | 0 |
| 0 | 4545 | 0 |
+-------+-----------+--------+
• https://drill.apache.org
• “Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage”
• Write SQL queries to access distributed files without specifying a schema
• Note: use the backtick in the SQL (not a single quote)
Scoring Merchants with Log Likelihood
\[
LL = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} y_{ij} \log\left(\frac{y_{ij}}{m_{ij}}\right)
\]
[Figure: two example 2×2 tables, rows M1 / NOT M1 and M2 / NOT M2, columns FRAUD / NOT FRAUD. Values shown on the slide: 14.3; 10, 0; 0, 10,000; 0.9013; 1,000; 1,000, 100,000.]
• Measures how much fraud we observed beyond what should happen randomly
• Fraud counts alone do not account for the popularity of common merchants
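A minimal sketch of the score above, in plain Java. The slide does not define m_ij; this sketch assumes the usual expected count under independence (row total × column total / grand total). The class name, the helper `table`, and the two merchants are hypothetical; only the totals of 5,000 fraud and 95,000 non-fraud accounts come from the deck.

```java
// Sketch: log-likelihood ratio score for a 2x2 table of counts.
final class LogLikelihoodRatio {

    // y[i][j] are observed counts. The expected count m[i][j] is ASSUMED to be
    // rowTotal[i] * colTotal[j] / grandTotal (independence of rows and columns).
    static double score(long[][] y) {
        double[] row = new double[2];
        double[] col = new double[2];
        double total = 0.0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                row[i] += y[i][j];
                col[j] += y[i][j];
                total += y[i][j];
            }
        }
        double sum = 0.0;
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                if (y[i][j] > 0) {                  // a zero cell contributes nothing
                    double m = row[i] * col[j] / total;
                    sum += y[i][j] * Math.log(y[i][j] / m);
                }
            }
        }
        return 2.0 * sum;
    }

    // Builds the table for one merchant from its own counts and the overall totals.
    static long[][] table(long merchFraud, long merchNonFraud, long allFraud, long allNonFraud) {
        return new long[][] {
            { merchFraud, allFraud - merchFraud },
            { merchNonFraud, allNonFraud - merchNonFraud }
        };
    }

    public static void main(String[] args) {
        // Hypothetical merchants with the same fraud count but different volume.
        double small = score(table(10, 40, 5000, 95000));   // 20% fraud at the merchant
        double large = score(table(10, 190, 5000, 95000));  // 5% fraud, same as overall
        System.out.printf("small merchant: %.2f, large merchant: %.2f%n", small, large);
    }
}
```

Both merchants have 10 frauds, but only the small one stands out, because the large merchant's fraud share equals the overall share. The statistic is two-sided, so a merchant with unusually little fraud would also score high unless such cases are zeroed out separately.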
Drill – Count All Frauds / Non-Frauds
select sum(totalFraud) as `countFraud`,
sum(totalNonFraud) as `countNonFraud` from
( select
case when fraud='1' then 1 else 0 end as `totalFraud`,
case when fraud='0' then 1 else 0 end as `totalNonFraud`
from ( select distinct acct,fraud from `dfs`.`sample.json`)
);
+-------------+----------------+
| countFraud | countNonFraud |
+-------------+----------------+
| 5000 | 95000 |
+-------------+----------------+
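To make the role of `select distinct acct,fraud` concrete, here is a rough in-memory equivalent in Java. It counts each account once per fraud flag, however many transactions it has. The class name and the five sample rows are hypothetical, and this is not how Drill executes the query.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: count distinct accounts by fraud flag, as the query above does.
final class AccountFraudCounts {

    // Each row is {acct, merchant, fraud}, mirroring the fields in sample.json.
    // Returns {countFraud, countNonFraud}.
    static int[] count(String[][] rows) {
        Set<String> seen = new HashSet<>();
        int fraud = 0;
        int nonFraud = 0;
        for (String[] r : rows) {
            String key = r[0] + "|" + r[2];      // distinct (acct, fraud) pair
            if (seen.add(key)) {                 // true only the first time the pair appears
                if ("1".equals(r[2])) {
                    fraud++;
                } else if ("0".equals(r[2])) {
                    nonFraud++;
                }
            }
        }
        return new int[] { fraud, nonFraud };
    }

    public static void main(String[] args) {
        // Hypothetical transactions: three accounts, five rows.
        String[][] rows = {
            { "0", "6998", "0" }, { "0", "1269", "0" },
            { "1", "6998", "1" }, { "1", "4286", "1" },
            { "2", "4545", "0" }
        };
        int[] c = count(rows);
        System.out.println("countFraud=" + c[0] + " countNonFraud=" + c[1]);
    }
}
```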
Drill – Count Frauds at Each Merchant
select merchant, sum(merchFraud) as `merchCountFraud`,
sum(merchNonFraud) as `merchCountNonFraud` from
(select merchant,
case when fraud='1' then 1 else 0 end as `merchFraud`,
case when fraud='0' then 1 else 0 end as `merchNonFraud`
from `dfs`.`sample.json`)
group by merchant
limit 5;
+-----------+------------------+---------------------+
| merchant | merchCountFraud | merchCountNonFraud |
+-----------+------------------+---------------------+
| 6998 | 11 | 121 |
| 1269 | 8 | 130 |
| 4286 | 1 | 116 |
| 2371 | 7 | 124 |
| 4545 | 4 | 133 |
+-----------+------------------+---------------------+
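The per-merchant query is a plain group-by over every transaction row, with no de-duplication of accounts. A small Java sketch of the same aggregation, using hypothetical rows and a hypothetical class name:

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: per-merchant fraud / non-fraud transaction counts (GROUP BY merchant).
final class MerchantFraudCounts {

    // Each row is {acct, merchant, fraud}. The result maps merchant to
    // {merchCountFraud, merchCountNonFraud}.
    static Map<String, int[]> count(String[][] rows) {
        Map<String, int[]> byMerchant = new TreeMap<>();
        for (String[] r : rows) {
            int[] c = byMerchant.computeIfAbsent(r[1], k -> new int[2]);
            if ("1".equals(r[2])) {
                c[0]++;                          // case when fraud='1'
            } else if ("0".equals(r[2])) {
                c[1]++;                          // case when fraud='0'
            }
        }
        return byMerchant;
    }

    public static void main(String[] args) {
        // Hypothetical transactions.
        String[][] rows = {
            { "0", "6998", "0" }, { "0", "1269", "0" },
            { "1", "6998", "1" }, { "1", "4286", "1" },
            { "2", "4545", "0" }
        };
        for (Map.Entry<String, int[]> e : count(rows).entrySet()) {
            System.out.println(e.getKey() + " fraud=" + e.getValue()[0]
                    + " nonFraud=" + e.getValue()[1]);
        }
    }
}
```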
Drill UDF (Java) to calculate Log-Likelihood
public void eval() {
    float ll = 0.0f;
    // derive the remaining cells and margins of the 2x2 table
    int n12 = n1t.value - n11.value;
    int n22 = n2t.value - n21.value;
    int nt1 = n11.value + n21.value;
    int nt2 = n12 + n22;
    int nt = nt1 + nt2;
    // calculate LL for non-zero elements: observed * log(observed / expected)
    if (n11.value > 0) {
        ll += n11.value * Math.log(n11.value / ((float) n1t.value * nt1 / nt));
    }
    if (n21.value > 0) {
        ll += n21.value * Math.log(n21.value / ((float) n2t.value * nt1 / nt));
    }
    if (n12 > 0) {
        ll += (float) n12 * Math.log(n12 / ((float) n1t.value * nt2 / nt));
    }
    if (n22 > 0) {
        ll += (float) n22 * Math.log(n22 / ((float) n2t.value * nt2 / nt));
    }
    // if the fraud share at this merchant is below the fraud share elsewhere, set LL to zero
    if (n11.value / (float) (n11.value + n21.value) < (n12 / (float) (n12 + n22))) {
        ll = 0;
    }
    out.value = ll;
}
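For checking the arithmetic outside Drill, the same calculation can be sketched as a plain static method. The class and method names are hypothetical, and this version uses double where the UDF uses float, so small differences from the values on the output slide are expected. Like the UDF as shown, it does not apply the factor of 2 from the formula slide.

```java
// Sketch: standalone version of the UDF arithmetic, for checking values by hand.
final class MerchantScoreCheck {

    // observed * log(observed / expected); a zero cell contributes nothing.
    static double term(double observed, double expected) {
        return observed > 0 ? observed * Math.log(observed / expected) : 0.0;
    }

    // n11, n21: fraud / non-fraud counts at the merchant.
    // n1t, n2t: fraud / non-fraud totals over all accounts.
    static double score(int n11, int n21, int n1t, int n2t) {
        double n12 = n1t - n11;      // fraud away from this merchant
        double n22 = n2t - n21;      // non-fraud away from this merchant
        double nt1 = n11 + n21;      // everything at this merchant
        double nt2 = n12 + n22;      // everything elsewhere
        double nt = nt1 + nt2;
        // Merchants whose fraud share is below the share elsewhere score zero.
        if (n11 / nt1 < n12 / nt2) {
            return 0.0;
        }
        return term(n11, n1t * nt1 / nt) + term(n21, n2t * nt1 / nt)
             + term(n12, n1t * nt2 / nt) + term(n22, n2t * nt2 / nt);
    }

    public static void main(String[] args) {
        // Merchant 5902 from the output slide: n11=16, n21=95, totals 5000 / 95000.
        System.out.printf("merchant 5902: %.4f%n", score(16, 95, 5000, 95000));
        // Merchant 4286 from the per-merchant slide: 1 fraud, 116 non-fraud.
        System.out.printf("merchant 4286: %.4f%n", score(1, 116, 5000, 95000));
    }
}
```

The first call should land close to the 7.03 reported for merchant 5902. The second should be zero, since that merchant's fraud share is well below the overall share.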
Putting it all together
select MERCH.merchant, MERCH.merchCountFraud as `n11`, MERCH.merchCountNonFraud as `n21`,
COUNTS.countFraud as `n1dot`, COUNTS.countNonFraud as `n2dot`,
loglikelihood(cast(MERCH.merchCountFraud as INT),
cast(MERCH.merchCountNonFraud as INT),
cast(COUNTS.countFraud as INT),
cast(COUNTS.countNonFraud as INT)) as `logLike` from (
select 1 as `dummy`,merchant,
sum(merchFraud) as `merchCountFraud`, sum(merchNonFraud) as `merchCountNonFraud`
from (select merchant, case when fraud='1' then 1 else 0 end as `merchFraud`,
case when fraud='0' then 1 else 0 end as `merchNonFraud` from `dfs`.`sample.json`
) group by merchant) `MERCH`
JOIN ( select 1 as `dummy`, sum(totalFraud) as `countFraud`,
sum(totalNonFraud) as `countNonFraud` from
( select case when fraud='1' then 1 else 0 end as `totalFraud`,
case when fraud='0' then 1 else 0 end as `totalNonFraud`
from ( select distinct acct,fraud from `dfs`.`sample.json`)
)) `COUNTS`
on MERCH.dummy=COUNTS.dummy
ORDER BY logLike DESC
limit 10;
Output from Previous Query…
+-----------+------+------+--------+--------+---------------------+
| merchant | n11 | n21 | n1dot | n2dot | logLike |
+-----------+------+------+--------+--------+---------------------+
| 5902 | 16 | 95 | 5000 | 95000 | 7.0296311378479 |
| 4666 | 17 | 118 | 5000 | 95000 | 5.880885601043701 |
| 3486 | 16 | 107 | 5000 | 95000 | 5.8762335777282715 |
| 7961 | 16 | 108 | 5000 | 95000 | 5.793434143066406 |
| 9182 | 16 | 110 | 5000 | 95000 | 5.631403923034668 |
| 7114 | 13 | 81 | 5000 | 95000 | 5.324999809265137 |
| 2127 | 16 | 115 | 5000 | 95000 | 5.222985744476318 |
| 1462 | 16 | 115 | 5000 | 95000 | 5.222985744476318 |
| 2994 | 14 | 94 | 5000 | 95000 | 5.113578796386719 |
| 5770 | 16 | 117 | 5000 | 95000 | 5.064565181732178 |
+-----------+------+------+--------+--------+---------------------+
Breaking Breaches
• Real-life example
• SQL output is processed into histogram
• Tableau chart shows number of merchants per Breach score
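The step from SQL output to histogram can be sketched as follows. The deck does not say how scores were binned; this sketch assumes bins of width 1 (the integer part of the score). The class name is hypothetical, and the input is the ten logLike values from the output slide rather than the real-life data behind the chart.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: bin breach scores into a histogram (merchants per score bucket).
final class ScoreHistogram {

    // ASSUMPTION: bucket = floor(score), i.e. bins of width 1.
    static Map<Integer, Integer> histogram(double[] scores) {
        Map<Integer, Integer> bins = new TreeMap<>();
        for (double s : scores) {
            bins.merge((int) Math.floor(s), 1, Integer::sum);
        }
        return bins;
    }

    public static void main(String[] args) {
        // logLike column from the output slide.
        double[] scores = {
            7.0296311378479, 5.880885601043701, 5.8762335777282715,
            5.793434143066406, 5.631403923034668, 5.324999809265137,
            5.222985744476318, 5.222985744476318, 5.113578796386719,
            5.064565181732178
        };
        for (Map.Entry<Integer, Integer> e : histogram(scores).entrySet()) {
            System.out.println("score " + e.getKey() + ": " + e.getValue() + " merchants");
        }
    }
}
```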
Appendix
Additional Info
• Location of Code/Data Repository
– https://github.com/joebluems/BreachDetection
• Link to Blog on Breach Detection
– https://www.mapr.com/blog/identify-your-data-breach-apache-drill
• A little more on Log-Likelihood
– http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
• Drill
– Documentation: http://drill.apache.org/docs/
– UDFs: https://drill.apache.org/docs/deploying-and-using-a-hive-udf/
– Code for sample UDF: https://github.com/viadea/HiveUDF

Free Code Friday - Identify Your Data Breach with Apache Drill

  • 1. © 2015 MapR Technologies 1 Follow me at @joebluems for link to code © 2015 MapR Technologies Breach Detection with Apache Drill
  • 2. © 2015 MapR Technologies 2 Follow me at @joebluems for link to code Breach Happens!
  • 3. © 2015 MapR Technologies 3 Follow me at @joebluems for link to code Customer transactions – M-F Sat. Status âś” âś” âś– âś” âś– Finding the Source of Compromise* * The source of the compromise may not be where the fraudsters use the accounts millions of customers millions of merchant locations
  • 4. © 2015 MapR Technologies 4 Follow me at @joebluems for link to code Apache Drill linux> head -5 sample.json {acct:"0",merchant:"6998",fraud:"0"} {acct:"0",merchant:"1269",fraud:"0"} {acct:"0",merchant:"4286",fraud:"0"} {acct:"0",merchant:"2371",fraud:"0"} {acct:"0",merchant:"4545",fraud:"0"} <drill home>/bin/drill-embedded drill> select * from `dfs`.`sample.json` limit 5; +-------+-----------+--------+ | acct | merchant | fraud | +-------+-----------+--------+ | 0 | 6998 | 0 | | 0 | 1269 | 0 | | 0 | 4286 | 0 | | 0 | 2371 | 0 | | 0 | 4545 | 0 | +-------+-----------+--------+ • https://drill.apache.org • “Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage” • Write SQL queries to access distributed files without specifying a schema • Note: use the backtick in the SQL (not a single quote)
  • 5. © 2015 MapR Technologies 5 Follow me at @joebluems for link to code Scoring Merchants with Log Likelihood LL = 2* yij log j=1 2 ĂĄ i=1 2 ĂĄ yij mij æ è çç ö ø Ă·Ă· 14.3 10 0 0 10,000 1 1 0.9013 1,000 1,000 100,000 2 2 NO T M2 NO T M1 FRAUD NOT FRAUD FRAUD NOT FRAUD • Measures how much fraud we observed beyond what should happen randomly • Fraud counts alone do not account for the popularity of common merchants
  • 6. © 2015 MapR Technologies 6 Follow me at @joebluems for link to code Drill – Count All Frauds / Non-Frauds select sum(totalFraud) as `countFraud`, sum(totalNonFraud) as `countNonFraud` from ( select case when fraud='1' then 1 else 0 end as `totalFraud`, case when fraud='0' then 1 else 0 end as `totalNonFraud` from ( select distinct acct,fraud from `dfs`.`sample.json`) ); +-------------+----------------+ | countFraud | countNonFraud | +-------------+----------------+ | 5000 | 95000 | +-------------+----------------+
  • 7. © 2015 MapR Technologies 7 Follow me at @joebluems for link to code Drill – Count Frauds at Each Merchant select merchant, sum(merchFraud) as `merchCountFraud`, sum(merchNonFraud) as `merchCountNonFraud` from (select merchant, case when fraud='1' then 1 else 0 end as `merchFraud`, case when fraud='0' then 1 else 0 end as `merchNonFraud` from `dfs`.`sample.json`) group by merchant limit 5; +-----------+------------------+---------------------+ | merchant | merchCountFraud | merchCountNonFraud | +-----------+------------------+---------------------+ | 6998 | 11 | 121 | | 1269 | 8 | 130 | | 4286 | 1 | 116 | | 2371 | 7 | 124 | | 4545 | 4 | 133 | +-----------+------------------+---------------------+
  • 8. Drill UDF (Java) to calculate Log-Likelihood
      public void eval() {
        // Inputs: n11 = frauds at merchant, n21 = non-frauds at merchant,
        //         n1t = all frauds, n2t = all non-frauds.
        // Note: the leading factor of 2 in the LL formula is omitted here;
        // this rescales the score but does not change the ranking.
        float ll = (float) 0.0;
        int n12 = n1t.value - n11.value;   // frauds elsewhere
        int n22 = n2t.value - n21.value;   // non-frauds elsewhere
        int nt1 = n11.value + n21.value;   // accounts at this merchant
        int nt2 = n12 + n22;               // accounts elsewhere
        int nt = nt1 + nt2;                // all accounts

        // calculate LL for non-zero elements
        if (n11.value > 0) {
          ll += n11.value * Math.log(n11.value / ((float) n1t.value * nt1 / nt));
        }
        if (n21.value > 0) {
          ll += n21.value * Math.log(n21.value / ((float) n2t.value * nt1 / nt));
        }
        if (n12 > 0) {
          ll += (float) n12 * Math.log(n12 / ((float) n1t.value * nt2 / nt));
        }
        if (n22 > 0) {
          ll += (float) n22 * Math.log(n22 / ((float) n2t.value * nt2 / nt));
        }

        // if the merchant's fraud rate is below the fraud rate elsewhere, set LL to zero
        if (n11.value / (float) (n11.value + n21.value) < (n12 / (float) (n12 + n22))) ll = 0;
        out.value = ll;
      }
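To check the arithmetic outside Drill, the body of `eval()` can be ported to a plain static method. The sketch below is such a port under stated assumptions: the class and method names are hypothetical, the Drill holder types are replaced by plain `int` arguments, and it computes in `double` rather than `float`, so results can differ from the slide's output in the third decimal place.

```java
final class LogLikelihoodCheck {

    /** One term y * ln(y / m); a zero count contributes nothing. */
    static double term(double observed, double expected) {
        return observed > 0 ? observed * Math.log(observed / expected) : 0.0;
    }

    /**
     * n11 = frauds at merchant, n21 = non-frauds at merchant,
     * n1t = all frauds,         n2t = all non-frauds.
     * Like the UDF, this omits the leading factor of 2.
     */
    static double score(int n11, int n21, int n1t, int n2t) {
        int n12 = n1t - n11;          // frauds elsewhere
        int n22 = n2t - n21;          // non-frauds elsewhere
        double nt1 = n11 + n21;       // accounts at this merchant
        double nt2 = n12 + n22;       // accounts elsewhere
        double nt = nt1 + nt2;        // all accounts

        double ll = term(n11, n1t * nt1 / nt)
                  + term(n21, n2t * nt1 / nt)
                  + term(n12, n1t * nt2 / nt)
                  + term(n22, n2t * nt2 / nt);

        // Same guard as the UDF: ignore merchants whose fraud rate
        // is below the fraud rate everywhere else.
        if (n11 / nt1 < n12 / nt2) {
            ll = 0.0;
        }
        return ll;
    }

    public static void main(String[] args) {
        // Merchant 5902 from the output slide: n11=16, n21=95.
        System.out.println(score(16, 95, 5000, 95000));
        // Merchant 4286 from slide 7: 1 fraud, 116 non-frauds.
        System.out.println(score(1, 116, 5000, 95000));
    }
}
```

For merchant 5902 the port lands close to the 7.03 reported on slide 10, and for a merchant with a below-average fraud rate, such as 4286, the guard forces the score to zero.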
  • 9. Putting it all together
      select MERCH.merchant,
             MERCH.merchCountFraud as `n11`,
             MERCH.merchCountNonFraud as `n21`,
             COUNTS.countFraud as `n1dot`,
             COUNTS.countNonFraud as `n2dot`,
             loglikelihood(cast(MERCH.merchCountFraud as INT),
                           cast(MERCH.merchCountNonFraud as INT),
                           cast(COUNTS.countFraud as INT),
                           cast(COUNTS.countNonFraud as INT)) as `logLike`
      from (
        select 1 as `dummy`, merchant,
               sum(merchFraud) as `merchCountFraud`,
               sum(merchNonFraud) as `merchCountNonFraud`
        from (select merchant,
                     case when fraud='1' then 1 else 0 end as `merchFraud`,
                     case when fraud='0' then 1 else 0 end as `merchNonFraud`
              from `dfs`.`sample.json`)
        group by merchant) `MERCH`
      JOIN (
        select 1 as `dummy`,
               sum(totalFraud) as `countFraud`,
               sum(totalNonFraud) as `countNonFraud`
        from (select case when fraud='1' then 1 else 0 end as `totalFraud`,
                     case when fraud='0' then 1 else 0 end as `totalNonFraud`
              from (select distinct acct,fraud from `dfs`.`sample.json`))) `COUNTS`
      on MERCH.dummy=COUNTS.dummy
      ORDER by loglike desc
      limit 10;
  • 10. Output from Previous Query…
      +-----------+------+------+--------+--------+---------------------+
      | merchant  | n11  | n21  | n1dot  | n2dot  | logLike             |
      +-----------+------+------+--------+--------+---------------------+
      | 5902      | 16   | 95   | 5000   | 95000  | 7.0296311378479     |
      | 4666      | 17   | 118  | 5000   | 95000  | 5.880885601043701   |
      | 3486      | 16   | 107  | 5000   | 95000  | 5.8762335777282715  |
      | 7961      | 16   | 108  | 5000   | 95000  | 5.793434143066406   |
      | 9182      | 16   | 110  | 5000   | 95000  | 5.631403923034668   |
      | 7114      | 13   | 81   | 5000   | 95000  | 5.324999809265137   |
      | 2127      | 16   | 115  | 5000   | 95000  | 5.222985744476318   |
      | 1462      | 16   | 115  | 5000   | 95000  | 5.222985744476318   |
      | 2994      | 14   | 94   | 5000   | 95000  | 5.113578796386719   |
      | 5770      | 16   | 117  | 5000   | 95000  | 5.064565181732178   |
      +-----------+------+------+--------+--------+---------------------+
  • 11. Breaking Breaches
      • Real-life example
      • SQL output is processed into a histogram
      • Tableau chart shows the number of merchants per breach score
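The slide does not say how the SQL output is turned into a histogram. One simple way, sketched below as an assumption, is to bucket each merchant's score into fixed-width bins and count merchants per bin. The class name, method name and the bin width of 1.0 are hypothetical; the input is the ten scores from slide 10, rounded to two decimals.

```java
import java.util.Map;
import java.util.TreeMap;

final class ScoreHistogram {

    /**
     * Counts scores per bin. The map key is the bin index, so the bin
     * covers [key * binWidth, (key + 1) * binWidth).
     */
    static Map<Integer, Integer> histogram(double[] scores, double binWidth) {
        Map<Integer, Integer> bins = new TreeMap<>();
        for (double s : scores) {
            int bin = (int) Math.floor(s / binWidth);
            bins.merge(bin, 1, Integer::sum);   // increment the bin's count
        }
        return bins;
    }

    public static void main(String[] args) {
        // Scores from the output slide, rounded to two decimals.
        double[] scores = {7.03, 5.88, 5.88, 5.79, 5.63,
                           5.32, 5.22, 5.22, 5.11, 5.06};
        System.out.println(histogram(scores, 1.0));   // prints {5=9, 7=1}
    }
}
```

Run over every merchant rather than only the top ten, the bin counts give the merchants-per-score view that the chart presents.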
  • 12. Appendix
  • 13. Additional Info
      • Location of Code/Data Repository
          – https://github.com/joebluems/BreachDetection
      • Link to Blog on Breach Detection
          – https://www.mapr.com/blog/identify-your-data-breach-apache-drill
      • A little more on Log-Likelihood
          – http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
      • Drill
          – Documentation: http://drill.apache.org/docs/
          – UDFs: https://drill.apache.org/docs/deploying-and-using-a-hive-udf/
          – Code for sample UDF: https://github.com/viadea/HiveUDF

Editor's notes

  1. Depends on size and overlap. Significance is measured as overlap beyond what is expected. 1 vs. 2: both are rare items, so we would not expect much overlap, but we see total overlap (drawn slightly askew to show both circles). 3 vs. 4: popular items, so we expect a higher amount of overlap. These calculations can be distributed (MapReduce, Mahout, Spark, etc.).