Data Science at Scale
Spark – Zeppelin - ML
Kirk Haslbeck, Sr. Solution Engineer HWX
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kirk Haslbeck - Hortonworks
Sr. Solution Engineer @ Hortonworks
Lead Architect for Trade Surveillance @ Morgan Stanley
Masters in Data Mining @UMBC
Computer Science Degree @ Mount Saint Mary’s University
github.com/kirkhas/zeppelin-notebooks
Spark – Apache Open Source Project
Why do we need Spark?
• Distributed
– Multithreading in Java is hard to get right, and even when you do, it isn’t distributed: it is limited to a single JVM.
• Horizontal
– Spark takes advantage of a modern data architecture and scales out as hardware is added.
• Data Science
– R and Python are both growing in popularity and are great for statistical workloads, but they suffer from their single-threaded nature.
• Need for a top-level computing language
– SQL is great and provides a lot of what we need, but not everything. SQL is better for some operations and a full programming language for others, which forces tradeoffs. Spark supports both!
Spark API Languages
Spark - Functional + Distributed = Concise and Powerful
[Code comparison: Spark map function vs. Java thread pool]
Objective: given a list of tasks, pad each project timeline with a 20% time buffer.
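A minimal sketch of the map side of that objective, written in plain Python rather than Spark so it runs anywhere. The task names and durations are made up for illustration, and `pad` is a hypothetical helper, not code from the deck.

```python
# Hypothetical task list: (task name, estimated days). Values are illustrative only.
tasks = [("design", 10), ("build", 20), ("test", 5)]

BUFFER = 0.20  # the 20% time buffer from the objective


def pad(task):
    """Return a new (name, days) tuple with the buffer added; the input is not mutated."""
    name, days = task
    return (name, round(days * (1 + BUFFER), 2))


# One functional transformation, with no threads or shared state to manage.
padded = list(map(pad, tasks))

for name, days in padded:
    print(f"{name}: {days} days")
```

With PySpark available, the same function could be handed to an RDD, roughly `sc.parallelize(tasks).map(pad).collect()`, so that the work is spread across executors instead of running in one process.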
Why did Spark choose Scala?
• Functional
– Map, Filter, Fold, GroupBy
– 5-10X code reduction
• Immutable
– No state management, less headache; each operation is fully encapsulated.
• Thread safety is the biggest challenge
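The four functional operations named above can be sketched on an immutable collection. This is a Python analogue, not Scala, and the trade data is hypothetical; `functools.reduce` stands in for fold.

```python
from functools import reduce
from itertools import groupby

# Hypothetical trades as immutable tuples: (symbol, price).
trades = (("AAPL", 100.0), ("MSFT", 50.0), ("AAPL", 110.0), ("MSFT", 55.0))

# Map: extract the price of every trade.
prices = tuple(map(lambda t: t[1], trades))

# Filter: keep only trades priced above 60.
expensive = tuple(filter(lambda t: t[1] > 60, trades))

# Fold: sum all prices, starting from an initial value of 0.0.
total = reduce(lambda acc, p: acc + p, prices, 0.0)

# GroupBy: itertools.groupby only groups adjacent items, so sort by key first.
by_symbol = {
    sym: tuple(price for _, price in group)
    for sym, group in groupby(sorted(trades, key=lambda t: t[0]), key=lambda t: t[0])
}

print(prices, expensive, total, by_symbol)
```

Every step returns a new value and leaves `trades` untouched, which is why none of them needs locking if run in parallel.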
RDDs, DataFrames and DataSets
• Resilient Distributed Dataset
– Good for schema
– case class Trade(sym: String, price: Double)
• DataFrame
– SQL-like operations, higher-level object
– Aggregations, ordering
• Interoperability
– Finally, interop between tables, classes, and vectors for data science, borrowing the best from R, Scala and SQL. The impedance mismatch is solved: no need for a separate domain layer and data access layer.
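As a rough illustration of that interoperability idea, the sketch below treats the same records as typed objects, as table rows, and as a numeric vector. It is plain Python, not the Spark API; the `Trade` dataclass mirrors the case class above and the values are hypothetical.

```python
from collections import defaultdict
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class Trade:
    sym: str
    price: float


# Hypothetical data for illustration.
trades = [Trade("AAPL", 100.0), Trade("MSFT", 50.0), Trade("AAPL", 110.0)]

# "Class" view: typed field access on each object.
symbols = [t.sym for t in trades]

# "Table" view: each record as a row of column values.
rows = [astuple(t) for t in trades]

# "Vector" view: one numeric column, ready for statistics.
price_vector = [t.price for t in trades]

# SQL-like aggregation: SELECT sym, AVG(price) ... GROUP BY sym ORDER BY sym
grouped = defaultdict(list)
for t in trades:
    grouped[t.sym].append(t.price)
avg_price = sorted((sym, sum(p) / len(p)) for sym, p in grouped.items())

print(avg_price)
```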
RDD (low level) vs. DataFrames (new API)
Spark 101 – Execution Model
Spark Driver
– Client-side application that creates the Spark Context
Spark Context
– Talks to the Spark Driver and the Cluster Manager to launch Spark Executors
Cluster Manager – e.g. YARN, Spark Standalone, Mesos
Executors – Spark worker bees
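A toy analogy of this execution model, not Spark's actual scheduler: a driver function splits the data into partitions, a thread pool stands in for the executors, and the driver combines the partial results. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n):
    """Split data into n roughly equal chunks (a stand-in for partitions)."""
    return [data[i::n] for i in range(n)]


def executor_task(chunk):
    """Work done on one partition; here, a partial sum of squares."""
    return sum(x * x for x in chunk)


def driver(data, num_executors=3):
    """Plan the work, hand partitions to the workers, then combine partial results."""
    chunks = partition(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partials = list(pool.map(executor_task, chunks))
    return sum(partials)


result = driver(list(range(10)))
print(result)
```

The analogy stops at one machine: a thread pool lives inside a single process, whereas Spark executors are separate processes on cluster nodes obtained through the cluster manager.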
Spark Engine in the HDP Stack
Spark is a first-class citizen of Hadoop
Demo
Show me the Code!
Model Inputs
Data Gathering
Custom Logic
Process Flow
Evaluate Results
What About Machine Learning?
Machine Learning and Big Data
Machine learning has advanced to the point where it more or less goes hand-
in-hand with Big Data. Indeed, so popular is the technology that over a third of
developers – some 36 percent – who are working on Big Data or advanced
analytics projects use elements of machine learning, says a new study by
Evans Data Corp.
Machine Learning involves creating and improving complex algorithms that are
able to analyze data automatically and identify patterns or predict outcomes
based on the knowledge they have “learned”. As such, it has great potential for
helping companies to better understand what their data is telling them.
Where Can We Use Data Science?
Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates
Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens
Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security
Retail
• Product recommendation
• Inventory management
• Price optimization
Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis
Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels
Customer Use Cases with Spark
Web Analytics - WebTrends
Web Analytics for Marketing
• Ingesting 13 Billion events/Day
• Use Spark Streaming & Samza for Data Ingest
• Extremely low latency: 40 milliseconds
• Need more metrics for Spark Streaming
• Wants two-way SSL for the Kafka Spark receiver
Bank/Credit Card
Real-time monitoring and fraud detection
• Monitor ATM with NiFi
• Start with Log Aggregation
• Tackle fraud detection next
Railroad Company
Real-time view of the state of the track
• Optimize train maintenance
• Large volume of track data, down to foot-level granularity
• GeoSpatial analytics is critical
Cable Company
Optimize Advertising
• Monitor channel changes with Spark Streaming
• Correlate changes with ads/programming
• Allocate ads in real time: show ads to users who are watching a show and will stay for more than 20 seconds
• How to optimize Spark App development
Example: Credit Card Fraud Detection
Building a Model
• Show of hands: how many have built a “model”?
• What are some limitations?
– Conditional logic: if/else binary decisions
• If you need a lot of data to build a good model, what tools can you use?
– Data volumes can rule out desktop tools
• Sampling?
– Well… we had better get an even distribution of true and false positives in each sample. But that requires data munging, which brings us back to the question of which tools we can use.
• Security concerns?
– Extracting data from its secure resting place and pushing it into other environments, often unsecured files or desktops where Matlab or R can be installed.
• Collaboration
– Push processing to the data using modern distributed tooling.
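The sampling concern above can be made concrete with a stratified sample. The sketch below reads "even distribution" as equal counts per class, which is an assumption; the labels, record layout and sizes are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, label_of, per_label, seed=42):
    """Draw up to per_label records from each label group."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for record in records:
        groups[label_of(record)].append(record)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        k = min(per_label, len(members))  # small groups contribute all they have
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical, heavily imbalanced data: 990 legitimate and 10 fraudulent transactions.
records = [(i, "legit") for i in range(990)] + [(i, "fraud") for i in range(990, 1000)]

sample = stratified_sample(records, label_of=lambda r: r[1], per_label=10)
counts = Counter(label for _, label in sample)
print(counts)
```

A plain random sample of 20 records from this data would usually contain no fraud cases at all, which is the problem the slide points at.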
“All models are wrong, some are useful”
George E. P. Box
The most limiting factor is the data. With modern systems we are now able to capture more data and, hopefully, produce better insights.
Credit Card Fraud
• Requirement: Detect fraudulent transactions.
• Goal: Save the card company money, build trust among card users, and cut down on fraud.
• Functional Requirement: Detect fraud in under 2 seconds at the point of sale. Learn, adapt and make smarter decisions over time.
• Design
– Distance: How far can one travel over a period of time before it is fraudulent?
– Category: How can we detect a purchase that a customer wouldn’t likely make?
– Frequency: How can we detect purchasing patterns that do not resemble the card holder’s?
• Ideas?
– Whiteboard some conditional logic: egregiousness vs. binary
– Back-test the data
– Build a model per card holder?
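One way to approach the Distance question is to compute the speed implied by two consecutive transactions. The sketch below uses the haversine great-circle formula; the 900 km/h threshold, the record layout and the function names are assumptions for illustration, not the method used in the demo.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed threshold, roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implausible_travel(prev, curr, max_speed_kmh=MAX_SPEED_KMH):
    """prev and curr are (lat, lon, hours). True if the implied speed is too high."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    elapsed = curr[2] - prev[2]
    if elapsed <= 0:
        return distance > 0  # same instant but a different place
    return distance / elapsed > max_speed_kmh


# Hypothetical transactions: New York, then Los Angeles 1 hour or 10 hours later.
new_york = (40.7128, -74.0060, 0.0)
los_angeles_1h = (34.0522, -118.2437, 1.0)
los_angeles_10h = (34.0522, -118.2437, 10.0)

print(implausible_travel(new_york, los_angeles_1h))
print(implausible_travel(new_york, los_angeles_10h))
```

Returning the implied speed instead of a boolean would give a graded "egregiousness" score rather than a binary decision.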
Rules, Statistics, Machine Learning
• Rule-Based Logic
– Great for checking conditions that can be evaluated with 100% accuracy. Easy to build, with no reason to over-engineer.
– Example: spending limit. Card holder limit = $2,000; if (currentPurchaseAmount + balance > 2,000) then deny the transaction.
• Statistics
– Mean, median, mode, variance, deviation
– Anomaly detection, outliers (e.g. the women’s retail example)
• Machine Learning
– Supervised
– Unsupervised
– Trainable
– Adapts over time
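The first two approaches can be sketched side by side. The rule is the spending-limit example from the slide; the statistical check is a simple z-score test, chosen here as one common form of outlier detection. The purchase history and the threshold of 3 are assumptions.

```python
from statistics import mean, stdev

CARD_LIMIT = 2000.0  # the $2,000 card holder limit from the example


def deny_over_limit(current_purchase_amount, balance, limit=CARD_LIMIT):
    """Rule-based check: deny when the purchase would push the balance over the limit."""
    return current_purchase_amount + balance > limit


def is_outlier(amount, history, z_threshold=3.0):
    """Statistical check: flag amounts far from the mean of past purchases."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least two values
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > z_threshold


# Hypothetical purchase history for one card holder.
history = [45.0, 8.5, 88.0, 55.0]

print(deny_over_limit(200.0, 1900.0))
print(is_outlier(1500.0, history))
print(is_outlier(60.0, history))
```

The rule gives the same answer for every card holder, while the statistical check depends on each holder's own history, which is a first step toward the adaptive behaviour listed under Machine Learning.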
Discovery
• Gathered all credit card transactions
– Problem is they didn’t make sense
– No identifiable patterns, no log-normal curves
– Gas $45, Chipotle $8.50, steak dinner $88, Amazon shoes $55
• Classification
Outlier Detection: identify abnormal patterns
Example: identify anomalies
Features:
- Time frequency
- Category
- Amount
- Distance
Hortonworks Data Flow
Machine Learning Continued
Classification: predicting a category
Some techniques:
- Naïve Bayes
- Decision Tree
- Logistic Regression
- SGD
- Support Vector Machines
- Neural Network
- Ensembles
Regression: predict a continuous value
Some techniques:
- Linear Regression / GLM
- Decision Trees
- Support vector regression
- SGD
- Ensembles
Unsupervised Learning: detect natural patterns
Age  State  Annual Income  Marital status
25   CA     $80,000        M
45   NY     $150,000       D
55   WA     $100,500       M
18   TX     $85,000        S
…    …      …              …
No labels -> Model -> Naturally occurring (hidden) structure
Clustering: detect similar instance groupings
Some techniques:
- k-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
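To show what the first technique does, here is a minimal one-dimensional k-means sketch in plain Python. The purchase amounts and starting centroids are hypothetical, and a real workload would use a library implementation rather than this loop.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep the old centroid if a cluster is empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical purchase amounts with two obvious groups.
amounts = [8.5, 9.0, 10.0, 85.0, 88.0, 90.0]
centroids, clusters = kmeans_1d(amounts, centroids=[0.0, 100.0])
print(centroids, clusters)
```

No labels are supplied anywhere: the two groups emerge from the distances alone, which is the sense in which clustering is unsupervised.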
Getting the Proper Fit
Over-fitting:
Model fits the training set too closely and does not generalize well to new inputs
Under-fitting:
Model is too simple to capture the pattern, so it can’t predict accurately
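The difference shows up when training error and test error are compared. The sketch below uses three deliberately simple models on made-up data: one that ignores the input (under-fit), one that memorizes the training set (over-fit), and a least-squares line. All names and numbers are hypothetical.

```python
def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


# Hypothetical data that roughly follows y = 2x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
test = [(5, 10.1), (6, 11.9)]

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)


def underfit(x):
    """Ignores x entirely and always predicts the training mean."""
    return mean_y


lookup = dict(train)


def overfit(x):
    """Memorizes the training points; falls back to the mean for unseen x."""
    return lookup.get(x, mean_y)


# Least-squares line fitted to the training data.
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum((x - mean_x) ** 2 for x, _ in train)
intercept = mean_y - slope * mean_x


def linear(x):
    return slope * x + intercept


for name, model in [("underfit", underfit), ("overfit", overfit), ("linear", linear)]:
    print(name, mse(model, train), mse(model, test))
```

The memorizing model has zero training error yet does badly on the test points, while the mean-only model does badly on both; the line is the only one whose error stays low on data it has not seen.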
Business Intelligence
vs
Data Science
R and Matplotlib now available
R and Matplotlib Visuals in Zeppelin
Matplotlib with Python
Appendix – Links to content
GitHub
https://github.com/kirkhas/zeppelin-notebooks
Credit Card Fraud (real-time ML)
https://community.hortonworks.com/articles/38457/credit-fraud-prevention-demo-a-guided-tour.html
Monte Carlo / VaR
https://community.hortonworks.com/articles/39096/predicting-stock-portfolio-gains-using-monte-carlo.html
Stock Variance
https://community.hortonworks.com/repos/32713/stock-variance-using-zeppelin.html
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Weitere ähnliche Inhalte

Was ist angesagt?

AI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New PrecisionAI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New PrecisionDr. Haxel Consult
 
A Big Data Telco Solution by Dr. Laura Wynter
A Big Data Telco Solution by Dr. Laura WynterA Big Data Telco Solution by Dr. Laura Wynter
A Big Data Telco Solution by Dr. Laura Wynterwkwsci-research
 
HPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & AnalyticsHPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & AnalyticsHPCC Systems
 
Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Cloudera, Inc.
 
Presentation at Wright State University
Presentation at Wright State UniversityPresentation at Wright State University
Presentation at Wright State UniversityHPCC Systems
 
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...Dr. Haxel Consult
 
DataRobot - 머신러닝 자동화 플랫폼
DataRobot - 머신러닝 자동화 플랫폼DataRobot - 머신러닝 자동화 플랫폼
DataRobot - 머신러닝 자동화 플랫폼Sutaek Kim
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceMark West
 
Solving Compliance for Big Data
Solving Compliance for Big DataSolving Compliance for Big Data
Solving Compliance for Big Datafbeckett1
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergencekvnnrao
 
How to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraHow to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraChun Myung Kyu
 
Big Data in small words
Big Data in small wordsBig Data in small words
Big Data in small wordsYogesh Tomar
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
Fit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownFit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownInside Analysis
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionRevolution Analytics
 
Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupCaserta
 
Integrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataIntegrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataDATAVERSITY
 

Was ist angesagt? (20)

Big Data Overview
Big Data OverviewBig Data Overview
Big Data Overview
 
AI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New PrecisionAI-SDV 2021: Francisco Webber - Efficiency is the New Precision
AI-SDV 2021: Francisco Webber - Efficiency is the New Precision
 
A Big Data Telco Solution by Dr. Laura Wynter
A Big Data Telco Solution by Dr. Laura WynterA Big Data Telco Solution by Dr. Laura Wynter
A Big Data Telco Solution by Dr. Laura Wynter
 
HPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & AnalyticsHPCC Systems - Open source, Big Data Processing & Analytics
HPCC Systems - Open source, Big Data Processing & Analytics
 
Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17Transform Banking with Big Data and Automated Machine Learning 9.12.17
Transform Banking with Big Data and Automated Machine Learning 9.12.17
 
Presentation at Wright State University
Presentation at Wright State UniversityPresentation at Wright State University
Presentation at Wright State University
 
7 Predictive Analytics, Spark , Streaming use cases
7 Predictive Analytics, Spark , Streaming use cases7 Predictive Analytics, Spark , Streaming use cases
7 Predictive Analytics, Spark , Streaming use cases
 
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ...
 
DataRobot - 머신러닝 자동화 플랫폼
DataRobot - 머신러닝 자동화 플랫폼DataRobot - 머신러닝 자동화 플랫폼
DataRobot - 머신러닝 자동화 플랫폼
 
A Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data ScienceA Practical-ish Introduction to Data Science
A Practical-ish Introduction to Data Science
 
Solving Compliance for Big Data
Solving Compliance for Big DataSolving Compliance for Big Data
Solving Compliance for Big Data
 
Hadoop Demo eConvergence
Hadoop Demo eConvergenceHadoop Demo eConvergence
Hadoop Demo eConvergence
 
How to design ai functions to the cloud native infra
How to design ai functions to the cloud native infraHow to design ai functions to the cloud native infra
How to design ai functions to the cloud native infra
 
Big Data in small words
Big Data in small wordsBig Data in small words
Big Data in small words
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Fit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data LetdownFit For Purpose: Preventing a Big Data Letdown
Fit For Purpose: Preventing a Big Data Letdown
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to Production
 
Predictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing MeetupPredictive Analytics - Big Data Warehousing Meetup
Predictive Analytics - Big Data Warehousing Meetup
 
Integrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured DataIntegrating Structure and Analytics with Unstructured Data
Integrating Structure and Analytics with Unstructured Data
 

Andere mochten auch

Apache Zeppelin and Spark for Enterprise Data Science
Apache Zeppelin and Spark for Enterprise Data ScienceApache Zeppelin and Spark for Enterprise Data Science
Apache Zeppelin and Spark for Enterprise Data ScienceBikas Saha
 
Zeppelin meetup 2016 madrid
Zeppelin meetup 2016 madridZeppelin meetup 2016 madrid
Zeppelin meetup 2016 madridJongyoul Lee
 
IBM Netezza - The data warehouse in a big data strategy
IBM Netezza - The data warehouse in a big data strategyIBM Netezza - The data warehouse in a big data strategy
IBM Netezza - The data warehouse in a big data strategyIBM Sverige
 
Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning Eugene
 
Fraud Detection Using A Database Platform
Fraud Detection Using A Database PlatformFraud Detection Using A Database Platform
Fraud Detection Using A Database PlatformEZ-R Stats, LLC
 
Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & ZeppelinVinay Shukla
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlibTodd McGrath
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Sparkdatamantra
 
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)Amazon Web Services
 
Real-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment TransactionsReal-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment TransactionsChristian Gügi
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisJen Aman
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
7 Keys to Fraud Prevention, Detection and Reporting
7 Keys to Fraud Prevention, Detection and Reporting7 Keys to Fraud Prevention, Detection and Reporting
7 Keys to Fraud Prevention, Detection and ReportingBrown Smith Wallace
 
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...Codemotion
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeSpark Summit
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With SparkShivaji Dutta
 

Andere mochten auch (18)

Apache Zeppelin and Spark for Enterprise Data Science
Apache Zeppelin and Spark for Enterprise Data ScienceApache Zeppelin and Spark for Enterprise Data Science
Apache Zeppelin and Spark for Enterprise Data Science
 
Zeppelin meetup 2016 madrid
Zeppelin meetup 2016 madridZeppelin meetup 2016 madrid
Zeppelin meetup 2016 madrid
 
IBM Netezza - The data warehouse in a big data strategy
IBM Netezza - The data warehouse in a big data strategyIBM Netezza - The data warehouse in a big data strategy
IBM Netezza - The data warehouse in a big data strategy
 
Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning Data Pipelines and Telephony Fraud Detection Using Machine Learning
Data Pipelines and Telephony Fraud Detection Using Machine Learning
 
Fraud Detection Using A Database Platform
Fraud Detection Using A Database PlatformFraud Detection Using A Database Platform
Fraud Detection Using A Database Platform
 
Data Science with Spark & Zeppelin
Data Science with Spark & ZeppelinData Science with Spark & Zeppelin
Data Science with Spark & Zeppelin
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlib
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301)
 
Real-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment TransactionsReal-Time Fraud Detection in Payment Transactions
Real-Time Fraud Detection in Payment Transactions
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Credit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph AnalysisCredit Fraud Prevention with Spark and Graph Analysis
Credit Fraud Prevention with Spark and Graph Analysis
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
7 Keys to Fraud Prevention, Detection and Reporting
7 Keys to Fraud Prevention, Detection and Reporting7 Keys to Fraud Prevention, Detection and Reporting
7 Keys to Fraud Prevention, Detection and Reporting
 
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ...
 
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo LeeData Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee
 
Machine Learning With Spark
Machine Learning With SparkMachine Learning With Spark
Machine Learning With Spark
 

Ähnlich wie Spark-Zeppelin-ML on HWX

Credit fraud prevention on hwx stack
Credit fraud prevention on hwx stackCredit fraud prevention on hwx stack
Credit fraud prevention on hwx stackKirk Haslbeck
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceThiago Santiago
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonReal-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonSynerzip
 
Harnessing Big Data_UCLA
Harnessing Big Data_UCLAHarnessing Big Data_UCLA
Harnessing Big Data_UCLAPaul Barsch
 
Apache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu BariApache Hadoop and its role in Big Data architecture - Himanshu Bari
Apache Hadoop and its role in Big Data architecture - Himanshu Barijaxconf
 
10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the Globe10 Lessons Learned from Meeting with 150 Banks Across the Globe
10 Lessons Learned from Meeting with 150 Banks Across the GlobeDataWorks Summit
 
Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Splunk-hortonworks-risk-management-oct-2014
Splunk-hortonworks-risk-management-oct-2014Hortonworks
 
Data science workshop
Data science workshopData science workshop
Data science workshopHortonworks
 
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...DataWorks Summit
 
10-Hot-Data-Analytics-Tre-8904178.ppsx
10-Hot-Data-Analytics-Tre-8904178.ppsx10-Hot-Data-Analytics-Tre-8904178.ppsx
10-Hot-Data-Analytics-Tre-8904178.ppsxSangeetaTripathi8
 
Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!Mammothdb - Public VC Pitchdeck!
Mammothdb - Public VC Pitchdeck!Steve Keil
 
The Journey to Big Data Analytics
The Journey to Big Data AnalyticsThe Journey to Big Data Analytics
The Journey to Big Data AnalyticsDr.Stefan Radtke
 
Energy Central Webinar on June 14, 2016
Energy Central Webinar on June 14, 2016Energy Central Webinar on June 14, 2016
Energy Central Webinar on June 14, 2016OMNETRIC
 
The Impact of SMACT on the Data Management Stack
The Impact of SMACT on the Data Management StackThe Impact of SMACT on the Data Management Stack
The Impact of SMACT on the Data Management StackSnapLogic
 
Solving Cybersecurity at Scale
Solving Cybersecurity at ScaleSolving Cybersecurity at Scale
Solving Cybersecurity at ScaleDataWorks Summit
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 

Ähnlich wie Spark-Zeppelin-ML on HWX (20)

Credit fraud prevention on hwx stack
Credit fraud prevention on hwx stackCredit fraud prevention on hwx stack
Credit fraud prevention on hwx stack
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data Science
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonReal-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton
 
Harnessing Big Data_UCLA
Harnessing Big Data_UCLAHarnessing Big Data_UCLA
Harnessing Big Data_UCLA
 
Spark-Zeppelin-ML on HWX

  • 1. Data Science at Scale: Spark – Zeppelin - ML
      Kirk Haslbeck, Sr. Solution Engineer, HWX
  • 2. Kirk Haslbeck - Hortonworks (© Hortonworks Inc. 2011 – 2016. All Rights Reserved)
      - Sr. Solution Engineer @ Hortonworks
      - Lead Architect for Trade Surveillance @ Morgan Stanley
      - Master's in Data Mining @ UMBC
      - Computer Science Degree @ Mount Saint Mary’s University
      - github.com/kirkhas/zeppelin-notebooks
  • 4. Spark – Apache Open Source Project
  • 5. Why do we need Spark?
      - Distributed
          - Multi-threading is hard to get right in Java, and even when you do, it isn’t distributed: it is limited to a single JVM.
      - Horizontal
          - Spark can take advantage of a modern data architecture and scales out as a function of hardware.
      - Data Science
          - R and Python are both growing in popularity and are great for statistical workloads, but they suffer from their single-threaded nature.
      - Need for a top-level computing language
          - SQL is great and provides a lot of what we need, but not everything. SQL is better for some operations and a full programming language for others, which forces tradeoffs. Spark satisfies both!
  • 6. Spark API Languages
  • 7. Spark - Functional + Distributed = Concise and Powerful
      - Comparison: Spark map function vs. Java thread pool
      - Objective: we have a list of tasks and we want to pad each project timeline with a 20% time buffer
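The slide's code is not in the transcript, so here is a minimal sketch of the padding objective as a map over a list. It is plain Python rather than the Spark or Java code shown in the deck, and the task durations and the `pad` helper are hypothetical.

```python
# Hypothetical task durations in days; the slide gives no concrete numbers.
task_days = [10.0, 4.0, 25.0]


def pad(days: float, buffer: float = 0.20) -> float:
    """Return the duration with a proportional time buffer added."""
    return days * (1.0 + buffer)


# map applies pad to each task independently. Because no task depends on
# another, the same function could be handed to a distributed map instead.
padded_days = list(map(pad, task_days))
print(padded_days)
```

Since each element is handled on its own and nothing is mutated, the work can be split across threads or machines without any locking.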
  • 8. Why did Spark choose Scala?
      - Functional
          - Map, Filter, Fold, GroupBy
          - 5-10X code reduction
      - Immutable
          - No state management, less headache; each operation is fully encapsulated.
      - Thread safety is the biggest challenge
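As a rough illustration of the functional operations named above, the sketch below runs filter, fold and group-by over an immutable list. It uses Python's standard library rather than Scala, and the `(symbol, price)` pairs are made up for the example.

```python
from functools import reduce
from itertools import groupby

# Hypothetical (symbol, price) pairs; not taken from the deck.
trades = [("AAA", 10.0), ("BBB", 250.0), ("AAA", 12.0), ("BBB", 240.0)]

# Filter: keep trades priced above 11.
expensive = list(filter(lambda t: t[1] > 11.0, trades))

# Fold: reduce all prices to a single total, starting from 0.0.
total = reduce(lambda acc, t: acc + t[1], trades, 0.0)

# GroupBy: itertools.groupby only groups adjacent items, so sort by key first.
by_symbol = {
    sym: [price for _, price in group]
    for sym, group in groupby(sorted(trades, key=lambda t: t[0]), key=lambda t: t[0])
}

print(expensive, total, by_symbol)
```

Every step returns a new value and leaves `trades` untouched, which is the property that removes the need for state management.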
  • 9. RDDs, DataFrames and DataSets
      - Resilient Distributed Dataset
          - Good for schema
          - case class Trade (sym: String, price: Double)
      - DataFrame
          - SQL-like operations, higher-level object
          - Aggregations, ordering
      - Interoperability
          - Finally, interop between tables, classes, and vectors for data science, borrowing the best from R, Scala and SQL. Impedance mismatch solved: no need for a domain layer or data access layer.
  • 10. RDD (low level) vs. DataFrames (new API)
  • 11. Spark 101 – Execution Model
      - Spark Driver: client-side application that creates the Spark Context
      - Spark Context: talks to the Spark Driver and the Cluster Manager to launch Spark Executors
      - Cluster Manager: e.g. YARN, Spark Standalone, Mesos
      - Executors: Spark worker bees
  • 12. Spark Engine in the HDP Stack
      - Spark is a first-class citizen of Hadoop
  • 13. Demo: Show me the Code!
  • 14. Model Inputs, Data Gathering, Custom Logic, Process Flow, Evaluate Results
  • 15. What About Machine Learning?
  • 16. Machine Learning and Big Data
      Machine learning has advanced to the point where it more or less goes hand-in-hand with Big Data. Indeed, so popular is the technology that over a third of developers (some 36 percent) who are working on Big Data or advanced analytics projects use elements of machine learning, says a new study by Evans Data Corp.
      Machine learning involves creating and improving complex algorithms that are able to analyze data automatically and identify patterns or predict outcomes based on the knowledge they have “learned”. As such, it has great potential for helping companies to better understand what their data is telling them.
  • 17. Where Can We Use Data Science?
      - Healthcare: predict diagnosis; prioritize screenings; reduce re-admittance rates
      - Financial services: fraud detection/prevention; predict underwriting risk; new account risk screens
      - Public sector: analyze public sentiment; optimize resource allocation; law enforcement & security
      - Retail: product recommendation; inventory management; price optimization
      - Telco/mobile: predict customer churn; predict equipment failure; customer behavior analysis
      - Oil & gas: predictive maintenance; seismic data management; predict well production levels
  • 18. Customer Use Cases with Spark
      - Web Analytics (WebTrends): web analytics for marketing
          - Ingesting 13 billion events/day
          - Use Spark Streaming & Samza for data ingest
          - Extremely low latency: 40 milliseconds
          - Need more metrics for Spark Streaming
          - Wants 2-way SSL for Kafka Spark receiver
      - Bank/Credit Card: real-time monitoring and fraud detection
          - Monitor ATM with NiFi
          - Start with log aggregation
          - Tackle fraud detection next
      - Railroad Company: real-time view of state of track
          - Optimize the train maintenance
          - Large volume of track data, down to foot-level granularity
          - GeoSpatial analytics is critical
      - Cable Company: optimize advertising
          - Monitor channel changes with Spark Streaming
          - Correlate changes with ads/programming
          - Allocate ads in real time: show ads to users who are watching a show and will stay for over 20 seconds
          - How to optimize Spark app development
  • 19. Example: Credit Card Fraud Detection
  • 20. Building a Model
      - Show of hands: how many have built a “model”?
      - What are some limitations?
          - Conditional logic: if/else binary decisions
      - If you need a lot of data to build a good model, what tools can you use?
          - Data volumes can eliminate the possibility of desktop tools
      - Sampling?
          - Well… we had better get an even distribution of true and false positives in each sample. But wait, that requires data munging, which brings us back to what tools we can use.
      - Security concerns?
          - Extracting data from its secure resting place and pushing it into other environments, often unsecured files or desktops where Matlab or R can be installed.
      - Collaboration
          - Push processing to the data using modern distributed tooling.
  • 21. “All models are wrong, some are useful” (George E. P. Box)
      - The most limiting factor is the data. With modern systems we are now able to capture more data and, hopefully, produce better insights.
  • 22. Credit Card Fraud
      - Requirement: Detect fraudulent transactions.
      - Goal: Save the card company money, build trust among card users, and cut down on fraud.
      - Functional Requirement: Detect fraud in under 2 seconds at point of sale. Learn, adapt and make smarter decisions over time.
      - Design
          - Distance: How far can one travel over a period of time before it is fraudulent?
          - Category: How can we detect a purchase that a customer wouldn’t likely make?
          - Frequency: How can we detect purchasing patterns that do not resemble the card holder’s?
      - Ideas?
          - Whiteboard some conditional logic, egregiousness vs. binary
          - Back-test the data
          - Build a model per card holder?
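One way to make the Distance question concrete is to compute the speed a card holder would have needed to travel between two consecutive purchases. The sketch below is an assumption, not the demo's actual logic: it uses the haversine great-circle formula, and the 900 km/h threshold and the sample coordinates are illustrative choices.

```python
import math

EARTH_RADIUS_KM = 6371.0    # mean Earth radius
MAX_PLAUSIBLE_KMH = 900.0   # assumed threshold, roughly airliner cruise speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(prev, curr):
    """prev and curr are (lat, lon, unix_seconds) tuples for two purchases."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = (curr[2] - prev[2]) / 3600.0
    if hours <= 0:
        # Same instant: only suspicious if the locations differ.
        return math.inf if distance > 0 else 0.0
    return distance / hours


def is_suspicious(prev, curr):
    return implied_speed_kmh(prev, curr) > MAX_PLAUSIBLE_KMH


# Illustrative purchases: New York, then Los Angeles one hour or six hours later.
new_york = (40.7128, -74.0060, 0)
los_angeles_1h = (34.0522, -118.2437, 3600)
los_angeles_6h = (34.0522, -118.2437, 6 * 3600)
print(is_suspicious(new_york, los_angeles_1h), is_suspicious(new_york, los_angeles_6h))
```

The returned speed could also be kept as a continuous score instead of a yes/no flag, which fits the "egregiousness vs. binary" idea on the slide.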
  • 23. Rules, Statistics, Machine Learning
      - Rule-Based Logic
          - Great for checking conditions that can prove to be 100% accurate. Easy to build and no reason to over-engineer.
          - Example: Spending limit. Card holder limit = $2,000
              - If (currentPurchaseAmount + balance > 2,000) then deny transaction
      - Statistics
          - Mean, median, mode, variance, deviation
          - Anomaly detection. Outliers. (e.g. women's retail example)
      - Machine Learning
          - Supervised
          - Unsupervised
          - Trainable
          - Adapts over time
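The first two approaches can be sketched in a few lines. The spending-limit rule follows the slide's example directly; the z-score check is one simple statistical outlier test chosen for illustration, and the purchase history and the threshold of 3 are assumptions.

```python
from statistics import mean, stdev

CARD_LIMIT = 2000.0  # card holder limit from the slide's example


def within_limit(current_purchase_amount, balance, limit=CARD_LIMIT):
    """Rule from the slide: deny when purchase + balance exceeds the limit."""
    return current_purchase_amount + balance <= limit


def z_score(amount, history):
    """How many sample standard deviations the amount is from the mean."""
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return (amount - mean(history)) / sigma


# Hypothetical past purchase amounts for one card holder.
history = [45.0, 8.5, 88.0, 55.0]
Z_THRESHOLD = 3.0  # assumed cut-off for calling a purchase an outlier

for amount in (60.0, 900.0):
    flagged = abs(z_score(amount, history)) > Z_THRESHOLD
    print(amount, within_limit(amount, balance=1500.0), flagged)
```

The rule gives a hard yes/no answer, while the z-score gives a graded measure of how unusual a purchase is for this card holder.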
  • 24. Discovery
      - Gathered all credit card transactions
          - Problem is they didn’t make sense
          - No identifiable patterns, no log-normal curves
          - Gas $45, Chipotle $8.50, steak dinner $88, Amazon shoes $55
      - Classification
  • 25. Outlier Detection: identify abnormal patterns
      - Example: identify anomalies
      - Features: time frequency, category, amount, distance
  • 26. Hortonworks Data Flow
  • 27. Hortonworks Data Flow
  • 28. Machine Learning Continued
  • 29. Classification: predicting a category
      - Some techniques: Naïve Bayes, Decision Tree, Logistic Regression, SGD, Support Vector Machines, Neural Network, Ensembles
  • 30. Regression: predict a continuous value
      - Some techniques: Linear Regression / GLM, Decision Trees, Support Vector Regression, SGD, Ensembles
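To show what "predict a continuous value" means in practice, here is a minimal sketch of single-variable linear regression using the ordinary least-squares closed form. The house sizes and prices are invented for the example, and a real workload would use a library such as Spark MLlib rather than hand-written code.

```python
# Hypothetical training data: house size and sale price.
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200000.0, 290000.0, 410000.0, 500000.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(sizes, prices)


def predict(size):
    return slope * size + intercept


print(slope, intercept, predict(1800.0))
```

Adding more features (age of house, location and so on) turns this into multiple regression, where the closed form becomes a matrix equation.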
  • 31. Unsupervised Learning: detect natural patterns
      Age   State   Annual Income   Marital status
      25    CA      $80,000         M
      45    NY      $150,000        D
      55    WA      $100,500        M
      18    TX      $85,000         S
      …     …       …               …
      No labels → Model → Naturally occurring (hidden) structure
  • 32. Clustering: detect similar instance groupings
      - Some techniques: k-means, spectral clustering, DBSCAN, hierarchical clustering
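As an illustration of the first technique, below is a small k-means sketch in plain Python. The points are made up, and seeding the centroids with the first k points is a simplifying assumption so the run is deterministic; production implementations usually use random or k-means++ initialization.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100):
    # Assumption: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

Note that no labels are supplied: the grouping comes only from distances between the points, and k has to be chosen by the caller.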
  • 33. Getting the Proper Fit
      - Over-fitting: Model over-fits the training set, but does not generalize well to new inputs
      - Under-fitting: Model can’t predict accurately
  • 34. Business Intelligence vs Data Science
      - R and Matplotlib now available
  • 35. R and Matlab Visuals in Zeppelin
  • 36. Matplotlib with Python
  • 37. Appendix – Links to content
      - Github: https://github.com/kirkhas/zeppelin-notebooks
      - Credit Card Fraud (real-time ML): https://community.hortonworks.com/articles/38457/credit-fraud-prevention-demo-a-guided-tour.html
      - Monte Carlo / VaR: https://community.hortonworks.com/articles/39096/predicting-stock-portfolio-gains-using-monte-carlo.html
      - Stock Variance: https://community.hortonworks.com/repos/32713/stock-variance-using-zeppelin.html

Editor's notes

  1. We have other engines out there, plenty of them and we have SQL / Java
  2. Language trends, thirst for answers.
  3. Example: each task takes a certain amount of time, but we want to buffer the time and leave room for Murphy’s law, in case something goes wrong or takes a bit longer. 20% more time. Each unit is independent.
  4. Concise and declarative, but it also gives the CPU a better description of how to handle the problem.
  5. Anywhere is the real answer. Here are a few examples that have been bubbling up among our customers. Speaker: pick one of these and describe, briefly, an example that you know about in 4-5 sentences. One ‘close’ to the audience is of course best. There’s a separate use case deck by industry in the workshop wiki page that you can use for ideas.
  6. Example from movie: “IDENTITY THEFT”. Some common applications of outlier detection include:
      - Fraud detection: the purchasing behavior of a credit card owner usually changes when the card is stolen, and the abnormal buying patterns can indicate fraud.
      - Medicine: unusual test results may indicate an underlying health issue.
      - Sports: exceptional players may appear as outliers in particular parameters and be placed in positions where the team can most benefit.
  7. Gas, convenience, retail.
  8. Before we can detect an outlier, we have to define it. The most intuitive definition I’ve seen is from a 1980 book, Identification of Outliers (Hawkins): “An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”. Speaker: It’s worth repeating that all of these topics are deep enough to have entire university library shelves filled with books on them. We’re just skimming the surface. Anomaly detection is related to clustering, but almost its inverse. If points that are similar to each other cluster together (representing “normal” behavior or patterns), we want to find the points that are NOT in any cluster.
  9. What: Identify what group a new observation belongs in. This is a simplified visual example, taken from Andrew Ng’s ML class. Plotted are only 2 dimensions (features) from the full feature vector: age and tumor size. The instances provided are “labeled”, in this case marked in red/x vs. blue/circle. Note that some red instances are inside the blue cluster, which represents the fact that learning sets sometimes introduce noise, making the learning task non-trivial. Similarly, some blue points are inside the red cluster.
  10. Regression is supervised learning where, instead of predicting a category (like malignant or benign from the previous examples), we predict a “value”: a number. In this example (again from Andrew Ng’s class) we are trying to predict the “price of a house” given a single variable: size in square feet. Clearly, more complex models in multiple dimensions would be better; for example, we can use other features like “age of house”, “number of previous owners”, “geographic location” or “score for closest public school”.
  11. With unsupervised learning, we again have as input a feature matrix with rows as instances and columns as variables, but NO LABELS. Now the goal is to find a label (cluster number) for each instance, but we are not learning a given function to match; rather, we are trying to figure out the natural way instances may be grouped together. Note that we are usually NOT given the number of desired clusters (often called “K”), and may need to determine this on our own.
  12. Over-fitting means the model performs very well on the training set but does not generalize well, so results on unseen data are poor. As shown in the diagram, this means the model learned the specific granular details of the training set and not the generic function it was meant to learn. This is why we “evaluate” on the validation set (and not the training set): if we measured error on the training set, we might get a false sense of performance when the model is over-fitting. Under-fitting means the model doesn’t have enough degrees of freedom to learn the needed model, and it usually has a high bias. Under-fitting is often a result of an excessively simple model. In practice you won’t encounter under-fitting very often: data sets that are used for predictive modelling nowadays often come with too many predictors, not too few. Nonetheless, when building any predictive model, you should use validation or cross-validation to assess predictive accuracy and avoid these problems. With under-fitting we may have many observations but too few features (the matrix is tall and narrow).
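The point about evaluating on a validation set can be shown with a toy example. In the sketch below, all data is hypothetical: the underlying rule is "label 1 when x > 5", two training labels are deliberately noisy, and a 1-nearest-neighbour lookup stands in for an over-flexible model that memorizes the training set.

```python
# Hypothetical 1-D data. The true rule is: label 1 when x > 5.
# Two training labels (x = 3.0 and x = 8.0) are deliberately noisy.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1)]
validation = [(1.4, 0), (2.9, 0), (3.2, 0), (6.4, 1), (7.9, 1), (8.3, 1)]


def nearest_neighbor_predict(x):
    """Over-flexible model: copy the label of the closest training point."""
    _, label = min(train, key=lambda row: abs(row[0] - x))
    return label


def threshold_predict(x):
    """Simple model: a single cut at x = 5."""
    return 1 if x > 5.0 else 0


def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)


for name, model in [("1-NN", nearest_neighbor_predict), ("threshold", threshold_predict)]:
    print(name, accuracy(model, train), accuracy(model, validation))
```

The memorizing model scores perfectly on the training set because it also reproduces the noise, yet it does worse than the simple threshold on the validation set. Measuring only training accuracy would have picked the wrong model.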