SlideShare a Scribd company logo
1 of 64
1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DATA SCIENCE
CRASH COURSE
BERLIN 2018
Robert Hryniewicz
Data Evangelist
@RobHryniewicz
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AI
MACHINE LEARNING
DEEP LEARNING
WHY NOW?
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Key drivers behind AI Explosion
 Exponential data growth
– And the ability to Process All Data – both Structured & Unstructured
 Faster & open distributed systems
– Such as Hadoop, Spark, TensorFlow, …
 Smarter algorithms
– Esp. in the Machine Learning and Deep Learning domains
– More Accurate Models  Better ROI for Customers
Source: Deloitte Tech Trends 2017 report
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Healthcare
Predict diagnosis
Prioritize screenings
Reduce re-admittance rates
Financial services
Fraud Detection/prevention
Predict underwriting risk
New account risk screens
Public Sector
Analyze public sentiment
Optimize resource allocation
Law enforcement & security
Retail
Product recommendation
Inventory management
Price optimization
Telco/mobile
Predict customer churn
Predict equipment failure
Customer behavior analysis
Oil & Gas
Predictive maintenance
Seismic data management
Predict well production levels
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DATA
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Google does not have better algorithms, only more data.
-- Peter Norvig, Dir of Research, Google
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 50ZB+ in 2021
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data: The New Oil
Training Data: The New New Oil
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
“Effectiveness of AI technologies will be only as good as the data they have
access to, and the most valuable data may exist beyond the borders of
one’s own organization.”
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DATA SCIENCE PREREQUISITES
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
Source: hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DATA SCIENCE & MACHINE LEARNING
WHAT IS A MODEL?
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is a ML Model?
 Mathematical formula with a number of parameters that need to be learned from the
data. Fitting a model to the data is a process known as model training.
 E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input OutputModel
1, 0, 7, 2, … 7, 5, 19, 9, …
y = 2x + 5
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
ALGORITHMS
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
START
Regression
Classification Collaborative Filtering
Clustering
Dimensionality Reduction
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
• Linear Regression
• Alternating Least Squares (ALS)
• K-Means, LDA
• Principal Component Analysis (PCA)
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
CLASSIFICATION
Identifying to which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
 Logistic Regression
– Fast training, linear model
– Classes expressed in probabilities
 Support Vector Machines (SVM)
– “Best” supervised learning algorithm, effective
– More robust to outliers than Log Regression
– Handles non-linearity
 Random Forest
– Fast training
– Handles categorical features
– Does not require feature scaling
– Captures non-linearity and
feature interaction
 Naïve Bayes
– Good for text classification
– Assumes independent variables
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Visual Intro to Decision Trees
 http://www.r2d3.us/visual-intro-to-machine-learning-part-1
CLASSIFICATION
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – auto group customers into different market segments
Algorithms: K-means, LDA
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
 Removing noise in images by selecting only
“important” features
 Removing redundant features, e.g. MPH & KPH are
linearly dependent
Algorithms: Principal Component Analysis (PCA)
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
START
Regression
Classification Deep Learning
Clustering
Dimensionality Reduction
• XGBoost (Extreme Gradient Boosting) • Recurrent Neural Network (RNN)
• Convolutional Neural Network (CNN)
• Yinyang K-Means
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Local Regression (LOESS)
Collaborative Filtering
• Weighted Alternating Least
Squares (WALS)
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DEEP LEARNING
29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
TensorFlow Playground
playground.tensorflow.org
31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Identify the right Deep Learning problems
 DL is terrific at language tasks, image classification, speech translation, machine
translation, and game playing (i.e. Chess, Go, Starcraft).
 It is less performant at traditional Machine Learning tasks such as credit card fraud
detection, asset pricing, and credit scoring.
Source: towardsdatascience.com/deep-misconceptions-about-deep-learning-f26c41faceec
33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Limits to Deep Learning (DL)
 We don’t have infinite datasets
– DL is not great at generalizing
– ImageNet: 9 layers and 60 mil parameters with 650,000 nodes from 1 mil examples with 1000
categories
 Top 10 challenges for Deep Learning
1.Data hungry
2.Shallow and limited capacity for transfer
3.No natural way to deal w/ hierarchical structure
4.Struggles w/ open-ended inference
5.Not sufficiently transparent
6.Now well integrated w/ prior knowledge
7.Cannot distinguish causation from correlation
8.Presumes stable world
9.Works well as an approximation, but answers cannot be fully trusted
10. Difficult to engineer with
Source: arxiv.org/pdf/1801.00631.pdf
34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
AI HACKING
37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DATA SCIENCE JOURNEY
40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Start by Asking Relevant Questions
 Specific (can you think of a clear answer?)
 Measurable (quantifiable? data driven?)
 Actionable (if you had an answer, could you do something with it?)
 Realistic (can you get an answer with data you have?)
 Timely (answer in reasonable timeframe?)
42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate seq. of prep operations)
3. Validation (correctness evaluated against sample representative dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of Data Analyst’s job is Data Preparation!
Example of multiple values used for U.S. States  California, CA, Cal., Cal
43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Feature Selection
 Also known as variable or attribute selection
 Why important?
– simplification of models  easier to interpret by researchers/users
– shorter training times
– enhanced generalization by reducing overfitting
 Dimensionality reduction vs feature selection
– Dimensionality reduction: create new combinations of attributes
– Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?
44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hyperparameters
 Define higher-level model properties, e.g. complexity or learning rate
 Cannot be learned during training  need to be predefined
 Can be decided by
– setting different values
– training different models
– choosing the values that test better
 Hyperparameter examples
– Number of leaves or depth of a tree
– Number of latent factors in a matrix factorization
– Learning rate (in many models)
– Number of hidden layers in a deep neural network
– Number of clusters in a k-means clustering
45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
 Residuals
• residual of an observed value is the difference between
the observed value and the estimated value
 R2 (R Squared) – Coefficient of Determination
• indicates a goodness of fit
• R2 of 1 means regression line perfectly fits data
 RMSE (Root Mean Square Error)
• measure of differences between values predicted by a model and values actually
observed
• good measure of accuracy, but only to compare forecasting errors of different
models (individual variables are scale-dependent)
46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
With that in mind…
 No simple formula for “good questions” only general guidelines
 The right data is better than lots of data
 Understanding relationships matters
47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Training Set
Learning Algorithm
h
hypothesis/model
input output
Ingest / Enrich Data
Clean / Transform / Filter
Select / Create New Features
Evaluate Accuracy / Score
48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
MODEL TRAINING
49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Scatter Data
|label|features|
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
...
...
50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Coefficients: 2.81 Intercept: 3.05
y = 2.81x + 3.05
Training
Result
51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
SPARK ML PIPELINES
53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Feature
transform
1
Feature
transform
2
Combine
features
Linear
Regression
Input
DataFrame
Input
DataFrame
Output
DataFrame
Pipeline
Pipeline Model
Train
Predict
Export Model
54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark ML Pipeline
 fit() is for training
 transform() is for prediction
Input
DataFrame
Input
DataFrame
Output
Dataframe
Pipeline
Pipeline Model
fit
transform
Train
Predict
55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData) # Train model
results = model.transform(testData) # Test model
SAMPLE CODE
56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DSX + HDP
57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Source: Google NIPS
ML Code
58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
D ATA S C I E N C E P L AT F O R M
Community Open Source Scale & Enterprise Security
• Find tutorials and datasets
• Connect with Data Scientists
• Ask questions
• Read articles and papers
• Fork and share projects
• Code in Scala/Python/R/SQL
• Zeppelin & Jupyter Notebooks
• RStudio IDE and Shiny
• Apache Spark
• Your favorite libraries
• Data Science at Scale
• Run Spark Jobs on HDP Cluster
• Secure Hadoop Support
• Ranger Atlas Support for Data
• Support for ABAC
Model Management
• Data Shaping Pipeline UI
• Auto-data preparation & modeling
• Advanced Visualizations
• Model management & deployment
• Documented Model APIs
Data Science Experience
SPSS Modeler
for DSX
DO for DSX
60 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
DSX
HDP
Data Science Experience (DSX) Local
Enterprise Data Science platform for teams
Livy REST interface
Hortonworks Data Platform (HDP)
Enterprise compute (Spark/Hive) & storage
(HDFS/Ozone)
61 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
FINAL NOTES
62 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
“The business value of AI consists of its ability to lower the cost of
prediction, just as computers lowered the cost of arithmetic.”
63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Building a Business Around AI - HBR
1. Find and own valuable data no one else has
2. Take a systemic view of your business, and find data adjacencies
3. Package AI for the customer experience
"No single tool, even one as powerful as AI, determines the fate of a business. As
much as the world changes, deep truths — around unearthing customer
knowledge, capturing scarce goods, and finding profitable adjacencies — will
matter greatly. As ever, the technology works to the extent that its owners
know what it can do, and know their market."
64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Robert Hryniewicz
@RobHryniewicz
Thanks!

More Related Content

What's hot

Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...
DataWorks Summit
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
DataWorks Summit
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
DataWorks Summit
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
DataWorks Summit
 

What's hot (20)

SparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data ScientistsSparkR Best Practices for R Data Scientists
SparkR Best Practices for R Data Scientists
 
Airline reservations and routing: a graph use case
Airline reservations and routing: a graph use caseAirline reservations and routing: a graph use case
Airline reservations and routing: a graph use case
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 
Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...Balancing data democratization with comprehensive information governance: bui...
Balancing data democratization with comprehensive information governance: bui...
 
Paris FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks PresentationParis FOD Meetup #5 Hortonworks Presentation
Paris FOD Meetup #5 Hortonworks Presentation
 
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)FOD Paris Meetup -  Global Data Management with DataPlane Services (DPS)
FOD Paris Meetup - Global Data Management with DataPlane Services (DPS)
 
Make Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the DetailsMake Streaming Analytics work for you: The Devil is in the Details
Make Streaming Analytics work for you: The Devil is in the Details
 
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...
 
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
Understanding Your Crown Jewels: Finding, Organizing, and Profiling Sensitive...
 
Overcoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus onOvercoming the AI hype — and what enterprises should really focus on
Overcoming the AI hype — and what enterprises should really focus on
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Shaping a Digital Vision
Shaping a Digital VisionShaping a Digital Vision
Shaping a Digital Vision
 
Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...Designing data pipelines for analytics and machine learning in industrial set...
Designing data pipelines for analytics and machine learning in industrial set...
 
Security, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software IntegrationSecurity, ETL, BI & Analytics, and Software Integration
Security, ETL, BI & Analytics, and Software Integration
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Innovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data WarehouseInnovation in the Enterprise Rent-A-Car Data Warehouse
Innovation in the Enterprise Rent-A-Car Data Warehouse
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Tools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloudTools and approaches for migrating big datasets to the cloud
Tools and approaches for migrating big datasets to the cloud
 

Similar to Data Science Crash Course

Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturing
DataWorks Summit
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
DataWorks Summit
 
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
DataWorks Summit
 
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataBig Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
Matt Stubbs
 

Similar to Data Science Crash Course (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Spark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWXSpark-Zeppelin-ML on HWX
Spark-Zeppelin-ML on HWX
 
Enterprise data science at scale
Enterprise data science at scaleEnterprise data science at scale
Enterprise data science at scale
 
Data science workshop
Data science workshopData science workshop
Data science workshop
 
Hortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data ScienceHortonworks - IBM Cognitive - The Future of Data Science
Hortonworks - IBM Cognitive - The Future of Data Science
 
Achieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturingAchieving a 360 degree view of manufacturing
Achieving a 360 degree view of manufacturing
 
Enterprise Data Science at Scale
Enterprise Data Science at ScaleEnterprise Data Science at Scale
Enterprise Data Science at Scale
 
Edw Optimization Solution
Edw Optimization Solution Edw Optimization Solution
Edw Optimization Solution
 
Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...Achieving a 360-degree view of manufacturing via open source industrial data ...
Achieving a 360-degree view of manufacturing via open source industrial data ...
 
Apache Hadoop Crash Course
Apache Hadoop Crash CourseApache Hadoop Crash Course
Apache Hadoop Crash Course
 
Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop Hadoop Summit Tokyo HDP Sandbox Workshop
Hadoop Summit Tokyo HDP Sandbox Workshop
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Powering the Future of Data  
Powering the Future of Data	   Powering the Future of Data	   
Powering the Future of Data  
 
Credit fraud prevention on hwx stack
Credit fraud prevention on hwx stackCredit fraud prevention on hwx stack
Credit fraud prevention on hwx stack
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
IIoT + Predictive Analytics: Solving for Disruption in Oil & Gas and Energy &...
 
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big DataBig Data LDN 2016: Case Studies of Business Transformation through Big Data
Big Data LDN 2016: Case Studies of Business Transformation through Big Data
 
Hortonworks - IBM - Cloud Event
Hortonworks - IBM - Cloud EventHortonworks - IBM - Cloud Event
Hortonworks - IBM - Cloud Event
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
 
Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ Hadoop Crash Course Hadoop Summit SJ
Hadoop Crash Course Hadoop Summit SJ
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

Data Science Crash Course

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE CRASH COURSE BERLIN 2018 Robert Hryniewicz Data Evangelist @RobHryniewicz
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved AI MACHINE LEARNING DEEP LEARNING WHY NOW?
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Key drivers behind AI Explosion  Exponential data growth – And the ability to Process All Data – both Structured & Unstructured  Faster & open distributed systems – Such as Hadoop, Spark, TensorFlow, …  Smarter algorithms – Esp. in the Machine Learning and Deep Learning domains – More Accurate Models  Better ROI for Customers Source: Deloitte Tech Trends 2017 report
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Healthcare Predict diagnosis Prioritize screenings Reduce re-admittance rates Financial services Fraud Detection/prevention Predict underwriting risk New account risk screens Public Sector Analyze public sentiment Optimize resource allocation Law enforcement & security Retail Product recommendation Inventory management Price optimization Telco/mobile Predict customer churn Predict equipment failure Customer behavior analysis Oil & Gas Predictive maintenance Seismic data management Predict well production levels
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Google does not have better algorithms, only more data. -- Peter Norvig, Dir of Research, Google
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved 50ZB+ in 2021
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data: The New Oil Training Data: The New New Oil
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved “Effectiveness of AI technologies will be only as good as the data they have access to, and the most valuable data may exist beyond the borders of one’s own organization.”
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE PREREQUISITES
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007 Source: hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE & MACHINE LEARNING WHAT IS A MODEL?
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved What is a ML Model?  Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.  E.g. linear regression – Goal: fit a line y = mx + c to data points – After model training: y = 2x + 5 Input OutputModel 1, 0, 7, 2, … 7, 5, 19, 9, … y = 2x + 5
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved ALGORITHMS
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved START Regression Classification Collaborative Filtering Clustering Dimensionality Reduction • Logistic Regression • Support Vector Machines (SVM) • Random Forest (RF) • Naïve Bayes • Linear Regression • Alternating Least Squares (ALS) • K-Means, LDA • Principal Component Analysis (PCA)
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved CLASSIFICATION Identifying to which category an object belongs to Examples: spam detection, diabetes diagnosis, text labeling Algorithms:  Logistic Regression – Fast training, linear model – Classes expressed in probabilities  Support Vector Machines (SVM) – “Best” supervised learning algorithm, effective – More robust to outliers than Log Regression – Handles non-linearity  Random Forest – Fast training – Handles categorical features – Does not require feature scaling – Captures non-linearity and feature interaction  Naïve Bayes – Good for text classification – Assumes independent variables
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Visual Intro to Decision Trees  http://www.r2d3.us/visual-intro-to-machine-learning-part-1 CLASSIFICATION
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved REGRESSION Predicting a continuous-valued output Example: Predicting house prices based on number of bedrooms and square footage Algorithms: Linear Regression
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved CLUSTERING Automatic grouping of similar objects into sets (clusters) Example: market segmentation – auto group customers into different market segments Algorithms: K-means, LDA
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved COLLABORATIVE FILTERING Fill in the missing entries of a user-item association matrix Applications: Product/movie recommendation Algorithms: Alternating Least Squares (ALS)
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DIMENSIONALITY REDUCTION Reducing the number of redundant features/variables Applications:  Removing noise in images by selecting only “important” features  Removing redundant features, e.g. MPH & KPH are linearly dependent Algorithms: Principal Component Analysis (PCA)
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved START Regression Classification Deep Learning Clustering Dimensionality Reduction • XGBoost (Extreme Gradient Boosting) • Recurrent Neural Network (RNN) • Convolutional Neural Network (CNN) • Yinyang K-Means • t-Distributed Stochastic Neighbor Embedding (t-SNE) • Local Regression (LOESS) Collaborative Filtering • Weighted Alternating Least Squares (WALS)
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DEEP LEARNING
  • 29. 29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 30. 30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved TensorFlow Playground playground.tensorflow.org
  • 31. 31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 32. 32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Identify the right Deep Learning problems  DL is terrific at language tasks, image classification, speech translation, machine translation, and game playing (i.e. Chess, Go, Starcraft).  It is less performant at traditional Machine Learning tasks such as credit card fraud detection, asset pricing, and credit scoring. Source: towardsdatascience.com/deep-misconceptions-about-deep-learning-f26c41faceec
  • 33. 33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Limits to Deep Learning (DL)  We don’t have infinite datasets – DL is not great at generalizing – ImageNet: 9 layers and 60 mil parameters with 650,000 nodes from 1 mil examples with 1000 categories  Top 10 challenges for Deep Learning 1.Data hungry 2.Shallow and limited capacity for transfer 3.No natural way to deal w/ hierarchical structure 4.Struggles w/ open-ended inference 5.Not sufficiently transparent 6.Now well integrated w/ prior knowledge 7.Cannot distinguish causation from correlation 8.Presumes stable world 9.Works well as an approximation, but answers cannot be fully trusted 10. Difficult to engineer with Source: arxiv.org/pdf/1801.00631.pdf
  • 34. 34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 35. 35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 36. 36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved AI HACKING
  • 37. 37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
  • 38. 38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Source: scientificamerican.com/article/how-to-hack-an-intelligent-machine
  • 39. 39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DATA SCIENCE JOURNEY
  • 40. 40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 41. 41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Start by Asking Relevant Questions  Specific (can you think of a clear answer?)  Measurable (quantifiable? data driven?)  Actionable (if you had an answer, could you do something with it?)  Realistic (can you get an answer with data you have?)  Timely (answer in reasonable timeframe?)
  • 42. 42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Data Preparation 1. Data analysis (audit for anomalies/errors) 2. Creating an intuitive workflow (formulate seq. of prep operations) 3. Validation (correctness evaluated against sample representative dataset) 4. Transformation (actual prep process takes place) 5. Backflow of cleaned data (replace original dirty data) Approx. 80% of Data Analyst’s job is Data Preparation! Example of multiple values used for U.S. States  California, CA, Cal., Cal
  • 43. 43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Feature Selection  Also known as variable or attribute selection  Why important? – simplification of models  easier to interpret by researchers/users – shorter training times – enhanced generalization by reducing overfitting  Dimensionality reduction vs feature selection – Dimensionality reduction: create new combinations of attributes – Feature selection: include/exclude attributes in data without changing them Q: Which features should you use to create a predictive model?
  • 44. 44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Hyperparameters  Define higher-level model properties, e.g. complexity or learning rate  Cannot be learned during training  need to be predefined  Can be decided by – setting different values – training different models – choosing the values that test better  Hyperparameter examples – Number of leaves or depth of a tree – Number of latent factors in a matrix factorization – Learning rate (in many models) – Number of hidden layers in a deep neural network – Number of clusters in a k-means clustering
  • 45. 45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved  Residuals • residual of an observed value is the difference between the observed value and the estimated value  R2 (R Squared) – Coefficient of Determination • indicates a goodness of fit • R2 of 1 means regression line perfectly fits data  RMSE (Root Mean Square Error) • measure of differences between values predicted by a model and values actually observed • good measure of accuracy, but only to compare forecasting errors of different models (individual variables are scale-dependent)
  • 46. 46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved With that in mind…  No simple formula for “good questions” only general guidelines  The right data is better than lots of data  Understanding relationships matters
  • 47. 47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Training Set Learning Algorithm h hypothesis/model input output Ingest / Enrich Data Clean / Transform / Filter Select / Create New Features Evaluate Accuracy / Score
  • 48. 48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved MODEL TRAINING
  • 49. 49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Scatter Data |label|features| |-12.0| [-4.9]| | -6.0| [-4.5]| | -7.2| [-4.1]| | -5.0| [-3.2]| | -2.0| [-3.0]| | -3.1| [-2.1]| | -4.0| [-1.5]| | -2.2| [-1.2]| | -2.0| [-0.7]| | 1.0| [-0.5]| | -0.7| [-0.2]| ... ... ...
  • 50. 50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Coefficients: 2.81 Intercept: 3.05 y = 2.81x + 3.05 Training Result
  • 51. 51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Linear Regression (two features) Coefficients: [0.464, 0.464] Intercept: 0.0563
  • 52. 52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved SPARK ML PIPELINES
  • 53. 53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Feature transform 1 Feature transform 2 Combine features Linear Regression Input DataFrame Input DataFrame Output DataFrame Pipeline Pipeline Model Train Predict Export Model
  • 54. 54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark ML Pipeline  fit() is for training  transform() is for prediction Input DataFrame Input DataFrame Output Dataframe Pipeline Pipeline Model fit transform Train Predict
  • 55. 55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved rf = RandomForestClassifier(numTrees=100) pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf]) model = pipe.fit(trainData) # Train model results = model.transform(testData) # Test model SAMPLE CODE
  • 56. 56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DSX + HDP
  • 57. 57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Source: Google NIPS ML Code
  • 58. 58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 59. 59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved D ATA S C I E N C E P L AT F O R M Community Open Source Scale & Enterprise Security • Find tutorials and datasets • Connect with Data Scientists • Ask questions • Read articles and papers • Fork and share projects • Code in Scala/Python/R/SQL • Zeppelin & Jupyter Notebooks • RStudio IDE and Shiny • Apache Spark • Your favorite libraries • Data Science at Scale • Run Spark Jobs on HDP Cluster • Secure Hadoop Support • Ranger Atlas Support for Data • Support for ABAC Model Management • Data Shaping Pipeline UI • Auto-data preparation & modeling • Advanced Visualizations • Model management & deployment • Documented Model APIs Data Science Experience SPSS Modeler for DSX DO for DSX
  • 60. 60 © Hortonworks Inc. 2011 – 2016. All Rights Reserved DSX HDP Data Science Experience (DSX) Local Enterprise Data Science platform for teams Livy REST interface Hortonworks Data Platform (HDP) Enterprise compute (Spark/Hive) & storage (HDFS/Ozone)
  • 61. 61 © Hortonworks Inc. 2011 – 2016. All Rights Reserved FINAL NOTES
  • 62. 62 © Hortonworks Inc. 2011 – 2016. All Rights Reserved “The business value of AI consists of its ability to lower the cost of prediction, just as computers lowered the cost of arithmetic.”
  • 63. 63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Building a Business Around AI - HBR 1. Find and own valuable data no one else has 2. Take a systemic view of your business, and find data adjacencies 3. Package AI for the customer experience "No single tool, even one as powerful as AI, determines the fate of a business. As much as the world changes, deep truths — around unearthing customer knowledge, capturing scarce goods, and finding profitable adjacencies — will matter greatly. As ever, the technology works to the extent that its owners know what it can do, and know their market."
  • 64. 64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Robert Hryniewicz @RobHryniewicz Thanks!