Data Science Crash Course

Data Science
Crash Course - DataWorks Summit – Sydney 2017
Robert Hryniewicz
Developer Advocate
T: @robH8z
E: rhryniewicz@hortonworks.com

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
What is Data Science?
Ã Extracting knowledge/insights from data
– Data: structured or unstructured
Ã Continuation of
– statistics
– machine learning
– data mining
– predictive analytics

What is Machine Learning?
“science of how computers learn
without being explicitly programmed”

Machine Learning Use Cases
Healthcare
Predict diagnosis
Prioritize screenings
Reduce re-admittance rates
Financial services
Fraud Detection/prevention
Predict underwriting risk
New account risk screens
Public Sector
Analyze public sentiment
Optimize resource allocation
Law enforcement & security
Retail
Product recommendation
Inventory management
Price optimization
Telco/mobile
Predict customer churn
Predict equipment failure
Customer behavior analysis
Oil & Gas
Predictive maintenance
Seismic data management
Predict well production levels

Spark SQL
Structured Data
Spark Streaming
Near Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis

Spark SQL Overview
Ã Spark module for structured data processing
Ã Three ways to manipulate data:
– DataFrames or Datasets API
– SQL queries

DataFrames
Ã Distributed collection of data organized into named
columns
Ã Conceptually equivalent to a table in relational DB or
a data frame in R/Python
Ã API available in Scala, Java, Python, and R
Col1 Col2 … … ColN
DataFrame
Column
Row
Data is described as a DataFrame
with rows, columns, and a schema

DataFrames
CSVAvro
HIVE
Spark SQL
Col1 Col2 … … ColN
DataFrame
Column
Row
JSON

Spark SQL
Structured Data
Spark Streaming
Near Real-time
Spark MLlib
Machine Learning
GraphX
Graph Analysis

Algorithms

What is a ML Model?
Ã Mathematical formula with a number of parameters that need to be learned from the
data. And fitting a model to the data is a process known as model training
Ã E.g. linear regression
– Goal: fit a line y = mx + c to data points
– After model training: y = 2x + 5
Input OutputModel
1, 0, 7, 2, … 7, 5, 19, 9, …

START
Regression
Classification Collaborative Filtering
Clustering
Dimensionality Reduction
• Logistic Regression
• Support Vector Machines (SVM)
• Random Forest (RF)
• Naïve Bayes
• Linear Regression
• Alternating Least Squares (ALS)
• K-Means, LDA
• Principal Component Analysis (PCA)

CLASSIFICATION
Identifying to which category an object belongs to
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
Ã Logistic Regression
– Fast training, linear model
– Classes expressed in probabilities
Ã Support Vector Machines (SVM)
– “Best” supervised learning algorithm, effective
– More robust to outliers than Log Regression
– Handles non-linearity
Ã Random Forest
– Fast training
– Handles categorical features
– Does not require feature scaling
– Captures non-linearity and
feature interaction
Ã Naïve Bayes
– Good for text classification
– Assumes independent variables

Visual Intro to Decision Trees
Ã http://www.r2d3.us/visual-intro-to-machine-learning-part-1
CLASSIFICATION

REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression

CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – auto group customers into different market segments
Algorithms: K-means, LDA

COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)

DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
Ã Removing noise in images by selecting only
“important” features
Ã Removing redundant features, e.g. MPH & KPH are
linearly dependent
Algorithms: Principal Component Analysis (PCA)

START
Regression
Classification Deep Learning
Clustering
Dimensionality Reduction
• XGBoost (Extreme Gradient Boosting) • Recurrent Neural Network (RNN)
• Convolutional Neural Network (CNN)
• Yinyang K-Means
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
• Local Regression (LOESS)
Collaborative Filtering
• Weighted Alternating Least
Squares (WALS)

Data Science Lifecycle

Tools for Data Science Lifecycle

Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate seq. of prep operations)
3. Validation (correctness evaluated against sample representative dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of Data Analyst’s job is Data Preparation!
Example of multiple values used for U.S. States è California, CA, Cal., Cal

Asking Relevant Questions
Ã Specific (can you think of a clear answer?)
Ã Measurable (quantifiable? data driven?)
Ã Actionable (if you had an answer, could you do something with it?)
Ã Realistic (can you get an answer with data you have?)
Ã Timely (answer in reasonable timeframe?)

Hyperparameters
Ã Define higher-level model properties, e.g. complexity or learning rate
Ã Cannot be learned during training à need to be predefined
Ã Can be decided by
– setting different values
– training different models
– choosing the values that test better
Ã Hyperparameter examples
– Number of leaves or depth of a tree
– Number of latent factors in a matrix factorization
– Learning rate (in many models)
– Number of hidden layers in a deep neural network
– Number of clusters in a k-means clustering

Feature Selection
Ã Also known as variable or attribute selection
Ã Why important?
– simplification of models è easier to interpret by researchers/users
– shorter training times
– enhanced generalization by reducing overfitting
Ã Dimensionality reduction vs feature selection
– Dimensionality reduction: create new combinations of attributes
– Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?

With that in mind…
Ã No simple formula for “good questions” only general guidelines
Ã The right data is better than lots of data
Ã Understanding relationships matters

Training Set
Learning Algorithm
h
hypothesis/model
input output
Ingest / Enrich Data
Clean / Transform / Filter
Select / Create New Features
Evaluate Accuracy / Score

Building Spark ML pipelines
Feature
transform
1
Feature
transform
2
Combine
features
Linear
Regression
Input
DataFrame
Input
DataFrame
Output
DataFrame
Pipeline
Pipeline Model
Train
Predict
Export Model

Spark ML Pipeline
Ã fit() is for training
Ã transform() is for prediction
Input
DataFrame
(TRAIN)
Input
DataFrame
(TEST)
Output
Dataframe
(PREDICTIONS)
Pipeline
Pipeline Model
fit()
transform()
Train
Predict

Sample Spark ML Pipeline
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData) # Train model
results = model.transform(testData) # Test model

Exporting ML Models - PMML
Ã Predictive Model Markup Language (PMML)
–> XML-based predictive model interchange format
Ã Supported models
–K-Means
–Linear Regression
–Ridge Regression
–Lasso
–SVM
–Binary

HDCloud

Hortonworks Cloud Solutions
Microsoft AWS Google
Managed Azure HDInsight
Non-Managed /
Marketplace
Hortonworks Data
Cloud for AWS
Cloud IaaS
Hortonworks Data Platform
(via Ambari and via Cloudbreak)

Zeppelin
Ambari
Spark History Server
Files View

Apache Zeppelin with HDP 2.6
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time
experience
• “Modern Data Science Studio”
Features
• Ad-hoc experimentation
• Deeply integrated with
Spark + Hadoop
• Supports multiple
language backends
• Incubating at Apache
Use Case
Web-based Notebook for interactive analytics

Labs / Tutorials

Scatter 2D Data Visualized
scatterData ç DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7| [-0.2]|
...
...
...

Linear Regression Model Training (one feature)
Coefficients: 2.81 Intercept: 3.05
y = 2.81x + 3.05
Training
Result

Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563

ML Lab
• Residuals
• residual of an observed value is the difference between the observed value and
the estimated value
• R2 (R Squared) – Coefficient of Determination
• indicates a goodness of fit
• R2 of 1 means regression line perfectly fits data
• RMSE (Root Mean Square Error)
• measure of differences between values predicted by a model and values actually
observed
• good measure of accuracy, but only to compare forecasting errors of different
models (individual variables are scale-dependent)

Thanks!
Robert Hryniewicz
@RobH8z
rhryniewicz@hortonworks.com

Data Science Crash Course

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Data Science Crash Course

Ähnlich wie Data Science Crash Course (20)

Mehr von DataWorks Summit

Mehr von DataWorks Summit (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Data Science Crash Course