In this talk I will share insights and knowledge that I have gained from building up a Data Science department from scratch. This talk will be split into three sections:
1. I'll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organisation.
2. Next up, we'll run through some Machine Learning algorithms commonly used by Data Scientists, along with example use cases where these algorithms can be applied.
3. The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
5. Who Am I?
• Previously Java Developer and Architect.
• Currently building and managing a team of
Data Scientists at Bouvet Oslo.
• Leader javaBin (Norwegian Java User Group).
@markawest
10. What is Data Science?
• What is Data Science?
• Machine Learning Algorithms
• Practical Example
@markawest
11. @markawest
“Data Science… is an interdisciplinary
field of scientific methods, processes,
and systems to extract knowledge or
insight from data…”
Wikipedia
18. @markawest
Data Science Process : Hypothesis Driven
1. Question
2. Data
3. Exploratory Data Analysis
4. Formal Modelling
5. Interpretation
6. Communication
7. Result
29. @markawest
Roles Required in a Data Science Project
Data Scientist
• Prove / disprove hypotheses.
• Information and Data gathering.
• Data wrangling.
• Algorithm and ML models.
• Communication.
Data Engineer
• Build Data Driven Platforms.
• Operationalize Algorithms and Machine Learning models.
• Data Integration.
• Monitoring.
Data Visualization
• Storytelling.
• Build Dashboards and other Data visualizations.
• Provide insight through visual means.
Process Owner
• Project Management.
• Manage stakeholder expectations.
• Maintain a Vision.
• Facilitate.
• Evangelize.
31. Isn’t Data Science just a rebranding of Business Intelligence?
@markawest
32. @markawest
Data Science as an Evolution of BI
Data Sources
• Business Intelligence: Structured Data, most often from Relational Database Management Systems (RDBMS).
• Data Science Adds: Unstructured Data (log files, audio, images, emails, tweets, raw text, documents).
Available Tools
• Business Intelligence: Data Visualization, Statistics.
• Data Science Adds: Machine Learning.
Goals
• Business Intelligence: Provide support to strategic decision making, based on historical data.
• Data Science Adds: Provide business value through advanced functionality.
Source: https://www.linkedin.com/pulse/data-science-business-intelligence-whats-difference-david-rostcheck
36. @markawest
Machine Learning: A Tool for Data Science
Artificial Intelligence
Enabling computers to mimic human intelligence and behavior.
Machine Learning
Algorithms allowing computers to learn, make predictions and describe data without being explicitly programmed.
Deep Learning
Black box learning with multi-layered Neural Networks.
37. What is Data Science: Key Takeaways
• Data Scientists require Math and Statistics skills in addition to traditional Software Development.
• Data Science is Hypothesis Driven.
• Data Science projects require a range of competencies/roles.
• Data Science can be seen as an evolution of Business Intelligence, providing additional capabilities through the application of cutting-edge technologies and unstructured data.
@markawest
39. @markawest
“Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.”
Arthur L. Samuel
IBM Journal of Research and Development, 1959
Traditional Programming: Data + Rules → Computer → Output
Machine Learning: Data + Output → Computer → Rules
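The inversion on this slide can be sketched in a few lines of Python. This is only an illustration: the input values, the labels and both function names are made up for the example and do not come from the talk.

```python
def traditional_rule(value):
    """Traditional programming: a person writes the rule (threshold chosen by hand)."""
    return value >= 3


def learn_threshold(values, labels):
    """Machine learning in miniature: derive the rule from data plus known outputs.

    Tries each observed value as a threshold and keeps the one for which
    `value >= threshold` agrees with the most labels.
    """
    def correct_predictions(threshold):
        return sum((v >= threshold) == label for v, label in zip(values, labels))

    return max(sorted(set(values)), key=correct_predictions)


# Hypothetical training data: inputs and the outputs we already know.
values = [0, 1, 2, 3, 4, 5]
labels = [False, False, False, True, True, True]

learned_threshold = learn_threshold(values, labels)
print("Learned rule: value >=", learned_threshold)
```

In the first function the rule is an input to the program; in the second the rule is the output, found from the data and the known answers.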
40. The Art of The Generalized Model
@markawest
Underfitted
Model overlooks underlying patterns in your training data.
Generalized
Captures the correlations in your training data. May have an error margin.
Overfitted
Model memorizes the training data rather than finding underlying patterns.
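The three cases can be compared on a toy problem. The sketch below is an assumption-laden illustration, not the example from the talk: the data is invented (y is roughly 2x with a fixed +/-1 wobble), the underfitted model always predicts the mean, the generalized model is a least-squares line, and the overfitted model simply memorizes the training points.

```python
def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and actual values."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y is roughly 2x, with a fixed +/-1 "noise" pattern.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
test_x = [x + 0.4 for x in train_x]
test_y = [2 * x + (-1 if i % 2 == 0 else 1) for i, x in enumerate(test_x)]

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def underfitted(x):
    """Ignores x completely and always predicts the training mean."""
    return mean_y


def generalized(x):
    """Straight line fitted to the training data with least squares."""
    return slope * x + intercept


def overfitted(x):
    """Memorizes the training data: returns the y of the nearest training x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [("underfitted", underfitted),
                    ("generalized", generalized),
                    ("overfitted", overfitted)]:
    train_error = mean_squared_error(model, train_x, train_y)
    test_error = mean_squared_error(model, test_x, test_y)
    print(f"{name:12s} train MSE = {train_error:6.2f}   test MSE = {test_error:6.2f}")
```

The memorizing model has zero error on the training data but does worse than the simple line on the unseen test points, which is the pattern the slide describes.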
41. Machine Learning Types
@markawest
Supervised Learning
Model trained on historical data. Resulting model can be used to make predictions on new data.
Use Case: Predicting a value based on patterns discovered in previous data.
Unsupervised Learning
Algorithm finds trends and patterns in data, without prior training on historical data.
Use Case: Describing your data based on statistical analysis.
Reinforcement Learning
Model uses a feedback loop to iteratively improve its performance.
Use Case: Learning how to best solve a problem based on trial and error.
48. Fitting a trend line: Ordinary Least Squares
@markawest
[Diagram: a trend line through the data points, with error distances a to f; one point is marked “Outlier?”]
a² + b² + c² + d² + e² + f² = sum of squared error
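A minimal sketch of Ordinary Least Squares for a single feature, using the standard closed-form slope and intercept. The data points are invented for illustration; the second data set changes one value to show how a single outlier pulls the trend line.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes the sum of squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_of_squared_error(xs, ys, slope, intercept):
    """Add up the squared vertical distances between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
slope, intercept = fit_line(xs, ys)
print("clean data:   slope =", slope, "intercept =", intercept,
      "SSE =", sum_of_squared_error(xs, ys, slope, intercept))

# Same data, but the last point is replaced by an outlier.
ys_outlier = [2, 4, 6, 8, 10, 30]
slope_outlier, intercept_outlier = fit_line(xs, ys_outlier)
print("with outlier: slope =", round(slope_outlier, 3),
      "intercept =", round(intercept_outlier, 3),
      "SSE =", round(sum_of_squared_error(xs, ys_outlier, slope_outlier, intercept_outlier), 3))
```

Because the errors are squared, one distant point contributes far more than the others, so the fitted line tilts noticeably towards it.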
49. Linear Regression Notes
Benefits
• Simple to understand.
• Transparent.
Limitations
• Outliers skew trend line.
• Doesn’t work with non-linear relationships.
Some Alternatives
• Non-linear Least Squares.
• Tree algorithms.
@markawest
50. Example Machine Learning Algorithms
@markawest
Supervised Learning
• Regression and Classification: Linear Regression, Decision Trees
Unsupervised Learning
• Clustering: K-Means
51. Decision Tree: Calculating the Best Split
@markawest
Goal: Build a Decision Tree for deciding who gets a payrise this year, based on historical payrise data.
Features: Placements, Complaints, Lived in Norway. Label: Payrise.
Name    Placements  Complaints  Lived in Norway  Payrise
Don     Yes         Yes         Yes              Yes
Lewis   Yes         Yes         No               Yes
Mike    Yes         No          Yes              Yes
Danny   Yes         Yes         No               Yes
Dan     No          No          Yes              No
Elliot  Yes         No          No               Yes
Luke    Yes         No          No               Yes
Tom     Yes         Yes         No               Yes
Nathan  No          Yes         Yes              No
Owen    Yes         No          No               Yes
55. Decision Tree: Calculating the Best Split
@markawest
Candidate splits: Placements (Yes / No), Complaints (Yes / No), Lived in Norway (Yes / No)
Recruiters  Placements  Complaints  Lived in Norway  Payrise
8           8           4           2                Yes
2           0           1           2                No
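One way to score the three candidate splits is Gini impurity, weighted by the size of each branch. The slides do not say which measure was used, so treat Gini here as an assumption; the rows are copied from the payrise table in this section, and the function names are illustrative.

```python
from collections import Counter

COLUMNS = ("Placements", "Complaints", "Lived in Norway", "Payrise")
ROWS = [
    ("Yes", "Yes", "Yes", "Yes"),  # Don
    ("Yes", "Yes", "No", "Yes"),   # Lewis
    ("Yes", "No", "Yes", "Yes"),   # Mike
    ("Yes", "Yes", "No", "Yes"),   # Danny
    ("No", "No", "Yes", "No"),     # Dan
    ("Yes", "No", "No", "Yes"),    # Elliot
    ("Yes", "No", "No", "Yes"),    # Luke
    ("Yes", "Yes", "No", "Yes"),   # Tom
    ("No", "Yes", "Yes", "No"),    # Nathan
    ("Yes", "No", "No", "Yes"),    # Owen
]


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, feature_index, label_index=3):
    """Impurity after splitting on one feature, weighted by branch size."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[label_index])
    return sum(len(group) / len(rows) * gini(group) for group in groups.values())


scores = {COLUMNS[i]: weighted_gini(ROWS, i) for i in range(3)}
best_split = min(scores, key=scores.get)

for feature, score in scores.items():
    print(f"{feature:16s} weighted Gini = {score:.2f}")
print("Best split:", best_split)
```

Placements separates the payrise outcomes perfectly (impurity 0), which matches the 8 / 0 counts in the summary table, so it is the best first split under this measure.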
56. Building a Decision Tree: A Bad Split?
@markawest
Candidate splits: Placements (Yes / No), Complaints (Yes / No), Lived in Norway (Yes / No)
Recruiters  Placements  Complaints  Lived in Norway  Payrise
8           7           8           2                Yes
2           1           0           2                No
57. Decision Tree: Recursive Partitioning
@markawest
Features: Outlook, Temp, Humidity, Wind. Label: Play.
Outlook   Temp  Humidity  Wind    Play
Sunny     Hot   High      Weak    No
Sunny     Hot   High      Strong  No
Overcast  Hot   High      Weak    Yes
…         …     …         …       …
…         …     …         …       …
Overcast  Mild  High      Strong  Yes
Overcast  Hot   Normal    Weak    Yes
Rain      Mild  High      Strong  No
Resulting tree:
Outlook
• Sunny → Humidity: High → No, Normal → Yes
• Overcast → Yes
• Rain → Wind: Strong → No, Weak → Yes
58. Building a Decision Tree: Where to Stop?
@markawest
#1 : All Data at current leaf belongs to the same class.
[Diagram: the Outlook tree, with Humidity and Wind branches ending in Yes / No leaves]
59. Building a Decision Tree: Where to Stop?
@markawest
#2 : Maximum tree depth reached.
[Diagram: the Outlook tree, with Humidity and Wind branches ending in Yes / No leaves]
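Recursive partitioning with both stopping rules can be sketched as below. This is a simplified illustration rather than how scikit-learn builds trees: the split is chosen by counting rows that fall outside the majority class of their branch (an assumption made to keep the code short), the rows are the payrise table from earlier in this section, and all function names are hypothetical.

```python
from collections import Counter


def misclassified(rows, feature, label):
    """Rows that would be wrong if each branch predicted its majority class."""
    wrong = 0
    for value in {row[feature] for row in rows}:
        branch = [row[label] for row in rows if row[feature] == value]
        wrong += len(branch) - Counter(branch).most_common(1)[0][1]
    return wrong


def build_tree(rows, features, label, max_depth):
    """Return a class name (leaf) or a nested dict {feature: {value: subtree}}."""
    labels = [row[label] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop #1: all data at the current leaf belongs to the same class.
    if len(set(labels)) == 1:
        return majority
    # Stop #2: maximum tree depth reached (or nothing left to split on).
    if max_depth == 0 or not features:
        return majority
    best = min(features, key=lambda f: misclassified(rows, f, label))
    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, label, max_depth - 1)
    return {best: branches}


columns = ("Placements", "Complaints", "Lived in Norway", "Payrise")
table = [
    ("Yes", "Yes", "Yes", "Yes"), ("Yes", "Yes", "No", "Yes"),
    ("Yes", "No", "Yes", "Yes"), ("Yes", "Yes", "No", "Yes"),
    ("No", "No", "Yes", "No"), ("Yes", "No", "No", "Yes"),
    ("Yes", "No", "No", "Yes"), ("Yes", "Yes", "No", "Yes"),
    ("No", "Yes", "Yes", "No"), ("Yes", "No", "No", "Yes"),
]
rows = [dict(zip(columns, values)) for values in table]
features = list(columns[:3])

tree = build_tree(rows, features, "Payrise", max_depth=3)
stump = build_tree(rows, features, "Payrise", max_depth=0)
print("Tree:", tree)
print("Depth 0:", stump)
```

With a depth budget of 3 the recursion stops early because both branches of the Placements split are pure (rule #1); with a budget of 0 it stops immediately and falls back to the majority class (rule #2).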
60. Decision Tree Notes
Benefits
• White Box.
• Flexible (use for both regression and classification).
• Robust to outliers.
• Handle non-linear boundaries.
Limitations
• Susceptible to overfitting.
• Changes to where the Data is sliced can produce different results.
Some Alternatives
• Support Vector Machine.
• Logistic Regression.
• Random Forests.
@markawest
62. K-Means Clustering
@markawest
• K = The number of clusters the algorithm will try to find.
• K should be large enough to extract meaningful patterns but small enough that clusters remain clearly distinct.
• So how do we calculate K?
63. Sum of Squared Errors
@markawest
[Diagram: error distances a to f]
a² + b² + c² + d² + e² + f² = sum of squared error
67. K-Means: Calculating the K value
@markawest
• Scree Plots allow us to find the optimal number of clusters.
• Shows the Sum of Squared Errors for different numbers of clusters.
• The optimal K value is at the “Elbow” of the plot.
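The numbers behind a scree plot can be produced with a small K-Means loop. The sketch below is a plain-Python illustration under stated assumptions: the twelve points are invented and form three obvious groups, and the starting centroids are picked at evenly spaced positions in the list purely to make the run reproducible (real implementations usually pick them randomly).

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iterations=100):
    """Return (centroids, assignment, sum_of_squared_errors) for k clusters."""
    # Assumption for reproducibility: start from evenly spaced points in the list.
    centroids = [points[i * len(points) // k] for i in range(k)]
    assignment = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # clusters have stabilised
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, sse


# Hypothetical data: three well-separated groups of four points each.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (8, 8), (8, 9), (9, 8), (9, 9),
    (1, 8), (1, 9), (2, 8), (2, 9),
]

sse_by_k = {k: k_means(points, k)[2] for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(f"K = {k}: sum of squared errors = {sse:.2f}")
```

The error falls sharply up to K = 3 and then barely changes, so the elbow of the resulting plot sits at K = 3 for this data.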
74. K-Means Demo
After 6 iterations: clusters and centroids stabilise, algorithm stops.
@markawest
75. K-Means Clustering Notes
Benefits
• Fast and highly effective at uncovering basic data patterns.
• Works best for spherical, non-overlapping clusters.
Limitations
• Each data point can only be assigned to one cluster.
• Clusters are assumed to be spherical.
Some Alternatives
• Gaussian mixtures.
• Fuzzy K-Means.
@markawest
76. Machine Learning Algorithms: Key Takeaways
@markawest
• The three main types of Machine Learning are Supervised, Unsupervised and Reinforcement Learning.
• Machine Learning is more than Neural Networks and Deep Learning.
• A successful Machine Learning Model needs to find the balance between Overfitting and Underfitting.
• Machine Learning Algorithms are merely tools. Good results come from understanding how they work and tuning them correctly.
78. Use Case: Titanic Passenger Survival
@markawest
Goal: Build a classification model for predicting Titanic survivability.
79. Hypothesis
That it is possible to predict Titanic survivability based on Age, Gender and Ticket Class.
@markawest
80. @markawest
Titanic Dataset
Variable     Description
PassengerId  Unique Identifier
Survival     Survived = 1, Died = 0
Pclass       Ticket class (1, 2 or 3)
Sex          Gender (‘male’ or ‘female’)
Age          Age in years
Sibsp        Number of siblings / spouses aboard the Titanic
Parch        Number of parents / children aboard the Titanic
Ticket       Ticket number
Fare         Passenger fare
Cabin        Cabin number
Embarked     Port of Embarkation
Name         Passenger name, including honorific.
83. Practical Example: Key Takeaways
@markawest
• Scikit-learn and Jupyter Notebooks provide a free and flexible basis for starting with Data Science. Use the Anaconda distribution to save time on installation!
• Feature Engineering is a vital skill for Data Scientists.
• Domain Knowledge is key to succeed!
• Split your data into Test and Training sets.
• Tweaking Hyperparameters may give better results (but you should be able to explain how your tweak improved model performance).
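A minimal scikit-learn sketch of the last two takeaways: splitting data into training and test sets, and setting a hyperparameter (`max_depth`). The rows below are hand-made placeholders, NOT real Titanic records, and the labels are invented; the encoding of the features as `[Pclass, is_female, Age]` is likewise an assumption for the example, not the notebook from the talk.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, hand-made rows (not real passengers): [Pclass, is_female, Age].
features = [
    [1, 1, 29], [1, 0, 40], [2, 1, 24], [2, 0, 35],
    [3, 1, 18], [3, 0, 22], [1, 1, 50], [1, 0, 60],
    [2, 1, 8], [2, 0, 30], [3, 1, 30], [3, 0, 45],
    [1, 1, 35], [3, 0, 19], [2, 1, 41], [3, 0, 28],
]
# Made-up labels for illustration only (1 = survived, 0 = died).
survived = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Hold back 25% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    features, survived, test_size=0.25, random_state=42, stratify=survived
)

# max_depth is a hyperparameter: it limits how deep the tree may grow.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Training rows:", len(X_train), "Test rows:", len(X_test))
print("Accuracy on the test set:", accuracy)
```

Changing `max_depth` and re-running shows how a tweak affects the test score, which is the number to quote when explaining why a tweak helped.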
84. Tips for Getting Started with Data Science
@markawest
• Become a Data Engineer!
• Learn Python or R (SQL is also useful)!
• Learn some statistical methods!
• Take an online Data Science course (e.g. Udemy DS Nano Degree)!
• Understand the Data Science process!
• Join a Meetup!
• Practice with Kaggle!
But first, who the devil am I? As you can see from my Twitter handle, my name is Mark West, and I’m an Englishman living here in Oslo, Norway.
Speaking is a hobby of mine, one I pursue to learn and to share my own knowledge and experiences. In the past couple of years I have spoken at a range of conferences across Europe and the US. The good news is that this is the first time I have spoken at NDC. This is also the first time I have given this specific talk, so I am excited to hear your feedback.
So let’s get started!
Here is the Agenda for my talk. As you can see it is split into four sections.
I’ll then go on to define what Data Science is, what parts are most relevant for us, and how Data Science is linked with Machine Learning and Artificial Intelligence. I’ll also talk about the drivers behind Data Science projects and the roles that these projects require.
Machine Learning is the most popular application of Data Science at the moment, and I’ll therefore use some time to define the categories and types of Machine Learning algorithms, and give you some examples of the most commonly used algorithms.
Finally I will show you a practical example of applied Data Science, and show you how Data Science is more than just Machine Learning.
Right, so what’s the motivation? Why am I here today?
Tip : Possibly replace this with Bouvet’s own methodology if it is ready and good enough.
OK, so let’s move on to the second part of my talk: Machine Learning algorithms.
Machine Learning is all about giving computers a framework to create their own logic or rules, without these being programmed by a human. Look at it as an inversion of control when compared to traditional programming.
An underfitted model is likely to neglect significant trends, which would cause it to yield less accurate predictions for both current and future data.
An overfitted model would yield highly accurate predictions for the current data, but would be less generalizable to future data.
But when parameters are tuned just right, such as shown in Figure 2b, the algorithm strikes a balance between identifying major trends and discounting minor variations, rendering the resulting model well-suited for making predictions.
Note – more complex models are prone to overfitting.
Note that reinforcement learning continuously improves itself, whereas supervised and unsupervised models have to be rebuilt to reflect new data. So if your use case requires you to
Other popular forms of Regression Model include Non-linear Regression, which is used for modelling non-linear trend lines, and Logistic Regression, which is a form of Regression where the trend line is used to separate data points into classes.
Multicollinearity
You go to see a rock and roll band with two great guitar players. You're eager to see which one plays best. But on stage, they're both playing furious leads at the same time! When they're both playing loud and fast, how can you tell which guitarist has the biggest effect on the sound? Even though they aren't playing the same notes, what they're doing is so similar it's difficult to tell one from the other.
As decision trees are grown by splitting data points into homogeneous groups, a slight change in the data could trigger changes to the split, and result in a different tree.
Why Random Forests
As decision trees also aim for the best way to split data points each time, they are vulnerable to overfitting (see Chapter 1.3). Inaccuracy: using the best binary question to split the data at the start might not lead to the most accurate predictions.
Sometimes, less effective splits used initially may lead to better predictions subsequently.
More Data beats complex algorithms : It’s all about the DATA!!!!
Garbage in, Garbage out!!
survival – Did the passenger survive?
pclass – Which
sex
age
sibsp
parch
ticket
Fare
cabin
embarked