Data Science With Python
Mosky
Data Science
➤ = Extract knowledge or insights from data.
➤ Data science includes:
➤ Visualization
➤ Statistics
➤ Machine learning
➤ Deep learning
➤ Big data
➤ And related methods
➤ ≈ Data mining
➤ We will introduce visualization, statistics, and machine learning.
➤ It's kind of outdated, but still contains a lot of keywords.
➤ MrMimic/data-scientist-roadmap – GitHub
➤ Becoming a Data Scientist – Curriculum via Metromap
Statistics vs. Machine Learning
➤ Machine learning = statistics - checking of assumptions 😆
➤ But it does resolve more problems.
➤ Statistics constructs more solid inferences.
➤ Machine learning constructs more interesting predictions.
Probability, Descriptive Statistics, and Inferential Statistics
[Figure: a diagram relating Population and Sample through Probability, Descriptive Statistics, and Inferential Statistics]
Machine Learning vs. Deep Learning
➤ Deep learning is the most renowned part of machine learning.
➤ A.k.a. “AI”.
➤ Deep learning uses artificial neural networks (NNs).
➤ Which are especially good at:
➤ Computer vision (CV) 👀
➤ Natural language processing (NLP) 📖
➤ Machine translation
➤ Speech recognition
➤ Too costly for simple problems.
Big Data
➤ The “size” is a constantly moving target.
➤ As of 2012, it ranges from 10n TB to n PB, a 100x span.
➤ Has high 3Vs:
➤ Volume, amount of data.
➤ Velocity, speed of data in and out.
➤ Variety, range of data types and sources.
➤ A practical definition:
➤ Data that a single computer can't process in a reasonable time.
➤ Distributed computing is a big deal.
Today,
➤ “Models” are the math models.
➤ “Statistical models” emphasize inferences.
➤ “Machine learning models” emphasize predictions.
➤ “Deep learning” and “big data” are gigantic subfields.
➤ We won't introduce them.
➤ But the learning resources are listed at the end.
Mosky
➤ Python Charmer at Pinkoi.
➤ Has spoken at
➤ PyCons in TW, MY, KR, JP, SG, HK, COSCUPs, and TEDx, etc.
➤ Countless hours on teaching Python.
➤ Owns the Python packages:
➤ ZIPCodeTW, MoSQL, Clime, etc.
➤ http://mosky.tw/
The Outline
➤ “Data”
➤ The Analysis Steps
➤ Visualization
➤ Preprocessing
➤ Dimensionality Reduction
➤ Statistical Models
➤ Machine Learning Models
➤ Keep Learning
The Packages
➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
➤ Or
➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
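After installing, it may help to confirm that the packages can be found. The following is a minimal sketch using only the standard library; the import names in PACKAGES are assumptions mapped from the install command (for example, scikit-learn is imported as sklearn), and find_missing is a hypothetical helper, not part of the notebooks.

```python
import importlib.util

# Import names assumed for the packages in the install command;
# scikit-learn is imported as "sklearn" and ipython as "IPython".
PACKAGES = ["jupyter", "numpy", "scipy", "sympy", "matplotlib",
            "IPython", "pandas", "seaborn", "statsmodels", "sklearn"]


def find_missing(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing(PACKAGES)
print("Missing packages:", missing or "none")
```

Anything listed as missing can be installed again with the pip3 or conda command.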
Common Jupyter Notebook Shortcuts
Esc          Edit mode → command mode.
Ctrl-Enter   Run the cell.
B            Insert cell below.
D, D         Delete the current cell.
M            To Markdown cell.
Cmd-/        Comment the code.
H            Show keyboard shortcuts.
P            Open the command palette.
Checkpoint: The Packages
➤ Open up 00_preface_the_packages.ipynb.
➤ Run it.
➤ The notebooks are available on https://github.com/moskytw/data-science-with-python.
“Data”
➤ = Variables
➤ = Dimensions
➤ = Labels + Features
Data in Different Types
Discrete
➤ Nominal, e.g., {male, female}
➤ Ordinal | Ranked: ↑ & can be ordered, e.g., {great > good > fair}
Continuous
➤ Interval: ↑ & distance is meaningful, e.g., temperatures
➤ Ratio: ↑ & 0 is meaningful, e.g., weights
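The four types differ in which operations make sense. The sketch below illustrates this with plain Python; all values and the RANK mapping are hypothetical and chosen only for illustration.

```python
# Hypothetical values chosen only to illustrate the four measurement scales.

# Nominal: only equality is meaningful.
genders = ["male", "female", "female"]
same = genders[1] == genders[2]

# Ordinal: order is meaningful, distance is not.
RANK = {"fair": 0, "good": 1, "great": 2}
reviews = ["good", "great", "fair"]
best = max(reviews, key=RANK.get)

# Interval: differences are meaningful, ratios are not (0 °C is arbitrary).
celsius_a, celsius_b = 10.0, 20.0
difference = celsius_b - celsius_a

# Ratio: zero is meaningful, so ratios are too.
weight_a, weight_b = 30.0, 60.0
ratio = weight_b / weight_a

# Converting to kelvin (a ratio scale) shows 20 °C is not "twice" 10 °C.
kelvin_ratio = (celsius_b + 273.15) / (celsius_a + 273.15)
print(same, best, difference, ratio, round(kelvin_ratio, 3))
```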
Data in the X-Y Form
y                              x
dependent variable             independent variable
response variable              explanatory variable
regressand                     regressor
endogenous variable | endog    exogenous variable | exog
outcome                        design
label                          feature
➤ Confounding variables:
➤ Are not part of x, but may affect y (a strict confounder affects both x and y).
➤ May lead to erroneous conclusions, “garbage in, garbage out”.
➤ Controlling, e.g., fix the environment.
➤ Randomizing, e.g., choose by computer.
➤ Matching, e.g., order by gender and then assign group.
➤ Statistical control, e.g., use BMI to remove the height effect.
➤ Double-blind, even triple-blind trials.
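Randomizing "by computer" can be as simple as shuffling the subjects and splitting the list. This is a minimal sketch with the standard library; the randomize helper and the user IDs are hypothetical, and the seed is there only to make the example repeatable.

```python
import random


def randomize(subjects, seed=None):
    """Shuffle subjects and split them into two near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical user IDs; the seed only makes the example repeatable.
control, treatment = randomize(range(1, 11), seed=42)
print("control:", control)
print("treatment:", treatment)
```

Because the assignment ignores every property of the subjects, confounding variables should be spread evenly across both groups on average.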
Get the Data
➤ Logs
➤ Existing datasets
➤ The Datasets Package – StatsModels
➤ Kaggle
➤ Experiments
The Analysis Steps
The Three Steps
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
1. Define Assumption
➤ Specify a feasible objective.
➤ Not “Use AI to get the moon!”
➤ Write a formal assumption.
➤ “The users will buy 1% of the items from our recommendation.” rather than “The users will love our recommendation!”
➤ Note the dangerous gaps.
➤ “All the items from the recommendation are free!”
➤ “Correlation does not imply causation.”
➤ Consider the next actions.
➤ “Release to 100% of users.” rather than “So great!”
2. Validate Assumption
➤ Collect potential data.
➤ List possible methods.
➤ A plot, a median, or even a mean may be good enough.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Evaluate the metrics of methods with data.
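As one possible way to evaluate a formal assumption such as "the users buy more than 1% of the recommended items", the sketch below computes an exact one-sided binomial p-value with the standard library. The counts are hypothetical, the helper names are made up for this example, and other tests from the cheatsheets above may fit a real case better.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the binomial probability P(X = k) for n trials with rate p."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(p) + (n - k) * math.log(1 - p)


def binom_upper_tail(k, n, p):
    """P(X >= k): the one-sided p-value for 'the true rate is above p'."""
    return sum(math.exp(binom_log_pmf(i, n, p)) for i in range(k, n + 1))


# Hypothetical numbers: 2,000 recommended items shown, 32 bought.
shown, bought, baseline = 2000, 32, 0.01
p_value = binom_upper_tail(bought, shown, baseline)
print(f"observed rate = {bought / shown:.3%}, p-value = {p_value:.4f}")
```

A small p-value suggests the observed purchases would be unlikely if the true rate were only the 1% baseline.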
3. Validated Assumption?
➤ Yes → Congrats! Report fully and take the actions! 🎉
➤ No → Check:
➤ The hypotheses of methods.
➤ The confounding variables in data.
➤ The formality of assumption.
➤ The feasibility of objective.

Iterate Fast While Industry Changes Rapidly
➤ Resolve the small problems first.
➤ Resolve the problems with the highest impact-to-effort ratio first.
➤ One week to get a quick result and improve, rather than one year to get the may-be-the-best result.
➤ Fail fast!
Checkpoint: Pick up a Method
➤ Think of an interesting problem.
➤ E.g., revenue is higher, but is it random?
➤ Pick one method from the cheatsheets.
➤ Selecting Statistical Tests – Bates College
➤ Choosing a statistical test – HBS
➤ Choosing the right estimator – Scikit-Learn
➤ Remember the three analysis steps.
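For the example question "revenue is higher, but is it random?", one method that needs few assumptions is a permutation test. It is a swapped-in illustration, not necessarily what the cheatsheets would recommend. The revenue figures and the permutation_test helper below are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(before, after, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = mean(after) - mean(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = mean(pooled[n_before:]) - mean(pooled[:n_before])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to both counts so the p-value is never exactly zero.
    return observed, (extreme + 1) / (n_resamples + 1)


# Hypothetical daily revenue before and after a change.
before = [100, 98, 105, 110, 95, 102, 99, 104]
after = [108, 112, 103, 115, 109, 111, 107, 114]
observed, p_value = permutation_test(before, after)
print(f"difference = {observed:.2f}, p-value = {p_value:.4f}")
```

The p-value estimates how often a difference at least this large would appear if the "before" and "after" labels were assigned at random.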
Visualization
➤ Make Data Colorful – Plotting
➤ 01_1_visualization_plotting.ipynb
➤ In a Statistical Way – Descriptive Statistics
➤ 01_2_visualization_descriptive_statistics.ipynb
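Before opening the notebooks, a rough idea of descriptive statistics can be had with the standard library alone. This is a small sketch on a hypothetical sample of heights; the notebooks themselves may use pandas or other tools instead.

```python
import statistics

# Hypothetical sample of heights in centimeters.
heights = [150.2, 162.5, 158.1, 171.3, 149.8, 166.0, 158.1, 175.4]

summary = {
    "n": len(heights),
    "mean": statistics.mean(heights),
    "median": statistics.median(heights),
    "stdev": statistics.stdev(heights),  # sample standard deviation
    "quartiles": statistics.quantiles(heights, n=4),
}
for name, value in summary.items():
    print(name, value)
```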
Checkpoint: Plot the Variables
➤ Star98
➤ star98_df = sm.datasets.star98.load_pandas().data
➤ Fair
➤ fair_df = sm.datasets.fair.load_pandas().data
➤ Howell1
➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';')
➤ Or your own datasets.
➤ Plot the variables that interest you.
Preprocessing
Feed the Data That Models Like
➤ Preprocess data for:
➤ Hard requirements, e.g.,
➤ corpus → vectors
➤ “What kind of news will be voted down on PTT?”
➤ Soft requirements (hypotheses), e.g.,
➤ t-test: better when samples are normally distributed.
➤ SVM: better when features range from -1 to 1.
➤ More representative features, e.g., total price / units.
➤ Note that different models have different tastes.
Preprocessing
➤ The Dishes – Containers
➤ 02_1_preprocessing_containers.ipynb
➤ A Cooking Method – Standardization
➤ 02_2_preprocessing_standardization.ipynb
➤ Watch Out for Poisonous Data Points – Removing Outliers
➤ 02_3_preprocessing_removing_outliers.ipynb
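Standardization can be sketched in a few lines of NumPy. The example below shows both the z-score and a min-max scaling into [-1, 1], the range mentioned for SVM. The function names and the feature values are hypothetical, and a constant column would divide by zero, which this sketch does not handle.

```python
import numpy as np


def standardize(x):
    """Z-score: shift to mean 0 and scale to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)


def scale_to_range(x, low=-1.0, high=1.0):
    """Min-max scaling of each column into [low, high]."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return low + (x - x_min) * (high - low) / (x_max - x_min)


# Hypothetical features on very different scales: height (cm), income.
features = np.array([[150.0, 20000.0],
                     [165.0, 35000.0],
                     [180.0, 90000.0]])
z = standardize(features)
scaled = scale_to_range(features)
print(z.round(3))
print(scaled.round(3))
```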
Checkpoint: Preprocess the Variables
➤ Try to standardize and compare.
➤ Try to trim the outliers.
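One common way to trim outliers is the interquartile-range rule (Tukey's fences). It is used here as an assumption for illustration; the notebook may apply a different rule. The data and the remove_outliers helper are hypothetical.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical measurements with one suspicious data point.
data = [10, 12, 11, 13, 12, 11, 10, 95]
print(remove_outliers(data))
```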
Dimensionality Reduction
The Model Sicks Up!
➤ Let's reduce the variables.
➤ Feed a subset → feature selection.
➤ Feature selection using SelectFromModel – Scikit-Learn
➤ Feed a transformation → feature extraction.
➤ PCA, FA, etc.
➤ Another definition: non-numbers → numbers.
Dimensionality Reduction
➤ Principal Component Analysis
➤ 03_1_dimensionality_reduction_principal_component_analysis.ipynb
➤ Factor Analysis
➤ 03_2_dimensionality_reduction_factor_analysis.ipynb
Checkpoint: Reduce the Variables
➤ Try PCA (or FA) on all variables to get better components.
➤ Then plot the n-dimensional data onto a 2-dimensional plane.
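Projecting n-dimensional data onto two components can be sketched with NumPy's SVD. This is not the notebook's code; the pca function and the synthetic data are assumptions for illustration, and in practice a library implementation such as scikit-learn's PCA is the usual choice.

```python
import numpy as np


def pca(x, n_components=2):
    """Project centered data onto its top principal components via SVD."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ components.T, components, explained[:n_components]


# Hypothetical 5-dimensional data driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 2))
data = hidden @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(200, 5))

projected, components, explained = pca(data, n_components=2)
print(projected.shape, explained.round(3))
```

The two columns of projected can then be passed to any scatter plot function to view the data on a plane.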
Statistical Models
➤ Identify Boring or Interesting – Hypothesis Testing
➤ 04_1_statistical_models_hypothesis_testings.ipynb
➤ “Hypothesis Testing With Python”
➤ Identify X-Y Relationships – Regression
➤ 04_2_statistical_models_regression_anova.ipynb
More Regression Models
➤ If y is not linear in x (e.g., binary or count data),
➤ Logit or Poisson Regression | Generalized Linear Models, GLMs
➤ If the observations of y are correlated,
➤ Linear Mixed Models, LMMs | Generalized Estimating Equation, GEE
➤ If x has multicollinearity,
➤ Lasso or Ridge Regression
➤ If the error term is heteroscedastic,
➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS
➤ If x is a time series (predict x_t from x_{t-1} rather than y from x),
➤ Autoregressive Integrated Moving Average, ARIMA
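To see why ridge regression helps with multicollinearity, here is a minimal closed-form sketch in NumPy. It solves (XᵀX + alpha·I) w = Xᵀy without an intercept; the ridge_fit helper and the synthetic data are hypothetical, and lasso is left out because it has no closed form. Library implementations such as those in scikit-learn or StatsModels are the practical choice.

```python
import numpy as np


def ridge_fit(x, y, alpha=1.0):
    """Closed-form ridge: solve (XᵀX + alpha·I) w = Xᵀy (no intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    gram = x.T @ x + alpha * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


# Hypothetical data: the second feature is almost a copy of the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.001 * rng.normal(size=100)
x = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

ols_weights = ridge_fit(x, y, alpha=0.0)    # plain least squares
ridge_weights = ridge_fit(x, y, alpha=1.0)  # shrunken, more stable
print("OLS:  ", ols_weights.round(2))
print("Ridge:", ridge_weights.round(2))
```

With nearly duplicated features, the least-squares weights can swing to large opposite values, while the ridge penalty keeps them small and splits the effect between the two features.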
Checkpoint: Apply a Statistical Method
➤ Try to apply the analysis steps with a statistical method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
Machine Learning Models
➤ Apple or Orange? – Classification
➤ 05_1_machine_learning_models_classification.ipynb
➤ Without Labels – Clustering
➤ 05_2_machine_learning_models_clustering.ipynb
➤ Predict the Values – Regression
➤ Who Are the Best? – Model Selection
➤ sklearn.model_selection.GridSearchCV
Confusion matrix, where A = 00 (binary) = C[0, 0]
                  predicted - (AC)    predicted + (BD)
actual - (AB)     true - (A)          false + (B)
actual + (CD)     false - (C)         true + (D)
➤ Two letters denote a sum, e.g., AC = A + C.
Common “rates” in confusion matrix
➤ precision = D / BD
➤ recall = D / CD
➤ sensitivity = D / CD = recall = observed power
➤ specificity = A / AB = observed confidence level
➤ false positive rate = B / AB = observed α
➤ false negative rate = C / CD = observed β
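The rates above translate directly into code. In this sketch, a, b, c, and d follow the cells A, B, C, and D of the confusion matrix, and a two-letter label such as BD is read as the sum B + D. The counts and the rates function are hypothetical.

```python
def rates(a, b, c, d):
    """Rates from a confusion matrix with a=TN, b=FP, c=FN, d=TP."""
    return {
        "precision": d / (b + d),
        "recall": d / (c + d),  # = sensitivity
        "specificity": a / (a + b),
        "false_positive_rate": b / (a + b),
        "false_negative_rate": c / (c + d),
    }


# Hypothetical counts: 50 true -, 10 false +, 5 false -, 35 true +.
result = rates(a=50, b=10, c=5, d=35)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Note that specificity and the false positive rate sum to 1, as do recall and the false negative rate.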
Ensemble Models
➤ Bagging
➤ Train N independent models and average their outputs.
➤ E.g., the random forest models.
➤ Boosting
➤ Train N models sequentially; the n-th model learns from the (n-1)-th model's errors.
➤ E.g., gradient tree boosting.
Checkpoint: Apply a Machine Learning Method
➤ Try to apply the analysis steps with an ML method.
1. Define Assumption
2. Validate Assumption
3. Validated Assumption?
Keep Learning
➤ Statistics
➤ Seeing Theory
➤ Biological Statistics
➤ scipy.stats + StatsModels
➤ Research Methods
➤ Machine Learning
➤ Scikit-learn Tutorials
➤ Stanford CS229
➤ Hsuan-Tien Lin

➤ Deep Learning
➤ TensorFlow | PyTorch
➤ Stanford CS231n
➤ Stanford CS224n
➤ Big Data
➤ Dask
➤ Hive
➤ Spark
➤ HBase
➤ AWS
The Facts
➤ ∵
➤ You can't learn everything in data science!
➤ ∴
➤ “Let's learn to do” ❌
➤ “Let's do to learn” ✅
The Learning Flow
1. Ask a question.
➤ “How to tell the differences confidently?”
2. Explore the references.
➤ “T-test, ANOVA, ...”
3. Digest into an answer.
➤ Explore in a breadth-first way.
➤ Write the code.
➤ Make it work, make it right, finally make it fast.
Recap
➤ Let's do to learn, not learn to do.
➤ What is your objective?
➤ For the objective, what is your assumption?
➤ For the assumption, what method may validate it?
➤ For the method, how will you evaluate it with data?
➤ Q & A
Data Science With Python

  • 1. Data Science With Python Mosky
  • 2. Data Science ➤ = Extract knowledge or insights from data. ➤ Data science includes: ➤ Visualization ➤ Statistics ➤ Machine learning ➤ Deep learning ➤ Big data ➤ And related methods ➤ ≈ Data mining 2
  • 3. Data Science ➤ = Extract knowledge or insights from data. ➤ Data science includes: ➤ Visualization ➤ Statistics ➤ Machine learning ➤ Deep learning ➤ Big data ➤ And related methods ➤ ≈ Data mining 3 We will introduce.
  • 4.
  • 5. ➤ It's kind of outdated, but still contains lot of keywords. ➤ MrMimic/data-scientist-roadmap – GitHub ➤ Becoming a Data Scientist – Curriculum via Metromap 5
  • 6. ➤ Machine learning = statistics - checking of assumptions 😆 ➤ But does resolve more problems. " ➤ Statistics constructs more solid inferences. ➤ Machine learning constructs more interesting predictions. Statistics vs. Machine Learning 6
  • 7. Probability, Descriptive Statistics, and Inferential Statistics 7 Population Sample Probability Descriptive
 Statistics Inferential Statistics
  • 8. ➤ Deep learning is the most renowned part of machine learning. ➤ A.k.a. the “AI”. ➤ Deep learning uses artificial neural networks (NNs). ➤ Which are especially good at: ➤ Computer vision (CV) 👀 ➤ Natural language processing (NLP) 📖 ➤ Machine translation ➤ Speech recognition ➤ Too costly to simple problems. Machine Learning vs. Deep Learning 8
  • 9. Big Data ➤ The “size” is constantly moving. ➤ As of 2012, ranges from 10n TB to n PB, which is 100x. ➤ Has high-3Vs: ➤ Volume, amount of data. ➤ Velocity, speed of data in and out. ➤ Variety, range of data types and sources. ➤ A practical definition: ➤ A single computer can't process in a reasonable time. ➤ Distributed computing is a big deal. 9
  • 10. Today, ➤ “Models” are the math models. ➤ “Statistical models”, emphasize inferences. ➤ “Machine learning models”, emphasize predictions. ➤ “Deep learning” and “big data” are gigantic subfields. ➤ We won't introduce. ➤ But the learning resources are listed at the end. 10
  • 11. Mosky ➤ Python Charmer at Pinkoi. ➤ Has spoken at ➤ PyCons in 
 TW, MY, KR, JP, SG, HK,
 COSCUPs, and TEDx, etc. ➤ Countless hours 
 on teaching Python. ➤ Own the Python packages: ➤ ZIPCodeTW, 
 MoSQL, Clime, etc. ➤ http://mosky.tw/ 11
  • 12. The Outline ➤ “Data” ➤ The Analysis Steps ➤ Visualization ➤ Preprocessing ➤ Dimensionality Reduction ➤ Statistical Models ➤ Machine Learning Models ➤ Keep Learning 12
  • 13. The Packages ➤ $ pip3 install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn ➤ Or ➤ > conda install jupyter numpy scipy sympy matplotlib ipython pandas seaborn statsmodels scikit-learn
  • 14. Common Jupyter Notebook Shortcuts ➤ Esc: edit mode → command mode. ➤ Ctrl-Enter: run the cell. ➤ B: insert a cell below. ➤ D, D: delete the current cell. ➤ M: convert to a Markdown cell. ➤ Cmd-/: comment the code. ➤ H: show keyboard shortcuts. ➤ P: open the command palette.
  • 15. Checkpoint: The Packages ➤ Open 00_preface_the_packages.ipynb. ➤ Run it. ➤ The notebooks are available at https://github.com/moskytw/data-science-with-python.
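A quick way to confirm the install worked, outside the notebook, is to ask Python whether each package can be found. This is only a sketch using the standard library; the mapping from pip names to import names below is an assumption added here for illustration (note that scikit-learn is imported as `sklearn` and ipython as `IPython`), and it is not the content of the checkpoint notebook.

```python
import importlib.util

# Assumed mapping: pip name -> import name, for the packages on the install slide.
PACKAGES = {
    "jupyter": "jupyter",
    "numpy": "numpy",
    "scipy": "scipy",
    "sympy": "sympy",
    "matplotlib": "matplotlib",
    "ipython": "IPython",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "statsmodels": "statsmodels",
    "scikit-learn": "sklearn",
}


def check_packages(packages=PACKAGES):
    """Return {pip_name: bool}, True when the import name can be located."""
    return {
        pip_name: importlib.util.find_spec(import_name) is not None
        for pip_name, import_name in packages.items()
    }


# Print a small report; nothing is imported, so this is fast and side-effect free.
for name, found in check_packages().items():
    print(f"{name:<14}{'ok' if found else 'MISSING'}")
```

Anything reported as MISSING can be installed again with the pip or conda command from the previous slide.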
  • 17. “Data” ➤ = Variables ➤ = Dimensions ➤ = Labels + Features
  • 18. Data in Different Types ➤ Discrete: ➤ Nominal, e.g., {male, female}. ➤ Ordinal (ranked): ↑ & can be ordered, e.g., {great > good > fair}. ➤ Continuous: ➤ Interval: ↑ & distance is meaningful, e.g., temperatures. ➤ Ratio: ↑ & 0 is meaningful, e.g., weights.
  • 19. Data in the X-Y Form ➤ y: dependent variable, response variable, regressand, endogenous variable (endog), outcome, label. ➤ x: independent variable, explanatory variable, regressor, exogenous variable (exog), design, feature.
  • 20. ➤ Confounding variables: ➤ May affect y, but not x. ➤ May lead to erroneous conclusions, “garbage in, garbage out”. ➤ Controlling, e.g., fix the environment. ➤ Randomizing, e.g., choose by computer. ➤ Matching, e.g., order by gender and then assign groups. ➤ Statistical control, e.g., BMI to remove the height effect. ➤ Double-blind, even triple-blind trials.
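The “randomizing, e.g., choose by computer” item can be sketched in a few lines. The function name `randomize_groups` and the user IDs are hypothetical, chosen only for illustration; the idea is simply to shuffle the subjects and deal them into groups so that no human choice can correlate with the outcome.

```python
import random


def randomize_groups(subjects, n_groups=2, seed=None):
    """Shuffle the subjects, then deal them round-robin into n_groups."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


# Hypothetical subjects.
users = [f"user_{i}" for i in range(10)]
control, treatment = randomize_groups(users, seed=42)
print("control:  ", control)
print("treatment:", treatment)
```

Fixing the seed is a design choice: it lets the assignment be audited later, at the cost of being predictable to anyone who knows the seed.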
  • 21. Get the Data ➤ Logs ➤ Existing datasets ➤ The Datasets Package – StatsModels ➤ Kaggle ➤ Experiments
  • 23. The Three Steps 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 24. 1. Define Assumption ➤ Specify a feasible objective. ➤ “Use AI to get the moon!” ➤ Write a formal assumption. ➤ “The users will buy 1% of the items from our recommendation.” rather than “The users will love our recommendation!” ➤ Note the dangerous gaps. ➤ “All the items from recommendation are free!” ➤ “Correlation does not imply causation.” ➤ Consider the next actions. ➤ “Release to 100% of users.” rather than “So great!”
  • 25. 2. Validate Assumption ➤ Collect potential data. ➤ List possible methods. ➤ A plot, a median, or even a mean may be good enough. ➤ Selecting Statistical Tests – Bates College ➤ Choosing a statistical test – HBS ➤ Choosing the right estimator – Scikit-Learn ➤ Evaluate the metrics of methods with data.
  • 26. 3. Validated Assumption? ➤ Yes → Congrats! Report fully and take the actions! 🎉 ➤ No → Check: ➤ The hypotheses of methods. ➤ The confounding variables in data. ➤ The formality of assumption. ➤ The feasibility of objective.
  • 27. Iterate Fast While Industry Changes Rapidly ➤ Resolve the small problems first. ➤ Resolve the high impact/effort problems first. ➤ One week to get a quick result and improve rather than one year to get the may-be-the-best result. ➤ Fail fast!
  • 28. Checkpoint: Pick a Method ➤ Think of an interesting problem. ➤ E.g., revenue is higher, but is it random? ➤ Pick one method from the cheatsheets. ➤ Selecting Statistical Tests – Bates College ➤ Choosing a statistical test – HBS ➤ Choosing the right estimator – Scikit-Learn ➤ Remember the three analysis steps.
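For the “revenue is higher, but is it random?” example, one method that needs almost no distributional assumptions is a permutation test. It is not necessarily the method the cheatsheets would point to; it is swapped in here because it fits in plain Python. The revenue figures and the function name `permutation_test` are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference of group means.

    Repeatedly shuffles the pooled data and counts how often a random
    split gives a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical daily revenue before and after a change.
before = [100, 102, 98, 101, 99, 103, 97]
after = [108, 110, 107, 111, 109, 112, 106]
p_value = permutation_test(before, after)
print(f"p-value = {p_value:.4f}")
```

A small p-value says a difference this large would rarely appear if the group labels were assigned at random; it does not by itself rule out confounding variables.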
  • 30. Visualization ➤ Make Data Colorful – Plotting ➤ 01_1_visualization_plotting.ipynb ➤ In a Statistical Way – Descriptive Statistics ➤ 01_2_visualization_descriptive_statistics.ipynb
  • 31. Checkpoint: Plot the Variables ➤ Star98 ➤ star98_df = sm.datasets.star98.load_pandas().data ➤ Fair ➤ fair_df = sm.datasets.fair.load_pandas().data ➤ Howell1 ➤ howell1_df = pd.read_csv('dataset_howell1.csv', sep=';') ➤ Or your own datasets. ➤ Plot the variables that interest you.
  • 33. Feed the Data That Models Like ➤ Preprocess data for: ➤ Hard requirements, e.g., ➤ corpus → vectors ➤ “What kind of news will be voted down on PTT?” ➤ Soft requirements (hypotheses), e.g., ➤ t-test: better when samples are normally distributed. ➤ SVM: better when features range from -1 to 1. ➤ More representative features, e.g., total price / units. ➤ Note that different models have different tastes.
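Two of the items above, a more representative feature (total price / units) and a feature range of -1 to 1, can be sketched together. The order data and the helper name `scale_to_range` are hypothetical; this is a plain min-max rescaling, one of several ways to reach that range.

```python
def scale_to_range(values, low=-1.0, high=1.0):
    """Linearly map values so that min(values) -> low and max(values) -> high."""
    v_min, v_max = min(values), max(values)
    if v_min == v_max:
        raise ValueError("cannot scale a constant feature")
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


# Hypothetical orders: derive a per-unit price, then scale it to [-1, 1].
total_prices = [120.0, 300.0, 90.0, 450.0]
units = [2, 3, 1, 3]
unit_prices = [price / n for price, n in zip(total_prices, units)]
scaled = scale_to_range(unit_prices)
print(unit_prices)
print(scaled)
```

Min-max scaling is sensitive to extreme values, since a single outlier sets the range for every other point.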
  • 34. Preprocessing ➤ The Dishes – Containers ➤ 02_1_preprocessing_containers.ipynb ➤ A Cooking Method – Standardization ➤ 02_2_preprocessing_standardization.ipynb ➤ Watch Out for Poisonous Data Points – Removing Outliers ➤ 02_3_preprocessing_removing_outliers.ipynb
  • 35. Checkpoint: Preprocess the Variables ➤ Try to standardize and compare. ➤ Try to trim the outliers.
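As a starting point for this checkpoint, here is a minimal standard-library sketch of both steps. It assumes z-score standardization and Tukey's 1.5 × IQR rule for outliers; the notebooks may use different tools or thresholds, and the height values are made up.

```python
from statistics import mean, quantiles, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def trim_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical heights in cm; 250 looks like a data-entry error.
heights = [150, 152, 155, 158, 160, 162, 165, 168, 170, 250]
trimmed = trim_outliers(heights)
z_scores = standardize(trimmed)
print(trimmed)
print([round(z, 2) for z in z_scores])
```

Trimming before standardizing matters here: an extreme point inflates both the mean and the standard deviation, which squeezes every other z-score toward zero.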
  • 37. The Model Sicks Up! ➤ Let's reduce the variables. ➤ Feed a subset → feature selection. ➤ Feature selection using SelectFromModel – Scikit-Learn ➤ Feed a transformation → feature extraction. ➤ PCA, FA, etc. ➤ Another definition: non-numbers → numbers.
  • 38. Dimensionality Reduction ➤ Principal Component Analysis ➤ 03_1_dimensionality_reduction_principal_component_analysis.ipynb ➤ Factor Analysis ➤ 03_2_dimensionality_reduction_factor_analysis.ipynb
  • 39. Checkpoint: Reduce the Variables ➤ Try PCA on all variables → the better components, or try FA. ➤ And then plot the n-dimensional data onto a 2-dimensional plane.
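To make the n-dimensional → 2-dimensional step concrete, here is a minimal PCA sketch built on NumPy's SVD rather than on a library PCA class. The function name `pca` and the synthetic data are assumptions for illustration; the notebook may well use scikit-learn instead.

```python
import numpy as np


def pca(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Returns the projected data and the explained variance ratio
    of each kept component.
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S**2 / np.sum(S**2))[:n_components]
    return X_centered @ components.T, explained_variance_ratio


# Synthetic data: 100 samples, 5 variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))

X_2d, ratio = pca(X)
print(X_2d.shape)   # two columns, ready for a scatter plot
print(ratio.sum())  # share of the variance kept by the two components
```

The two columns of `X_2d` can be passed straight to a scatter plot; if the explained variance ratio is low, the 2D picture is hiding much of the structure.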
  • 41. Statistical Models ➤ Identify Boring or Interesting – Hypothesis Testing ➤ 04_1_statistical_models_hypothesis_testings.ipynb ➤ “Hypothesis Testing With Python” ➤ Identify X-Y Relationships – Regression ➤ 04_2_statistical_models_regression_anova.ipynb
  • 42. More Regression Models ➤ If y is not linear, ➤ Logit or Poisson Regression | Generalized Linear Models, GLMs ➤ If y is correlated, ➤ Linear Mixed Models, LMMs | Generalized Estimating Equation, GEE ➤ If x has multicollinearity, ➤ Lasso or Ridge Regression ➤ If the error term is heteroscedastic, ➤ Weighted Least Squares, WLS | Generalized Least Squares, GLS ➤ If x is a time series (predict x_0 from x_{-1} rather than y from x), ➤ Autoregressive Integrated Moving Average, ARIMA
  • 43. Checkpoint: Apply a Statistical Method ➤ Try to apply the analysis steps with a statistical method. 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 45. Machine Learning Models ➤ Apple or Orange? – Classification ➤ 05_1_machine_learning_models_classification.ipynb ➤ Without Labels – Clustering ➤ 05_2_machine_learning_models_clustering.ipynb ➤ Predict the Values – Regression ➤ Who Are the Best? – Model Selection ➤ sklearn.model_selection.GridSearchCV
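For the model selection item, a minimal use of `sklearn.model_selection.GridSearchCV` might look like the sketch below. The choice of the iris dataset, the `SVC` estimator, and this particular parameter grid are assumptions made for illustration, not taken from the notebooks.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A small built-in classification dataset.
X, y = load_iris(return_X_y=True)

# Candidate hyperparameters; every combination is tried with 5-fold CV.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy
```

The score reported here was used to pick the winner, so it is optimistic; a held-out test set gives a fairer estimate of the selected model.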
  • 46. Confusion matrix, where A = 00₂ = C[0, 0] ➤ Columns: predicted - (AC), predicted + (BD). ➤ Row actual - (AB): true - A, false + B. ➤ Row actual + (CD): false - C, true + D.
  • 47. Common “rates” in confusion matrix ➤ precision = D / (B + D) ➤ recall = D / (C + D) ➤ sensitivity = D / (C + D) = recall = observed power ➤ specificity = A / (A + B) = observed confidence level ➤ false positive rate = B / (A + B) = observed α ➤ false negative rate = C / (C + D) = observed β
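The rates above translate directly into code. In this sketch A, B, C, and D follow the layout of the confusion matrix slide (A = true negatives, B = false positives, C = false negatives, D = true positives); the function name and the counts are hypothetical.

```python
def confusion_rates(A, B, C, D):
    """Compute the common rates from the four confusion-matrix cells.

    A: true negatives, B: false positives,
    C: false negatives, D: true positives.
    """
    return {
        "precision": D / (B + D),
        "recall": D / (C + D),  # also called sensitivity
        "specificity": A / (A + B),
        "false_positive_rate": B / (A + B),
        "false_negative_rate": C / (C + D),
    }


# Hypothetical counts.
rates = confusion_rates(A=90, B=10, C=5, D=45)
for name, value in rates.items():
    print(f"{name:<20}{value:.3f}")
```

Two identities make a handy sanity check: specificity + false positive rate = 1, and recall + false negative rate = 1.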
  • 48. Ensemble Models ➤ Bagging ➤ N independent models whose outputs are averaged. ➤ E.g., the random forest models. ➤ Boosting ➤ N sequential models; the n-th model learns from the (n-1)-th model's errors. ➤ E.g., gradient tree boosting.
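The bagging idea, N independent models averaged, can be shown without any library. This sketch deliberately swaps in a simple least-squares line as the base model instead of the decision trees a random forest uses, so it illustrates only the resample-and-average mechanism; all names and data are hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bagging_predict(points, x_new, n_models=50, seed=0):
    """Average the predictions of models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = rng.choices(points, k=len(points))  # sample with replacement
        if len({x for x, _ in sample}) < 2:
            continue  # a line cannot be fitted to a single distinct x
        slope, intercept = fit_line(sample)
        predictions.append(slope * x_new + intercept)
    return sum(predictions) / len(predictions)


# Synthetic data around y = 2x + 1 with Gaussian noise.
noise = random.Random(1)
points = [(x, 2 * x + 1 + noise.gauss(0, 1)) for x in range(20)]
print(bagging_predict(points, x_new=10))
```

Boosting differs in that the models are not independent: each one would be fitted to the errors left by the previous ones instead of to a fresh resample.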
  • 49. Checkpoint: Apply a Machine Learning Method ➤ Try to apply the analysis steps with a ML method. 1. Define Assumption 2. Validate Assumption 3. Validated Assumption?
  • 51. Keep Learning ➤ Statistics ➤ Seeing Theory ➤ Biological Statistics ➤ scipy.stats + StatsModels ➤ Research Methods ➤ Machine Learning ➤ Scikit-learn Tutorials ➤ Stanford CS229 ➤ Hsuan-Tien Lin ➤ Deep Learning ➤ TensorFlow | PyTorch ➤ Stanford CS231n ➤ Stanford CS224n ➤ Big Data ➤ Dask ➤ Hive ➤ Spark ➤ HBase ➤ AWS
  • 52. The Facts ➤ ∵ ➤ You can't learn everything in data science! ➤ ∴ ➤ “Let's learn to do” ❌ ➤ “Let's do to learn” ✅
  • 53. The Learning Flow 1. Ask a question. ➤ “How to tell the differences confidently?” 2. Explore the references. ➤ “T-test, ANOVA, ...” 3. Digest into an answer. ➤ Explore breadth-first. ➤ Write the code. ➤ Make it work, make it right, finally make it fast.
  • 54. Recap ➤ Let's do to learn, not learn to do. ➤ What is your objective? ➤ For the objective, what is your assumption? ➤ For the assumption, what method may validate it? ➤ For the method, how will you evaluate it with data? ➤ Q & A