Basics of Python for Data Analysis
Pramod Toraskar
Why learn Python for data analysis?
Here are some reasons which go in favour of learning Python:
● Open Source – free to install
● Awesome online community
● Very easy to learn
● Can become a common language for data science and for producing web-based analytics products
Choosing a development environment
1. Terminal / shell based
2. IDLE (default environment)
3. IPython notebook, similar to Markdown in R
IPython environment (Jupyter): http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/install.html
Recall Python libraries and data structures
Data structures: lists, strings, tuples, dictionaries.
The following libraries are needed for most scientific computing and data analysis:
● NumPy (Numerical Python). The most powerful feature of NumPy is the n-dimensional array. The library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with lower-level languages such as Fortran, C and C++.
● SciPy (Scientific Python). SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules such as discrete Fourier transforms, linear algebra, optimization and sparse matrices.
● Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in the IPython notebook (ipython notebook --pylab=inline) to show these plots inline. Without the inline option, pylab turns the IPython environment into one very similar to MATLAB. In current Jupyter versions, the %matplotlib inline magic serves the same purpose. You can also use LaTeX commands to add math to your plots.
● Pandas for structured data operations and manipulations. It is used extensively for data munging and preparation. Pandas was added to the Python ecosystem relatively recently and has been instrumental in boosting Python's usage in the data science community.
● Scikit-learn for machine learning. Built on NumPy, SciPy and Matplotlib, this library contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction.
● Statsmodels (statistical modeling), Seaborn (statistical data visualization) and Bokeh (interactive plots, dashboards and data applications for modern web browsers, with elegant and concise graphics in the style of D3.js).
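As a rough illustration of the NumPy features listed above, the following sketch builds a small array, solves a linear system, takes a Fourier transform and draws random numbers. The array values are arbitrary and chosen only for the demonstration.

```python
import numpy as np

# A 2-D array (matrix); NumPy arrays can have any number of dimensions.
a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Basic linear algebra: solve the system a @ x = b.
x = np.linalg.solve(a, b)

# Fourier transform of a short constant signal.
spectrum = np.fft.fft(np.ones(4))

# Reproducible random numbers from a seeded generator.
rng = np.random.default_rng(0)
samples = rng.normal(size=(3, 2))

print(x)
print(spectrum)
print(samples.shape)
```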
Key phases
The 3 key phases
01
Data Exploration:
Finding out more about the data we have
● numpy
● matplotlib
● Pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Read the dataset into a DataFrame using pandas
df = pd.read_csv("/home/ptoraska/Downloads/Loan_Prediction/train.csv")
Data Exploration
Once you have read the dataset, you can look at the first few rows using the head() function:
df.head(10)
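Beyond head(), a couple of other pandas calls are commonly used for a first look at a dataset. The sketch below uses a tiny made-up DataFrame (df_demo, hypothetical values) instead of the loan file, so it runs without the CSV. The column names are borrowed from the slides.

```python
import pandas as pd

# Hypothetical stand-in for the loan data; the values are made up for illustration.
df_demo = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000],
    "LoanAmount": [128.0, 66.0, None, 120.0, 141.0],
    "Property_Area": ["Urban", "Rural", "Urban", "Urban", "Semiurban"],
})

first_rows = df_demo.head(3)      # first three rows
summary = df_demo.describe()      # count, mean, std, min, quartiles, max of numeric columns
area_counts = df_demo["Property_Area"].value_counts()  # frequency of each category

print(first_rows)
print(summary)
print(area_counts)
```

Note that describe() reports a count of non-missing values per column, which already hints at where data is missing.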
The 3 key phases
02
Data Munging:
Cleaning the data and playing with it to make it better suit statistical
modeling.
1. Some variables have missing values. We should estimate those values wisely, depending on how many values are missing and how important the variable is expected to be.
2. While looking at the distributions, we saw that Applicant Income and Loan Amount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.
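One possible way to handle the two issues above is sketched below. The slides do not say which treatment to use, so both choices here are assumptions: missing loan amounts are filled with the column median, and extreme values are dampened with a log transform. The DataFrame df_demo and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one extreme value per column.
df_demo = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 81000],
    "LoanAmount": [128.0, 66.0, np.nan, 120.0, 700.0],
})

# Assumption: impute missing loan amounts with the median of the observed values.
median_amount = df_demo["LoanAmount"].median()
df_demo["LoanAmount"] = df_demo["LoanAmount"].fillna(median_amount)

# Assumption: a log transform reduces the influence of extreme values
# without dropping the rows that contain them.
df_demo["LoanAmount_log"] = np.log(df_demo["LoanAmount"])
df_demo["ApplicantIncome_log"] = np.log(df_demo["ApplicantIncome"])

print(df_demo)
```

The median is less sensitive to extreme values than the mean, which is why it is often preferred for skewed columns such as these.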
Check missing values in the dataset
Let us look at the missing values in all the variables, because most models don't work with missing data, and even when they do, imputing the values helps more often than not. So, let us check the number of nulls / NaNs in each column of the dataset:
df.apply(lambda x: sum(x.isnull()), axis=0)
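The same per-column count can also be written with the built-in pandas methods isnull() and sum(). The sketch below shows this on a small made-up DataFrame (df_demo is hypothetical) and also computes the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
df_demo = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 120.0],
    "Loan_Status": ["Y", "N", "Y", "N"],
})

# Number of missing values per column.
null_counts = df_demo.isnull().sum()

# Fraction of missing values per column (mean of a boolean mask).
null_share = df_demo.isnull().mean()

print(null_counts)
print(null_share)
```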
The 3 key phases
03
Predictive Modeling:
Running the actual algorithms and having fun
After we have made the data useful for modeling, we can build a predictive model. scikit-learn (sklearn) is the most commonly used library in Python for this purpose.
Building a Predictive Model in Python
sklearn requires all inputs to be numeric, so we should convert all our categorical variables into numeric ones by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
# Categorical columns to encode (assumes missing values were already filled)
var_mod = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for i in var_mod:
    # Replace each category with an integer code
    df[i] = le.fit_transform(df[i])
df.dtypes  # confirm that the encoded columns are now numeric
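To see what the encoding does, here is a minimal sketch on a made-up two-column DataFrame (df_demo is hypothetical). Unlike the loop above, it keeps one fitted encoder per column. That is a design choice of this sketch, made so that the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data with two categorical columns.
df_demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y"],
})

encoders = {}
for col in ["Gender", "Loan_Status"]:
    le = LabelEncoder()
    # Classes are sorted, then replaced by their index in that order.
    df_demo[col] = le.fit_transform(df_demo[col])
    encoders[col] = le

# Map the integer codes back to the original labels.
original_status = encoders["Loan_Status"].inverse_transform(df_demo["Loan_Status"])

print(df_demo)
print(list(encoders["Gender"].classes_))
print(list(original_status))
```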
Models
Logistic Regression: a classification algorithm.
Decision Tree: a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems.
Random Forest: a versatile machine learning method capable of performing both regression and classification tasks.
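The three models can be tried side by side with scikit-learn. The sketch below is only an illustration under stated assumptions: it uses synthetic data from make_classification rather than the loan dataset, and the hyperparameters (max_iter, max_depth, n_estimators) are arbitrary choices, not values from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the encoded loan data).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Scoring on a held-out test set rather than the training data gives a fairer picture of how each model generalizes.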
Thank you.
