SlideShare ist ein Scribd-Unternehmen logo
1 von 43
Python Toolboxes for
Data Scientists
BY
UBAH JOHNSON C
Introduction
The toolbox of any data scientist, as for any kind of programmer, is an
essential ingredient for success and enhanced performance. Choosing the
right tools can save a lot of time and thereby allow us to focus on data
analysis.
Python is a programming language with excellent properties for new
programmers making it ideal for people who have never programmed
before.
Python Properties
Easy to read codes
Suppression of non-mandatory delimiters
Dynamic typing
Dynamic memory usage
It is an interpreted language
Python libraries for data scientists
The most popular python toolboxes for any data scientist are
Numpy
Scipy
Pandas and
Scikit-Learn
Scikit-learn:
Scikit-learn is a machine learning library built from Numpy, Scipy and
Matplotlib.
It offers simple and efficient tools for data analysis such as
classification, regression, clustering, dimensionality reduction, model
selection and preprocessing.
Pandas: Python data analysis library
Pandas provides high-performance data structure and data analysis tools.
Its main feature is a fast and efficient Dataframe object for data manipulation with
integrated indexing.
DataFrame acts as a spreadsheet. It provides functions for aggregating, merging, and
joining dataset.
it also has tools for exporting and importing different file formats. Such as CSV, text
files, Microsoft excel, SQL databases, and fast HDF5 formats.
In many situations, the data you have in such formats will
not be complete or totally structured. For such cases,
Pandas offers handling of missing data and intelligent data
alignment. Furthermore, Pandas provides a convenient
Matplotlib interface.
Installations for Data sciences
We have two options for installation after choosing which python version
(Python 2.X and Python 3.X) to work with. They are;
Installing the data scientist python ecosystem by individual tool boxes or
To perform a bundle installation with all the needed toolboxes
For newbies, the second option is recommended, hence Anaconda python
distribution is a good option.
The Anaconda distribution provides integration of all the Python toolboxes and
applications needed for data scientists into a single directory without mixing it with
other Python toolboxes installed on the machine.
It contains, of course, the core toolboxes and applications such as NumPy, Pandas,
SciPy, Matplotlib, Scikit-learn, IPython, Spyder, etc., but also more specific tools for
other related tasks such as data visualization, code optimization, and big data
processing.
Integrated Development Environment for
python
The integrated development environment (IDE) is an essential tool for programmers
and data scientists.
Spyder (Scientific Python Development EnviRonment) is an IDE customized with the
task of the data scientist in mind.
Web Integrated Development Environment (WIDE): Jupyter.
Jupyter (for Julia, Python and R) aims to reuse the same WIDE for all these interpreted
languages and not just Python
Jupyter Notebook
It will take you through the jupyter notebook opening process
If we chose the bundle installation, we can start the Jupyter notebook platform by clicking on
the Jupyter Notebook icon installed by Anaconda in the start menu or on the desktop.
The browser will immediately be launched displaying the Jupyter notebook homepage, whose
URL is http://localhost:8888/tree.
So we need to press the new button at the top right of the homepage.
Then rename the notebook
Jupyter notebook
Let us begin by importing those toolboxes that we will need for our program.
To execute just one cell, we press the button or click on Cell Run or press the keys
Ctrl+Enter.
The DataFrame Data Structure
The key data structure in Pandas is the DataFrameobject.
A DataFrame is basically a tabular data structure, with rows and columns. Rows have a
specific index to access them, which can be any name or value. In Pandas, the columns
are called Series, a special type of data, which in essence consists of a list of several
values, where each value has an index.
Therefore, the DataFrame data structure can be seen as a spreadsheet, but it is much
more flexible
Dataframe
we will create a new cell by clicking Insert, Insert Cell Below or pressing the keysCtrl+B. Then, we
write in the following code:
We use the pandas DataFrame object constructor with a dictionary of lists as
argument. The value of each entry in the dictionary is the name of the column,
and the lists are their values.
The DataFrame columns can be arranged at construction time by entering a
keyword columns with a list of the names of the columns ordered as we want.
If the column keyword is not present in the constructor, the columns will be
arranged in alphabetical order.
Now, if we execute this cell, the result
will be a table like this:
Reading files using Pandas
For files to be read, ensure that the file is stored in the same directory as our notebook
directory.
Output of the file read
The way to read CSV (or any other separated value, providing the
separator character) files in Pandas is by calling the read_csv method.
Besides the name of the file, we add the na_values key argument to this
method along with the character that represents “non available data” in
the file. Normally, CSV files have a header with the names of the columns.
If this is the case, we can use the usecols parameter to select which
columns in the file will be used.
Pandas also has functions for reading files with formats such as Excel, HDF5, tabulated
files, or even the content from the clipboard (read_excel(), read_hdf (), read_table (),
read_clipboard ()).
Whichever function we use, the result of reading a file is stored as a DataFrame
structure.
we can use thehead()method, which shows just the first five rows. If we use a number
as an argument to this method, this will be the number of rows that will be listed:
The head() method
The Tail() method
Describe() method
Used to show a quick statistical information on all the numeric column in a
DataFrame.
The result shows the count, the mean, the standard deviation, the minimum and
maximum, and the percentiles, by default, the 25th, 50th, and 75th, for all the
values in each column or series.
Selecting Data
If we want to select a subset of data from a DataFrame, it is necessary to
indicate this subset using square brackets ([]) after the DataFrame.
Edu[‘value’] for selecting columns
Edu[10:14] for selecting rows
Edu.ix[90:94, [‘time’,’geo’]] for selecting a subset of rows and columns.
Filtering Data
Another way to select a subset of data is by applying Boolean indexing. This
indexing is commonly known as a filter. For instance, if we want to filter those
values less than or equal to 6.5, we can do it like this:
Edu[edu[‘value’]>6.5].tail()
The following Boolean operators can be used for filtering:< (less than),<=(less
than or equal to),>(greater than),>=(greater than or equal to),=(equal to),
and!=(not equal to)
Filtering missing values
Pandas uses the special value NaN(not a number) to represent missing values. In
Python, NaN is a special floating-point value returned by certain operations
when one of their results ends in an undefined value.
A subtle feature of NaN values is that two NaN are never equal. Because of this,
the only safe way to tell whether a value is missing in a DataFrame is by using
the isnull()function. Indeed, this function can be used to filter rows with missing
values
Filtering missing values
List of Aggregation functions
Manipulating Data
We can check the maximum and minimum values using .min(axis=0 or 1) or .max(axis=0
or 1) function.
Print “Pandas max function:”, edu[‘value’].max()
we can apply any binary arithmetical operation (+,-,*,/) to an entire row in a column. E.g
s =edu[‘value’]/100
we apply the sqrt function from the NumPy library to perform the square root of each
value in the Value column. E.g s=edu[‘value’].apply(np.sqrt)
If we need to design a specific function to apply it, we can write an in-line
function, commonly known as a λ-function. A λ-function is a function without a
name. It is only necessary to specify the parameters it receives, between the
lambda keyword and the colon (:). In the next example, only one parameter is
needed, which will be the value of each element in the Value column.
Another basic manipulation operation is to set new values in our DataFrame. This can be done
directly using the assign operator (=) over a DataFrame.
Now, if we want to remove this column from the DataFrame, we can use the drop function; this
removes the indicated rows if axis=0, or the indicated columns if axis=1
Manipulating Data
oEdu.drop(‘valueNorm’, axis=1, inplace =True)
oUse the append function to add a new row at the buttom of the DataFrame.
oEdu =edu.append({“time”:2000, “Value”:5.00,”Geo”:’a’} ignore_index=True)
oDropping a row: edu.drop(max(edu.index).axus=0, inplace=True)
oThe drop function is also used to remove missing values by applying it over the isnull() function
oeduDrop = edu.drop(edu[“value”].isnull(), axis=0)
oeduDrop.head()
Sorting
Another important functionality we will need when inspecting our data is to sort by columns.
For sorting in ascending order we use the code
For sorting in descending order we use this code
Grouping Data
Pandas groupby () function allows us to group data.
To have a proper DataFrame as a result, it is necessary to apply an aggregation
function. Thus, this function will be applied to all the values in the same group.
Rearranging Data
We can transform the arrangement of our data, redistributing the indexes and
columns for better manipulation of our data, which normally leads to better
performance. We can rearrange our data using the pivot_table function. Here,
we can specify which columns will be the new indexes, the new values, and the
new columns.
Now we can use the new index to select specific rows by label, using the ix operator:
Ranking Data
Another useful visualization feature is to rank data. For example, we would like to know
how each country is ranked by year. To see this, we will use the pandas rank function.
But first, we need to clean up our previous pivoted table a bit so that it only has real
countries with real data.
To do this, first we drop the Euro area entries and shorten the Germany name entry,
using there name function and then we drop all the rows containing any NaN, using the
dropna function.
Ranking
Plotting Data
Pandas Dataframes and series can be plotted using the plot() function which
uses the library for graphics, Matplotlib
Plotting Data
It is also possible to plot a DataFrame directly. In this case, each column is
treated as a separated Series. For example, instead of printing the accumulated
value over the years, we can plot the value for each year.
Thanks!

Weitere ähnliche Inhalte

Was ist angesagt? (20)

9 python data structure-2
9 python data structure-29 python data structure-2
9 python data structure-2
 
Introduction To R Language
Introduction To R LanguageIntroduction To R Language
Introduction To R Language
 
Data structures and algorithms
Data structures and algorithmsData structures and algorithms
Data structures and algorithms
 
Chapter 15 Lists
Chapter 15 ListsChapter 15 Lists
Chapter 15 Lists
 
R training2
R training2R training2
R training2
 
LectureNotes-05-DSA
LectureNotes-05-DSALectureNotes-05-DSA
LectureNotes-05-DSA
 
Arrays in Data Structure and Algorithm
Arrays in Data Structure and Algorithm Arrays in Data Structure and Algorithm
Arrays in Data Structure and Algorithm
 
Basic data-structures-v.1.1
Basic data-structures-v.1.1Basic data-structures-v.1.1
Basic data-structures-v.1.1
 
Lecture-05-DSA
Lecture-05-DSALecture-05-DSA
Lecture-05-DSA
 
R programming by ganesh kavhar
R programming by ganesh kavharR programming by ganesh kavhar
R programming by ganesh kavhar
 
Python data type
Python data typePython data type
Python data type
 
Data structure using c++
Data structure using c++Data structure using c++
Data structure using c++
 
Ap Power Point Chpt9
Ap Power Point Chpt9Ap Power Point Chpt9
Ap Power Point Chpt9
 
Data Structures (CS8391)
Data Structures (CS8391)Data Structures (CS8391)
Data Structures (CS8391)
 
Data structure
Data structureData structure
Data structure
 
Chapter 17 Tuples
Chapter 17 TuplesChapter 17 Tuples
Chapter 17 Tuples
 
Introduction to Data Structure
Introduction to Data Structure Introduction to Data Structure
Introduction to Data Structure
 
Elementary data structure
Elementary data structureElementary data structure
Elementary data structure
 
9781439035665 ppt ch09
9781439035665 ppt ch099781439035665 ppt ch09
9781439035665 ppt ch09
 
DATA STRUCTURES USING C -ENGGDIGEST
DATA STRUCTURES USING C -ENGGDIGESTDATA STRUCTURES USING C -ENGGDIGEST
DATA STRUCTURES USING C -ENGGDIGEST
 

Ähnlich wie Lecture 3 intro2data

Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018DataLab Community
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxParveenShaik21
 
XII IP New PYTHN Python Pandas 2020-21.pptx
XII IP New PYTHN Python Pandas 2020-21.pptxXII IP New PYTHN Python Pandas 2020-21.pptx
XII IP New PYTHN Python Pandas 2020-21.pptxlekha572836
 
Unit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxUnit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxprakashvs7
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxPremaGanesh1
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningmy6305874
 
pandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdfpandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdfscorsam1
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using pythonPurna Chander
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
Congrats ! You got your Data Science Job
Congrats ! You got your Data Science JobCongrats ! You got your Data Science Job
Congrats ! You got your Data Science JobRohit Dubey
 

Ähnlich wie Lecture 3 intro2data (20)

Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018Meetup Junio Data Analysis with python 2018
Meetup Junio Data Analysis with python 2018
 
Unit 3_Numpy_VP.pptx
Unit 3_Numpy_VP.pptxUnit 3_Numpy_VP.pptx
Unit 3_Numpy_VP.pptx
 
Python-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptxPython-for-Data-Analysis.pptx
Python-for-Data-Analysis.pptx
 
4)12th_L-1_PYTHON-PANDAS-I.pptx
4)12th_L-1_PYTHON-PANDAS-I.pptx4)12th_L-1_PYTHON-PANDAS-I.pptx
4)12th_L-1_PYTHON-PANDAS-I.pptx
 
Unit 3_Numpy_VP.pptx
Unit 3_Numpy_VP.pptxUnit 3_Numpy_VP.pptx
Unit 3_Numpy_VP.pptx
 
XII IP New PYTHN Python Pandas 2020-21.pptx
XII IP New PYTHN Python Pandas 2020-21.pptxXII IP New PYTHN Python Pandas 2020-21.pptx
XII IP New PYTHN Python Pandas 2020-21.pptx
 
Unit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptxUnit 3_Numpy_Vsp.pptx
Unit 3_Numpy_Vsp.pptx
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
interenship.pptx
interenship.pptxinterenship.pptx
interenship.pptx
 
Lecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learningLecture 1 Pandas Basics.pptx machine learning
Lecture 1 Pandas Basics.pptx machine learning
 
Data Management in Python
Data Management in PythonData Management in Python
Data Management in Python
 
Pandas.pptx
Pandas.pptxPandas.pptx
Pandas.pptx
 
pandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdfpandas-221217084954-937bb582.pdf
pandas-221217084954-937bb582.pdf
 
Data engineering and analytics using python
Data engineering and analytics using pythonData engineering and analytics using python
Data engineering and analytics using python
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
Congrats ! You got your Data Science Job
Congrats ! You got your Data Science JobCongrats ! You got your Data Science Job
Congrats ! You got your Data Science Job
 
Python for data analysis
Python for data analysisPython for data analysis
Python for data analysis
 

Mehr von Johnson Ubah

Supervised learning
Supervised learningSupervised learning
Supervised learningJohnson Ubah
 
Statistical inference with Python
Statistical inference with PythonStatistical inference with Python
Statistical inference with PythonJohnson Ubah
 
OSI reference Model
OSI reference ModelOSI reference Model
OSI reference ModelJohnson Ubah
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data scienceJohnson Ubah
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learningJohnson Ubah
 
Network and computer forensics
Network and computer forensicsNetwork and computer forensics
Network and computer forensicsJohnson Ubah
 

Mehr von Johnson Ubah (7)

Supervised learning
Supervised learningSupervised learning
Supervised learning
 
Statistical inference with Python
Statistical inference with PythonStatistical inference with Python
Statistical inference with Python
 
IP Addressing
IP AddressingIP Addressing
IP Addressing
 
OSI reference Model
OSI reference ModelOSI reference Model
OSI reference Model
 
introduction to data science
introduction to data scienceintroduction to data science
introduction to data science
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learning
 
Network and computer forensics
Network and computer forensicsNetwork and computer forensics
Network and computer forensics
 

Kürzlich hochgeladen

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 

Kürzlich hochgeladen (20)

Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 

Lecture 3 intro2data

  • 1. Python Toolboxes for Data Scientists BY UBAH JOHNSON C
  • 2. Introduction The toolbox of any data scientist, as for any kind of programmer, is an essential ingredient for success and enhanced performance. Choosing the right tools can save a lot of time and thereby allow us to focus on data analysis. Python is a programming language with excellent properties for new programmers making it ideal for people who have never programmed before.
  • 3. Python Properties Easy to read codes Suppression of non-mandatory delimiters Dynamic typing Dynamic memory usage It is an interpreted language
  • 4. Python libraries for data scientists The most popular python toolboxes for any data scientist are Numpy Scipy Pandas and Scikit-Learn
  • 5. Scikit-learn: Scikit-learn is a machine learning library built from Numpy, Scipy and Matplotlib. It offers simple and efficient tools for data analysis such as classification, regression, clustering, dimensionality reduction, model selection and preprocessing.
  • 6. Pandas: Python data analysis library Pandas provides high-performance data structure and data analysis tools. Its main feature is a fast and efficient Dataframe object for data manipulation with integrated indexing. DataFrame acts as a spreadsheet. It provides functions for aggregating, merging, and joining dataset. it also has tools for exporting and importing different file formats. Such as CSV, text files, Microsoft excel, SQL databases, and fast HDF5 formats.
  • 7. In many situations, the data you have in such formats will not be complete or totally structured. For such cases, Pandas offers handling of missing data and intelligent data alignment. Furthermore, Pandas provides a convenient Matplotlib interface.
  • 8. Installations for Data sciences We have two options for installation after choosing which python version (Python 2.X and Python 3.X) to work with. They are; Installing the data scientist python ecosystem by individual tool boxes or To perform a bundle installation with all the needed toolboxes For newbies, the second option is recommended, hence Anaconda python distribution is a good option.
  • 9. The Anaconda distribution provides integration of all the Python toolboxes and applications needed for data scientists into a single directory without mixing it with other Python toolboxes installed on the machine. It contains, of course, the core toolboxes and applications such as NumPy, Pandas, SciPy, Matplotlib, Scikit-learn, IPython, Spyder, etc., but also more specific tools for other related tasks such as data visualization, code optimization, and big data processing.
  • 10. Integrated Development Environment for python The integrated development environment (IDE) is an essential tool for programmers and data scientists. Spyder (Scientific Python Development EnviRonment) is an IDE customized with the task of the data scientist in mind. Web Integrated Development Environment (WIDE): Jupyter. Jupyter (for Julia, Python and R) aims to reuse the same WIDE for all these interpreted languages and not just Python
  • 11. Jupyter Notebook It will take you through the jupyter notebook opening process If we chose the bundle installation, we can start the Jupyter notebook platform by clicking on the Jupyter Notebook icon installed by Anaconda in the start menu or on the desktop. The browser will immediately be launched displaying the Jupyter notebook homepage, whose URL is http://localhost:8888/tree. So we need to press the new button at the top right of the homepage. Then rename the notebook
  • 12. Jupyter notebook Let us begin by importing those toolboxes that we will need for our program. To execute just one cell, we press the button or click on Cell Run or press the keys Ctrl+Enter.
  • 13. The DataFrame Data Structure The key data structure in Pandas is the DataFrameobject. A DataFrame is basically a tabular data structure, with rows and columns. Rows have a specific index to access them, which can be any name or value. In Pandas, the columns are called Series, a special type of data, which in essence consists of a list of several values, where each value has an index. Therefore, the DataFrame data structure can be seen as a spreadsheet, but it is much more flexible
  • 14. Dataframe we will create a new cell by clicking Insert, Insert Cell Below or pressing the keysCtrl+B. Then, we write in the following code:
  • 15. We use the pandas DataFrame object constructor with a dictionary of lists as argument. The value of each entry in the dictionary is the name of the column, and the lists are their values. The DataFrame columns can be arranged at construction time by entering a keyword columns with a list of the names of the columns ordered as we want. If the column keyword is not present in the constructor, the columns will be arranged in alphabetical order.
  • 16. Now, if we execute this cell, the result will be a table like this:
  • 17. Reading files using Pandas For files to be read, ensure that the file is stored in the same directory as our notebook directory.
  • 18. Output of the file read
  • 19. The way to read CSV (or any other separated value, providing the separator character) files in Pandas is by calling the read_csv method. Besides the name of the file, we add the na_values key argument to this method along with the character that represents “non available data” in the file. Normally, CSV files have a header with the names of the columns. If this is the case, we can use the usecols parameter to select which columns in the file will be used.
  • 20. Pandas also has functions for reading files with formats such as Excel, HDF5, tabulated files, or even the content from the clipboard (read_excel(), read_hdf (), read_table (), read_clipboard ()). Whichever function we use, the result of reading a file is stored as a DataFrame structure. we can use thehead()method, which shows just the first five rows. If we use a number as an argument to this method, this will be the number of rows that will be listed:
  • 23. Describe() method Used to show a quick statistical information on all the numeric column in a DataFrame. The result shows the count, the mean, the standard deviation, the minimum and maximum, and the percentiles, by default, the 25th, 50th, and 75th, for all the values in each column or series.
  • 24.
  • 25. Selecting Data If we want to select a subset of data from a DataFrame, it is necessary to indicate this subset using square brackets ([]) after the DataFrame. Edu[‘value’] for selecting columns Edu[10:14] for selecting rows Edu.ix[90:94, [‘time’,’geo’]] for selecting a subset of rows and columns.
  • 26. Filtering Data Another way to select a subset of data is by applying Boolean indexing. This indexing is commonly known as a filter. For instance, if we want to filter those values less than or equal to 6.5, we can do it like this: Edu[edu[‘value’]>6.5].tail()
  • 27. The following Boolean operators can be used for filtering:< (less than),<=(less than or equal to),>(greater than),>=(greater than or equal to),=(equal to), and!=(not equal to)
  • 28. Filtering missing values Pandas uses the special value NaN(not a number) to represent missing values. In Python, NaN is a special floating-point value returned by certain operations when one of their results ends in an undefined value. A subtle feature of NaN values is that two NaN are never equal. Because of this, the only safe way to tell whether a value is missing in a DataFrame is by using the isnull()function. Indeed, this function can be used to filter rows with missing values
  • 30. List of Aggregation functions
  • 31. Manipulating Data We can check the maximum and minimum values using .min(axis=0 or 1) or .max(axis=0 or 1) function. Print “Pandas max function:”, edu[‘value’].max() we can apply any binary arithmetical operation (+,-,*,/) to an entire row in a column. E.g s =edu[‘value’]/100 we apply the sqrt function from the NumPy library to perform the square root of each value in the Value column. E.g s=edu[‘value’].apply(np.sqrt)
  • 32. If we need to design a specific function to apply it, we can write an in-line function, commonly known as a λ-function. A λ-function is a function without a name. It is only necessary to specify the parameters it receives, between the lambda keyword and the colon (:). In the next example, only one parameter is needed, which will be the value of each element in the Value column.
  • 33. Another basic manipulation operation is to set new values in our DataFrame. This can be done directly using the assign operator (=) over a DataFrame. Now, if we want to remove this column from the DataFrame, we can use the drop function; this removes the indicated rows if axis=0, or the indicated columns if axis=1
  • 34. Manipulating Data oEdu.drop(‘valueNorm’, axis=1, inplace =True) oUse the append function to add a new row at the buttom of the DataFrame. oEdu =edu.append({“time”:2000, “Value”:5.00,”Geo”:’a’} ignore_index=True) oDropping a row: edu.drop(max(edu.index).axus=0, inplace=True) oThe drop function is also used to remove missing values by applying it over the isnull() function oeduDrop = edu.drop(edu[“value”].isnull(), axis=0) oeduDrop.head()
  • 35. Sorting Another important functionality we will need when inspecting our data is to sort by columns. For sorting in ascending order we use the code For sorting in descending order we use this code
  • 36. Grouping Data Pandas groupby () function allows us to group data. To have a proper DataFrame as a result, it is necessary to apply an aggregation function. Thus, this function will be applied to all the values in the same group.
  • 37. Rearranging Data We can transform the arrangement of our data, redistributing the indexes and columns for better manipulation of our data, which normally leads to better performance. We can rearrange our data using the pivot_table function. Here, we can specify which columns will be the new indexes, the new values, and the new columns.
  • 38. Now we can use the new index to select specific rows by label, using the ix operator:
  • 39. Ranking Data Another useful visualization feature is to rank data. For example, we would like to know how each country is ranked by year. To see this, we will use the pandas rank function. But first, we need to clean up our previous pivoted table a bit so that it only has real countries with real data. To do this, first we drop the Euro area entries and shorten the Germany name entry, using there name function and then we drop all the rows containing any NaN, using the dropna function.
  • 41. Plotting Data Pandas Dataframes and series can be plotted using the plot() function which uses the library for graphics, Matplotlib
  • 42. Plotting Data It is also possible to plot a DataFrame directly. In this case, each column is treated as a separated Series. For example, instead of printing the accumulated value over the years, we can plot the value for each year.