Toolboxes
for
Data Scientists
Sudipto Krishna Dutta
20204021
Introduction to Data Science
Jahangirnagar University
Introduction
• A toolbox is a collection of built-in functions that are stored together
and ready to be called. Toolboxes help data scientists and programmers
perform tasks efficiently, and choosing the right toolbox can save a lot
of time when a task has to be finished by a deadline. Toolboxes also
improve the overall performance of tasks such as analysing a big data set
and calculating a desired result. For example, if we want to calculate the
correlation coefficient, writing every step by hand quickly becomes
impractical for a big data set. A toolbox lets us call built-in functions
to perform the task effectively, and several toolboxes can be used side by
side in the same program.
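As a minimal sketch of the correlation example, assuming NumPy is installed, one built-in call replaces the hand-written formula. The sample values are made up for illustration only.

```python
import numpy as np

# Made-up sample data (illustrative only): hours studied and exam scores.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 55.0, 61.0, 68.0, 70.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation coefficient of the two arrays.
r = np.corrcoef(hours, scores)[0, 1]

# The same quantity written out by hand, for comparison.
dx = hours - hours.mean()
dy = scores - scores.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(r, 4), round(r_manual, 4))
```

Both values agree; the toolbox call is shorter and less error-prone than the manual version.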
Different Tools and Their Benefits
There are different kinds of tools, and using them is very beneficial for a data scientist. Toolboxes
include:
• Statistical Tool
R/Python – used for statistical analysis.
Measures such as mean, median, mode and standard deviation belong to the statistical tools.
• Mathematical Tool
SAS – strong data analysis abilities, data management, data encryption.
Matlab – a numeric computing environment with a powerful graphics library that can process
complex mathematical operations.
• Database Tool
Apache Cassandra – an open-source, highly scalable NoSQL database for managing
massive amounts of data quickly.
SQL – very popular and widely used; in data science it is recommended for its (i)
flexibility, (ii) ease of use, (iii) lack of redundancy and (iv) reliability.
Statistical tools alone do not make up data science: it also needs mathematical
functions, and databases to read and write data. Data science is the combination of all
three. Toolboxes offer a data scientist many benefits. Some of the tasks that can be done
with a toolbox are:
• Big data analysis
• Handling massive volumes of data
• Collecting large-scale data sets
• Building a structure for operational data
• Finding patterns and deriving valuable insights from a chosen data set.
Advantages of Toolboxes over Other Programming Languages
and Similarities among Them
• Many programming languages, such as C, Fortran, C++ and Java, are generally
used for developing high-performance production code or for prototyping a task or
project. The problem is that many basic tools are not available in those languages and have
to be re-implemented again and again. Toolboxes have the following advantages over
such languages:
• They provide a number of built-in functions that can be used anywhere in the code just by
calling them.
• There is no need to write specific code for each task, because the task can be
performed by calling the function that is stored in the toolbox.
• Introducing a new function into a task does not require re-implementing anything.
• They are easy to use, and all the basic functions are available.
• Toolboxes and programming languages are also similar in how they work. A toolbox holds a
collection of built-in functions, and a programming language has its own functions once they
are declared in the code. With a toolbox we call the built-in function; in a plain
programming language we first have to write the function that completes the same task.
Both can be used for developing high-performance production code, for prototyping and
for building data structures. Both support object-oriented programming, and both have
basic statements for functional programming in their core libraries.
Why Is Python the Best Choice?
• Python is a widely used and very popular programming language. It suits
people who are new to programming, and even those who have never
programmed. Python also has the features needed to do data science tasks
effectively. Data science is not only about statistical functions; it also
needs mathematical and database functions, and the Python toolboxes cover
all three. That is a major reason to choose Python. Its other remarkable
properties are easy-to-read code, suppression of non-mandatory delimiters,
dynamic typing and dynamic memory usage. Python is an interpreted
language, so code is executed immediately in the Python console, and
consoles such as IPython give us a richer environment in which to execute
Python code. Flexibility is another reason for choosing Python: it can be
seen as a multiparadigm language. It supports the object-oriented
paradigm, it has basic statements for functional programming in its own
core library, and it can be combined with other languages: for example, C
code can be mixed with Python code using Cython. The large ecosystem is
another major reason for choosing Python.
Python Libraries for Data Scientists and Their Uses
• The Python community offers a huge number of well-developed toolboxes,
and most of them can be used for data science. The most popular Python
toolboxes for a data scientist are:
• NumPy
• SciPy
• Pandas
• Scikit-Learn
NumPy and SciPy
• NumPy is the basic computing toolbox and provides many kinds of
operations on arrays. SciPy builds on it with domain-specific
toolboxes. Together they cover mathematical and statistical tools.
• NumPy supports scientific computing with Python.
• It provides multidimensional arrays with basic operations on them.
• It is very useful for linear algebra.
• Several toolboxes use the NumPy array representation as an
efficient basic data structure.
• SciPy provides a collection of numeric algorithms and domain-specific
toolboxes.
• SciPy covers signal processing, optimization and statistical tasks.
• Matplotlib, the plotting library of the SciPy ecosystem, has many tools
for data visualization.
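The following is a small sketch of the NumPy points above, assuming NumPy is installed: a multidimensional array, an operation applied to the whole array at once, and a linear algebra call. The matrix and vector are arbitrary illustrative values.

```python
import numpy as np

# A 2x2 multidimensional array and a vector (arbitrary values).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Basic operations apply to the whole array at once.
doubled = A * 2

# Linear algebra: solve the system A @ x = b for x.
x = np.linalg.solve(A, b)

print(A.shape)
print(x)
```

The array is the common data structure here: the same `A` could be handed to SciPy or Pandas routines without conversion.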
Scikit-Learn
• It is a machine learning library built on NumPy, SciPy and
Matplotlib.
• It offers simple and efficient tools for common tasks in data
analysis, such as:
• Classification
• Regression
• Clustering
• Dimensionality reduction
• Model selection
• Preprocessing
Pandas
Pandas has both statistical and database tools, and it provides high performance,
a range of tools and several key features.
• It provides high-performance data structures and data analysis tools.
• Its key feature is a fast and efficient DataFrame object for data
manipulation with integrated indexing.
• The DataFrame can be seen as a spreadsheet, which makes it very flexible.
• In Pandas we can easily transform a dataset in the way we want, for example by
reshaping it or by adding or removing columns or rows.
• It provides high-performance functions for aggregating, merging and joining
datasets.
• Pandas also has tools for importing and exporting data in different formats, such as:
• CSV
• Microsoft Excel
• SQL databases
• the fast HDF5 format.
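A short sketch of the features listed above, assuming Pandas is installed: adding a column, joining two tables and aggregating the result. The two tables and their column names are made up for illustration.

```python
import pandas as pd

# Two small made-up tables (illustrative only).
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "amount": [10, 20, 5, 15],
})
stores = pd.DataFrame({
    "store": ["A", "B"],
    "city": ["Dhaka", "Khulna"],
})

# Add a column derived from an existing one.
sales["amount_with_tax"] = sales["amount"] * 1.1

# Join the two tables on their common column.
merged = sales.merge(stores, on="store")

# Aggregate: total amount per city.
totals = merged.groupby("city")["amount"].sum()
print(totals)
```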
Data Science Ecosystem
• After choosing Python, we can set up a data science Python
ecosystem either by installing individual toolboxes or by
performing a bundle installation with all the needed toolboxes.
With individual installation, the toolboxes mentioned above have
to be installed one by one, in the right order, for Python 2.X or
Python 3.X.
• If a bundle installation is chosen, the Anaconda Python
distribution is a good option. Anaconda integrates all the Python
toolboxes and applications a data scientist needs into a single
directory, without mixing them with other Python toolboxes
installed on the machine. It contains not only core toolboxes and
applications such as NumPy, Pandas, SciPy, Matplotlib,
Scikit-Learn, IPython and Spyder, but also more specific tools for
related tasks such as data visualization, code optimization and
big data processing.
IDE (Integrated Development Environment)
• An integrated development environment is software, and it is an essential tool
for data scientists. IDEs are designed to serve the different needs of data scientists
and programmers, and over the years they have evolved to make coding less
complicated. Choosing the right IDE is crucial for each person and, unfortunately,
there is no “one size fits all” programming environment. The best solution is to try
the most popular IDEs. The basic pieces of an IDE are the editor, the compiler and
the debugger. Some IDEs can be used with multiple programming languages through
language-specific plugins, such as NetBeans or Eclipse.
• In the case of Python there are a large number of specific IDEs, both commercial,
such as PyCharm and WingIDE, and open source. The open-source community
helps IDEs to spring up, so anyone can customize their own environment and
share it with the rest of the community. For example, Spyder (the Scientific
Python Development Environment) is an IDE customized with the tasks of the data
scientist in mind.
WIDE (Web Integrated Development Environment) - Jupyter
• With the growth of web applications, a new generation of IDEs for
interactive languages such as Python has been developed. Their sessions are
called notebooks, and they are used not only in classrooms but also
to show results in presentations or on business dashboards. The Jupyter
Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations and
explanatory text. Uses include data cleaning and transformation,
numerical simulation, statistical modeling, machine learning and much
more. The recent spread of such notebooks is mainly due to IPython. Since
December 2011, IPython has been issued as a browser version of its
interactive console, called IPython notebook, which shows the Python
execution results very clearly and concisely by means of cells. Cells can
contain content other than code. For example, markdown cells can be
added to introduce algorithms. It is also possible to insert Matplotlib
graphics, or even web pages, into a notebook to illustrate examples.
IPython notebook has since been separated from the IPython software and
has become part of a larger project, Jupyter (for Julia, Python and R),
which aims to reuse the same WIDE for all these interpreted languages
and not just Python. Old IPython notebooks are automatically imported to
the new version when they are opened with the Jupyter platform.
Python, Used in Data Science
• So far we have covered the Python ecosystem, the toolboxes it contains,
the interactive IDEs available in that environment and their wide
uses.
The Jupyter Notebook Environment
• We now turn to the Jupyter Notebook environment. We can start by
launching the Jupyter notebook platform, which is done by typing the following
command in a terminal or command line:
$ jupyter notebook
• If we chose the bundle installation, we can instead start the Jupyter notebook
platform by clicking on the Jupyter Notebook icon installed by Anaconda in the
start menu or on the desktop. If we use the command line, the root directory is
the directory from which we launched the Jupyter notebook. If we use
the Anaconda launcher, the root directory is the current user directory. To
start a new notebook, we only need to press the
New Notebook Python 2
button at the top right of the home page.
• We begin by importing the toolboxes that we will need for our program. In the
first cell we put the code to import the Pandas library as pd. This is for
convenience; every time we need to use some functionality from the Pandas library,
we will write pd instead of pandas. We will also import the two core libraries
mentioned above: the numpy library as np and the matplotlib library as plt.
• The commands to write are:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• To execute just one cell, we press the run button, click Cell ->
Run, or press the keys Ctrl + Enter. While execution is underway, the header of the
cell shows the * mark:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
• While a cell is being executed, no other cell can be executed. If we try to execute
another cell, its execution will not start until the first cell has finished its execution.
Once the execution is finished, the * in the header of the cell is replaced by the next
execution number. Since this will be the first cell executed, the number shown
will be 1. If the libraries are imported correctly, no output cell is
produced.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
The DataFrame Data Structure
• The key data structure in Pandas is the DataFrame object. A DataFrame is basically a
tabular data structure, with rows and columns. Rows have a specific index to
access them, which can be any name or value. In Pandas, the columns are
called Series, a special type of data, which in essence consists of a list of
several values, where each value has an index. Therefore, the DataFrame data
structure can be seen as a spreadsheet, but it is much more flexible. To
understand how it works, let us start with its building block and create a
Series, first from a Python list and then from a dictionary. First, we create a
new cell by clicking Insert -> Insert Cell Below or pressing the keys Ctrl+M, then B.
For example, the following code:
import pandas as pd
# a simple list of integers
values = [1, 2, 3, 4, 5]
# create a Series from the list
res = pd.Series(values)
print(res)
the result will be like:
0    1
1    2
2    3
3    4
4    5
dtype: int64
import pandas as pd
# a dictionary: its keys become the index of the Series
dic = {'Id': 1013, 'Name': 'Sudipto', 'State': 'Khulna', 'Age': 27}
res = pd.Series(dic)
print(res)
the result will be like:
Id          1013
Name     Sudipto
State     Khulna
Age           27
dtype: object
Apart from creating DataFrame data structures, Pandas offers a lot of functions to manipulate them. Among
other things, it offers functions for aggregation, manipulation, and transformation of the data. In the
following sections, we will introduce some of these functions.
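To connect Series and DataFrame, here is a minimal sketch of building a DataFrame from a Python dictionary of lists, assuming Pandas is installed. The records are made up for illustration; each key becomes a column and each column is itself a Series.

```python
import pandas as pd

# Made-up records (illustrative only): each key becomes a column,
# each list holds that column's values.
data = {
    "Name": ["Sudipto", "Rahim", "Karim"],
    "State": ["Khulna", "Dhaka", "Sylhet"],
    "Age": [27, 31, 24],
}
df = pd.DataFrame(data)

# Every column of the DataFrame is a Series sharing the same row index.
ages = df["Age"]

print(df)
print(type(ages).__name__)
```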
Data Analysis Example Using Pandas
• To use Pandas on a simple real problem, we will start by doing some basic
analysis of a data set. For the sake of transparency, we work with open data,
meaning data that can be freely used, reused, and distributed by anyone.
• Pandas is a Python library that provides extensive means for data analysis. Data
scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas
makes it very convenient to load, process, and analyze such tabular data using
SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a
wide range of opportunities for visual analysis of tabular data.
• The main data structures in Pandas are implemented
with the Series and DataFrame classes. The former is a one-dimensional indexed
array of some fixed data type. The latter is a two-dimensional data structure -
a table - where each column contains data of the same type. You can see it as
a dictionary of Series instances. DataFrames are great for representing real
data: rows correspond to instances (examples, observations, etc.), and
columns correspond to features of these instances.
import numpy as np
import pandas as pd
# show floating-point values with 2 decimal places
pd.set_option("display.precision", 2)
• We will demonstrate the main methods in action by analyzing a dataset on the churn
rate of telecom operator clients. Let's read the data and take a look at the first 5 lines using
the head method:
df = pd.read_csv("../input/telecom_churn.csv")
df.head()
• Recall that each row corresponds to
one client, an instance, and columns are features of this instance.
print(df.shape)
(3333, 20)
• From the output, we can see that the table contains 3333 rows and 20 columns.
To print out the column names, we use columns:
print(df.columns)
• We can use the info() method to output some general information about the DataFrame.
df.info()
• We see that one feature is logical (bool), 3 features are of type object, and
16 features are numeric. With this same method, we can easily see if
there are any missing values. Here, there are none because each column
contains 3333 observations, the same number of rows we saw before
with shape.
• We can change the column type with the astype method. Let's apply this to
the Churn feature to convert it into int64:
df["Churn"] = df["Churn"].astype("int64")
• The describe method shows basic statistical characteristics of each numeric
feature (int64 and float64 types): number of non-missing values, mean,
standard deviation, range, median, 0.25 and 0.75 quartiles.
df.describe()
• In order to see statistics on non-numerical features, one has to explicitly
indicate data types of interest in the include parameter.
df.describe(include=["object", "bool"])
• To delete columns or rows, use the drop method, passing the required indexes
and the axis parameter (1 if you delete columns, and nothing or 0 if you delete
rows). The inplace argument tells whether to change the original DataFrame.
With inplace=False, the drop method doesn't change the existing DataFrame
and returns a new one with dropped rows or columns. With inplace=True, it
alters the DataFrame.
# get rid of the "Total charge" and "Total calls" columns (they must already exist in df)
df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
# and here is how you can delete rows
df.drop([1, 2]).head()
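The effect of inplace can be checked on a small stand-in table. This is only a sketch, assuming Pandas is installed; the rows below are made up and are not taken from the telecom churn file.

```python
import pandas as pd

# Small made-up stand-in for the churn table (illustrative only).
df = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "OK"],
    "Total day calls": [110, 123, 114, 71],
    "Churn": [False, False, True, False],
})

# astype converts the boolean column to integers.
df["Churn"] = df["Churn"].astype("int64")

# inplace=False (the default): df is untouched, a new DataFrame is returned.
without_rows = df.drop([1, 2])

# inplace=True: df itself loses the column and the call returns None.
df.drop(["Total day calls"], axis=1, inplace=True)

print(without_rows.shape, df.shape)
```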
Reading Data
• We now read the data that we downloaded. First of all, we have to create a new notebook called
Open Government Data Analysis and open it. Then, after ensuring that the educ_figdp_1_Data.csv
file is stored at the path used below, relative to our notebook directory, we write the following code to
read and show the content:
edu = pd.read_csv('files/ch02/educ_figdp_1_Data.csv', na_values=':',
                  usecols=["TIME", "GEO", "Value"])
edu
• Besides this, Pandas also has functions for reading files with formats such as Excel, HDF5, tabulated
files, or even the content from the clipboard
(read_excel(), read_hdf(), read_table(), read_clipboard()).
• If we want to know the names of the columns or the names of the indexes, we can use the
DataFrame attributes columns and index respectively. The names of the columns or indexes can
be changed by assigning a new list of the same length to these attributes. The values of any
DataFrame can be retrieved as an array by calling its values attribute. If we just want quick
statistical information on all the numeric columns in a DataFrame, we can use the function
describe(). The result shows the count, the mean, the standard deviation, the minimum and
maximum, and the percentiles, by default, the 25th, 50th, and 75th, for all the values in each
column or series.
edu.describe()
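Since the CSV file itself is not included here, the following sketch reads a few made-up lines from an in-memory string instead, assuming Pandas is installed. It shows what na_values and usecols do; the rows and the extra Flag column are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV content (illustrative only); ':' marks a missing value.
csv_text = """TIME,GEO,Value,Flag
2000,Spain,4.28,
2001,Spain,:,
2002,Spain,4.25,
"""

# na_values=':' turns the ':' marker into NaN;
# usecols keeps only the listed columns.
edu_demo = pd.read_csv(io.StringIO(csv_text), na_values=":",
                       usecols=["TIME", "GEO", "Value"])

print(edu_demo.columns.tolist())
print(edu_demo["Value"].isnull().sum())

# describe() summarises the numeric columns and ignores missing values.
stats = edu_demo.describe()
print(stats)
```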
Selecting Data
• If we want to select a subset of data from a DataFrame, it is necessary to
indicate this subset using square brackets ([ ]) after the DataFrame. The
subset can be specified in several ways. If we want to select only one
column from a DataFrame, we only need to put its name between the
square brackets. The result will be a Series data structure, not a
DataFrame, because only one column is retrieved.
edu['Value']
• If we want to select a subset of rows from a DataFrame, we can do so by
indicating a range of rows separated by a colon (:) inside the square
brackets. This is commonly known as a slice of rows:
edu[10:14]
• The last row of the output, for example, reads:
13 2001 European Union (27 countries) 4.99
• This instruction returns the
slice of rows from the 10th to the 13th position. Note that the slice does
not use the index labels as references, but the position. In this case, the
labels of the rows simply coincide with the position of the rows. If we want
to select a subset of columns and rows using the labels as our references
instead of the positions, we can use loc indexing (older Pandas versions
offered ix for this, which has since been removed):
edu.loc[90:94, ['TIME', 'GEO']]
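The difference between positions and labels is easier to see when the two do not coincide. This is a small sketch assuming a recent Pandas version, where label-based selection is written with loc and positional selection with iloc; the table is made up.

```python
import pandas as pd

# Made-up table whose row labels (90..93) differ from row positions (0..3).
edu_demo = pd.DataFrame(
    {"TIME": [2009, 2010, 2011, 2012],
     "Value": [5.0, 5.1, 4.9, 5.2]},
    index=[90, 91, 92, 93],
)

# A plain slice works by position and excludes the end position.
by_position = edu_demo[1:3]

# iloc is the explicit positional form of the same selection.
same_by_iloc = edu_demo.iloc[1:3]

# loc works by label and includes the end label.
by_label = edu_demo.loc[90:92, ["TIME", "Value"]]

print(by_position.index.tolist())
print(by_label.index.tolist())
```

Note the asymmetry: the positional slice stops before position 3, while the label slice includes label 92.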
Filtering Data
• Another way to select a subset of data is by applying Boolean
indexing. This indexing is commonly known as a filter. For
instance, if we want to keep only the values greater than
6.5, we can do it like this:
edu[edu['Value'] > 6.5].tail()
• The Boolean operation edu['Value'] > 6.5 produces a Boolean
mask. When an element in the “Value” column is greater than
6.5, the corresponding value in the mask is set to True,
otherwise it is set to False. Then, when this mask is applied as
an index in edu[edu['Value'] > 6.5], the result is a filtered
DataFrame containing only rows with values higher than 6.5.
Of course, any of the usual Boolean operators can be used for
filtering:
< (less than), <= (less than or equal to), > (greater than), >=
(greater than or equal to), == (equal to), and != (not equal to).
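A minimal sketch of a Boolean mask on a made-up table, assuming Pandas is installed. The country values are illustrative and not taken from the data set.

```python
import pandas as pd

# Made-up values (illustrative only).
edu_demo = pd.DataFrame({
    "GEO": ["Denmark", "Spain", "Sweden", "Italy"],
    "Value": [8.81, 4.82, 6.82, 4.29],
})

# The comparison produces a Series of True/False values: the mask.
mask = edu_demo["Value"] > 6.5

# Indexing with the mask keeps only the rows where it is True.
filtered = edu_demo[mask]

# Conditions can be combined with & (and) and | (or);
# each condition needs its own parentheses.
combined = edu_demo[(edu_demo["Value"] >= 4.5) & (edu_demo["Value"] != 6.82)]

print(mask.tolist())
print(filtered["GEO"].tolist())
print(combined["GEO"].tolist())
```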
Filtering Missing Values
• Pandas uses the special value NaN (not a number) to represent missing
values. In Python, NaN is a special floating-point value returned by certain
operations when one of their results ends in an undefined value. A subtle
feature of NaN values is that two NaN are never equal. Because of this,
the only safe way to tell whether a value is missing in a DataFrame is by
using the isnull() function. Indeed, this function can be used to filter rows
with missing values:
edu[edu["Value"].isnull()].head()
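The sketch below, assuming NumPy and Pandas are installed, shows why an equality test cannot find missing values and how isnull() does. The Series values are made up.

```python
import numpy as np
import pandas as pd

nan = float("nan")
print(nan == nan)   # False: NaN is not equal even to itself

# Made-up Series with two missing entries.
values = pd.Series([np.nan, 5.0, np.nan, 4.95])

# Comparing with == finds nothing, because NaN == NaN is False.
found_by_equality = (values == np.nan).sum()

# isnull() marks the missing entries correctly.
missing = values.isnull()

print(found_by_equality)
print(values[missing].index.tolist())
print(values[~missing].tolist())
```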
Manipulating Data
• To manipulate data we need to know how to select the desired data. One of the
most straightforward things we can do is to operate with columns or rows using
aggregation functions. If a function is applied to a DataFrame or a selection of
rows and columns, then you can specify if the function should be applied to the
rows for each column (setting the axis=0 keyword on the invocation of the
function), or it should be applied on the columns for each row (setting the axis=1
keyword on the invocation of the function).
edu.max(axis=0)
• Note that these are functions specific to Pandas, not the generic Python functions.
There are differences in their implementation. In Python, NaN values propagate
through all operations without raising an exception. In contrast, Pandas operations
exclude NaN values representing missing data. For example, the Pandas max
function excludes NaN values, thus they are interpreted as missing values, while
the standard Python max function will take the mathematical interpretation of
NaN and return it as the maximum:
Input:
print("Pandas max function:", edu['Value'].max())
print("Python max function:", max(edu['Value']))
Output:
Pandas max function: 8.81
Python max function: nan
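Both points can be tried on made-up data, assuming NumPy and Pandas are installed. One caveat worth knowing: the built-in max returns NaN in this sketch because the NaN comes first and every comparison against NaN is False; with NaN in another position its result can differ.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value.
table = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

col_max = table.max(axis=0)   # one result per column
row_max = table.max(axis=1)   # one result per row

# Made-up Series that starts with a missing value.
nan_first = pd.Series([np.nan, 2.0, 8.81])

pandas_max = nan_first.max()   # Pandas skips the NaN
python_max = max(nan_first)    # built-in max keeps the leading NaN

print(col_max.tolist(), row_max.tolist())
print(pandas_max, python_max)
```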
• Besides these aggregation functions, we can apply operations over
all the values in rows, columns or a selection of both. The rule of
thumb is that an operation between columns means that it is
applied to each row in that column and an operation between rows
means that it is applied to each column in that row. For example we
can apply any binary arithmetical operation (+, -, *, /) to an entire
column:
Input:
s = edu["Value"] / 100
s.head()
Output:
0       NaN
1       NaN
2    0.0500
3    0.0503
4    0.0495
Name: Value, dtype: float64
Sorting
• An important functionality we will need when inspecting our data is
sorting by columns. We can sort a DataFrame using any column, using the
sort_values function. If we want to see the first five rows of data sorted in
descending order (i.e., from the largest to the smallest values) and using
the Value column, then we just need to do this:
edu.sort_values(by='Value', ascending=False, inplace=True)
edu.head()
• Note that the inplace keyword means that the DataFrame will be overwritten,
and hence no new DataFrame is returned. If instead of ascending=False
we use ascending=True, the values are sorted in ascending order (i.e.,
from the smallest to the largest values). If we want to return to the
original order, we can sort by an index using the sort_index function and
specifying axis=0:
edu.sort_index(axis=0, ascending=True, inplace=True)
edu.head()
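A small sketch of the round trip, assuming Pandas is installed and using made-up values. Here inplace is left at its default so that the original table stays available for comparison.

```python
import pandas as pd

# Made-up values (illustrative only).
edu_demo = pd.DataFrame({
    "GEO": ["Spain", "Denmark", "Italy"],
    "Value": [4.82, 8.81, 4.29],
})

# Without inplace=True, sort_values returns a new, sorted DataFrame.
sorted_desc = edu_demo.sort_values(by="Value", ascending=False)

# The row labels travel with the rows, so sorting by index
# restores the original order.
restored = sorted_desc.sort_index(axis=0, ascending=True)

print(sorted_desc.index.tolist())
print(restored.equals(edu_demo))
```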
Ranking Data
• In statistics, “ranking” refers to the data transformation in which numerical or ordinal values are replaced by
their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of
these data items would be 2, 3, 1 and 4 respectively.
• Now we can perform the ranking using the rank function. Note here that the parameter ascending=False makes the
ranking go from the highest values to the lowest values. The Pandas rank function supports different tie-breaking
methods, specified with the method parameter. In our case, we use the first method, in which ranks are assigned in
the order they appear in the array, avoiding gaps between ranking.
# pivedu holds one row per country and one column per year
# remove the aggregated regions so that only countries are ranked
pivedu = pivedu.drop([
    'Euro area (13 countries)',
    'Euro area (15 countries)',
    'Euro area (17 countries)',
    'Euro area (18 countries)',
    'European Union (25 countries)',
    'European Union (27 countries)',
    'European Union (28 countries)'
    ],
    axis=0)
pivedu = pivedu.rename(index={'Germany (until 1990 former territory of the FRG)': 'Germany'})
# drop rows with missing values before ranking
pivedu = pivedu.dropna()
pivedu.rank(ascending=False, method='first').head()
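The numbers from the definition above can be checked directly. This is a minimal sketch assuming Pandas is installed; the second Series with tied values is made up to show how the method parameter changes the result.

```python
import pandas as pd

# The example from the text: ranks should be 2, 3, 1 and 4.
values = pd.Series([3.4, 5.1, 2.6, 7.3])
ranks = values.rank()

# ascending=False ranks from the highest value to the lowest.
ranks_desc = values.rank(ascending=False)

# Made-up Series with a tie, to compare tie-breaking methods.
tied = pd.Series([10, 20, 20, 30])
rank_first = tied.rank(method="first")   # ties ranked in order of appearance
rank_dense = tied.rank(method="dense")   # ties share a rank, no gaps after them

print(ranks.tolist())
print(ranks_desc.tolist())
print(rank_first.tolist(), rank_dense.tolist())
```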
• If we want to make a global ranking taking into account all the years, we can sum up all the columns and rank
the result. Then we can sort the resulting values to retrieve the top five countries for the last 6 years, in this
way:
totalSum = pivedu.sum(axis=1)
totalSum.rank(ascending=False, method='dense').sort_values().head()
Plotting
• Pandas DataFrames and Series can be plotted using the plot function, which uses
the graphics library Matplotlib. For example, if we want to plot the accumulated
values for each country over the last 6 years, we can take the Series obtained in the
previous example and plot it directly by calling the plot function as shown in the
next cell:
totalSum = pivedu.sum(axis=1).sort_values(ascending=False)
totalSum.plot(kind='bar', style='b', alpha=0.4, title="Total Values for Country")
• It is also possible to plot a DataFrame directly. In this case, each column is treated
as a separate Series. For example, instead of printing the accumulated value over
the years, we can plot the value for each year.
my_colors = ['b', 'r', 'g', 'y', 'm', 'c']
ax = pivedu.plot(kind='barh',
                 stacked=True,
                 color=my_colors)
ax.legend(loc='center left', bbox_to_anchor=(1, .5))
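Because pivedu is not built in these slides, the sketch below uses a small made-up table of the same shape (countries as rows, years as columns) so that both plots can be tried. It assumes Pandas and Matplotlib are installed; the Agg backend is selected only so that the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import pandas as pd

# Made-up stand-in for pivedu (illustrative values only).
pivot_demo = pd.DataFrame(
    {2010: [8.8, 5.0, 4.5], 2011: [8.7, 4.8, 4.3]},
    index=["Denmark", "Spain", "Italy"],
)

# Plot a Series: accumulated value per country, largest first.
total = pivot_demo.sum(axis=1).sort_values(ascending=False)
ax = total.plot(kind="bar", alpha=0.4, title="Total Values for Country")

# Plot a DataFrame: one stacked segment per year column.
ax2 = pivot_demo.plot(kind="barh", stacked=True)
legend = ax2.legend(loc="center left", bbox_to_anchor=(1, 0.5))

print(total.index.tolist())
```

In a notebook the figures appear below the cell; in a script they can be written to disk with the figure's savefig method.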
Why a Toolbox Is an Improved Version of a Sub-Functional Language
• Toolboxes offer features beyond those of other programming languages
and can be seen as an updated and improved version of any
sub-functional language, because a toolbox packages the functions
we need and makes them available whenever they are needed. Other
programming languages do not offer all of those features in a
package like a toolbox. We can call the built-in functions stored
in a toolbox anywhere in the program without running into
complications or errors. A sub-functional programming language
does not offer that kind of built-in function; there we need to
declare the function ourselves before using it, and such
hand-declared functions sometimes produce errors such as missing
arguments or functional errors. Considering all this, we can
conclude that a toolbox is an improved version of other
programming or sub-functional languages.
Conclusion
• Data science is like the sea, and the tools that data scientists use are like
the elements in the sea water. To handle such a massive task we need a
complete package that runs efficiently, and toolboxes are the smart way
to handle it: they help data scientists work more efficiently while
keeping performance in mind. The Python ecosystem has all of these
things and offers a complete package that lets a data scientist carry
out any data science project efficiently.

Toolboxes for data scientists

  • 1. Toolboxes for Data Scientists Sudipto Krishna Dutta 20204021 Introduction to Data Science Jahangirnagar University
  • 2. Introduction  Toolbox is a box, where many different built in functions arestored. Toolbox helps to perform a task efficiently and successfully, especially for data scientist and programmer. Choosing right toolbox can save a lot of time for doing a specific task within the targeted time. Using toolboxes also help to enhance the overall performance of any kind of task like data analysis from big data set and calculating the desired result. For example, if we want to calculate the co-relation coefficient, it is impossible to use a single piece of written code to handle a big data set or extract desired information from this. So, here toolbox can help to perform this task in an effective manner. Using toolbox we can call different built in functions to perform the desired task and to keep all this types of toolbox we can work simultaneously.
  • 3. Different Tool and their benefits There is different kind of tool and to use them is very beneficial for a data scientist. Toolboxes are like • Statistical Tool R/Python - is used for statistical analysis. Mean, Median, Mode, Standard Deviation are also in statistical tool. • Mathematical Tool SAS – Strong data analysis abilities, data management, data encryption Matlab – A numeric computing environment, Powerful graphics librabry, Can process complex mathematical operations. • Database Tool Apache Cassandra - is an open source and high scalable NoSQL database to manage massive amount of data in a faster manner. SQL – is a very popular and widely used but in data science it is recommended for it (i) Flexibility, (ii) Ease of use, (iii) No redundancy and (iv) Reliability In data science statistical toolbox is not the only toolbox for naming the data science it also need mathematical calculations/functions and database to read or write the task. So, all of them together can be called the data science. A lot of benefits lie in toolboxes for any data scientist. Here is some tasks, those can be done by using toolbox. Like • Big data analysis • Handling massive volume of data • Collection a large scale of data set • Building a structure for the operational data • Make a pattern and derive the valuable insights from chosen data set.
  • 4. Toolbox’s’ advantages over the other programming languages and Similarity among them  There are many programming language, like C, Fortran, C++,JAVA etc which are generally used for developing high-performance production or prototyping any kind of certain task or project but the problem is in those language many basic tools are not available or to re- implement those things again and again. So the advantages of toolboxes over the programming languages like  It has a number of built in function which can be use anywhere in the code by just calling them.  It is not needed to write specific code for the specific task by using toolbox because we can perform the needed task by calling the specific function which is stored in toolbox.  We can avoid re-complications of anything for introducing any kind of new function in the task.  Easy and all the basic functions are available in toolbox.  To find out the similarity among them we can identify some basic similarity in the working procedure. In toolbox, a collection of built in functions are there as well as in the programming language it has also owned some function by declaring them in the code. To generate a task by using toolbox we can call the specific built in function in the code and in programming language the needed function need to be written to complete the same task. If we consider the performance to generate the task we can find out some similarity among them. Both can be used for developing high performance production and prototyping and building a data structure. In environmental perspective both have similarity and both are supported object oriented programming. Both has basic statements for functional programming in its own core library.
  • 5. Why python is the best choice ? • Python is a widely used and very popular programming language to all. Even it has great properties for who is new to write computer program or even who never programmed. Though, python has the features for doing data science task more effectively and as we know that data science is not only about the statistical function it also owned the mathematical and database function to it. So, the combination of those three function we can call it data science and here in python tools we can see all of them. So, it is a major reason to choose python. Otherwise it has some most remarkable properties are easy to read code and has suppression of non-mandatory delimiters, dynamic typing and dynamic memory usage. The code is executed immediately in python console like IPython as Python has the ability to interpreter language. Which can give us a richer environment to execute python code? Flexibility is also the reason for choosing python. For this characteristic it can be seen as multiparadigm language. Among them it has the property to program with other languages and python also supports the object oriented paradigm and C programming language code can be mixed with python code and C code using cython. Python also has basic statements for functional programming in its own core library. Large Eco system is also another major reason for Choosing python.
  • 6. Python libraries for Data Scientist and theirs usages • Python community offers a huge number of developed toolboxes. This is very exciting that to know most of them can be used for data science. The most popular python toolboxes for any data scientist are  NumPy  SciPy  Pandas  Scikit- Learn
  • 7. NumPy and Scipy  NumPy is known as the basis of computing toolbox. It has served various kind of operational functions. Though SciPy is domain specific toolbox and it also has several functions. It has also statistical, mathematical and database tools.  NumPy is doing scientific computing with Python.  It provides multidimensional arrays with basic operations on them.  It is very useful in linear algebra function.  Several toolboxes use the NumPy array representation as an efficient basic data structure.  SciPy provides collection of numeric algorithms and domain specific toolboxes.  SciPy can process signal and optimization and handle statistical task.  SciPy is the plotting library Matplotlib and it has many tools for data visualization.
  • 8. SCIKIT-Learn • It is a machine learning library built from NumPy, SciPy and Matplotlib. • It offers simple and efficient tools for common tasks in data analysis such as,  Classification  Regression  Clustering  Dimensionality  Reduction  Model selection  Preprocessing
  • 9. Pandas Pandas have both statistical and database tools and it also provides hard performance, different type of tools and key features. • It provides high performance data structure and data analyzing tools. • It has a key feature to work fast and efficient dataframe object for data manipulation with integrated indexing. • The dataframe can be seen as spreadsheet which offers very flexibility. • In pandas we can easily transform any dataset in the way we want. • Reshaping, Adding or removing columns or rows. • Provides high performance functions for aggregating, merging and joining datasets. • Pandas also has tools for importing and exporting data from different formats, like  CSV  Microsoft Excel  SQL databases  Fast HDF5 format.
  • 10. Data Science Eco System • After choosing Python, we can set up a data scientist python ecosystem by individual toolboxes or to perform a bundle of installation with all needed toolboxes. For those who is new to here, It can be chosen to install the mentioned toolboxes like Python 2.X and Python 3.X , exactly in a order. • However if a bundle installation is chosen, the Anaconda python distribution is the good option. Because the Anaconda distribution provides integration of all the python toolboxes and applications needed for the data scientist into a single directory without mixing it with other python toolboxes installed on the machine. The toolboxes and applications such as NumPy, Pandas, SciPy, Matplotlib and Scikit-Learn, IPython, Spyder..etc but more specific tools for other related tasks such as data visualization, code, optimization and big data processing.
  • 11. IDE (Integrated Development Environments) • The integrated development environment is software and it is very essential tool for data scientist. IDEs is created to serve different purpose for the data scientist as well as the programmer. Thus, over the years this software has evolved in order to make the coding task less complicated. Selecting right IDEs for each person is very crucial and unfortunately there is no “one size fits all” programming environment. The best solution is to try the most popular IDE are the editor and the compiler and the debugger. Some IDEs can be used in multiple programming language and those provides by language specific plugins, such as NETBEANS or Eclips. • In the case of python there are a large number if specific IDEs, both commercial such as PyCharm and WingIDE and open source. The open source community helps IEDs to spring up, thus anyone can customize their own environment and share it with the rest if the community. For example Spyder (it is the Scientific Python Development Environment) is an IDE customized with the task of the data scientist in mind.
  • 12. Data Science Eco System • After choosing Python, we can set up a data scientist python ecosystem by individual toolboxes or to perform a bundle of installation with all needed toolboxes. For those who is new to here, It can be chosen to install the mentioned toolboxes like Python 2.X and Python 3.X , exactly in a order. • However if a bundle installation is chosen, the Anaconda python distribution is the good option. Because the Anaconda distribution provides integration of all the python toolboxes and applications needed for the data scientist into a single directory without mixing it with other python toolboxes installed on the machine. The toolboxes and applications such as NumPy, Pandas, SciPy, Matplotlib and Scikit-Learn, IPython, Spyder..etc but more specific tools for other related tasks such as data visualization, code, optimization and big data processing.
  • 13. IDE (Integrated Development Environments) • The integrated development environment is software and it is very essential tool for data scientist. IDEs is created to serve different purpose for the data scientist as well as the programmer. Thus, over the years this software has evolved in order to make the coding task less complicated. Selecting right IDEs for each person is very crucial and unfortunately there is no “one size fits all” programming environment. The best solution is to try the most popular IDE are the editor and the compiler and the debugger. Some IDEs can be used in multiple programming language and those provides by language specific plugins, such as NETBEANS or Eclips. • In the case of python there are a large number if specific IDEs, both commercial such as PyCharm and WingIDE and open source. The open source community helps IEDs to spring up, thus anyone can customize their own environment and share it with the rest if the community. For example Spyder (it is the Scientific Python Development Environment) is an IDE customized with the task of the data scientist in mind.
  • 14. WIDE (Web Integrated Development Environment): Jupyter • With the rise of web applications, a new generation of IDEs has been developed for interactive languages such as Python. Sessions in these web-based environments are now called notebooks, and they are used not only in classrooms but also to show results in presentations or on business dashboards. The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. The recent spread of such notebooks is mainly due to IPython. Since December 2011, IPython has been issued as a browser version of its interactive console, called IPython notebook, which shows the Python execution results very clearly and concisely by means of cells. Cells can contain content other than code; for example, markdown cells can be added to introduce algorithms. In a notebook it is also possible to insert Matplotlib graphics to illustrate examples, or even web pages. IPython notebook has since been separated from the IPython software and has become part of a larger project, Jupyter (for Julia, Python and R), which aims to reuse the same WIDE for all these interpreted languages and not just Python. All old IPython notebooks are automatically imported to the new version when they are opened with the Jupyter platform.
  • 15. Python, Used in Data Science • We have now seen the Python ecosystem, the toolboxes it contains, the interactive IDEs available in that environment, and how widely they are used.
  • 16. The Jupyter Notebook Environment • We now turn to the Jupyter Notebook environment. We can start by launching the Jupyter Notebook platform, which can be done by typing the following command in a terminal or command line: $ jupyter notebook • If we chose the bundle installation, we can instead start the platform by clicking the Jupyter Notebook icon installed by Anaconda in the start menu or on the desktop. If we use the command line, the root directory is the directory from which we launched Jupyter Notebook; if we use the Anaconda launcher, the root directory is the current user's directory. Now, to start a new notebook, we only need to press the New Notebook -> Python 2
  • 17. • button at the top right of the home page. Let us begin by importing the toolboxes that we will need for our program. In the first cell we put the code to import the Pandas library as pd. This is for convenience: every time we need some functionality from the Pandas library, we will write pd instead of pandas. We will also import the two core libraries mentioned above: the numpy library as np and the matplotlib plotting interface as plt. The commands to write are:
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
    • To execute just one cell, we press the run (play) button, click Cell -> Run, or press Ctrl + Enter. While execution is underway, the header of the cell shows the * mark.
  • 18. • While a cell is being executed, no other cell can be executed. If we try to execute another cell, its execution will not start until the first cell has finished. Once the execution is finished, the * in the header of the cell is replaced by the execution number. Since this is the first cell executed, the number shown will be 1. If the libraries are imported correctly, no output cell is produced.
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
  • 19. The DataFrame Data Structure • The key data structure in Pandas is the DataFrame object. A DataFrame is basically a tabular data structure, with rows and columns. Rows have a specific index to access them, which can be any name or value. In Pandas, the columns are called Series, a special type of data which in essence consists of a list of several values, where each value has an index. The DataFrame can therefore be seen as a spreadsheet, but it is much more flexible. To understand how it works, let us start with its building block and create a Series, first from a plain Python list and then from a dictionary. First, we create a new cell by clicking Insert -> Insert Cell Below or by pressing B in command mode. For example, the following code:
      import pandas as pd
      # a simple list of ints (named values so it does not shadow the built-in list)
      values = [1, 2, 3, 4, 5]
  • 20.
      # create a Series from the list of ints
      res = pd.Series(values)
      print(res)
    The result will be like:
      0    1
      1    2
      2    3
      3    4
      4    5
      dtype: int64
    A Series can also be created from a dictionary, in which case the keys become the index:
      import pandas as pd
      dic = {'Id': 1013, 'Name': 'Sudipto', 'State': 'Khulna', 'Age': 27}
      res = pd.Series(dic)
      print(res)
    The result will be like:
      Id          1013
      Name     Sudipto
      State     Khulna
      Age           27
      dtype: object
    • Apart from creating these data structures, Pandas offers many functions to manipulate them. Among other things, it offers functions for aggregation, manipulation and transformation of the data. In the following sections, we will introduce some of these functions.
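The description of a DataFrame as a table whose columns are Series can be made concrete with a small sketch that builds one from a dictionary of lists. The names and values below are made up for illustration (only "Sudipto" and "Khulna" echo the slide's example); each dictionary key becomes a column.

```python
import pandas as pd

# Hypothetical data: each key becomes a column name,
# each list holds the values of that column.
data = {
    "Name": ["Sudipto", "Anika", "Rahim"],
    "State": ["Khulna", "Dhaka", "Sylhet"],
    "Age": [27, 31, 24],
}
df = pd.DataFrame(data)

print(df.shape)   # (number of rows, number of columns)
ages = df["Age"]  # selecting a single column gives back a Series
print(ages)
```

Selecting one column returns a Series that shares the DataFrame's row index, which is why the two structures combine so easily.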
  • 21. Data Analysis Example Using Pandas • To see Pandas at work on a simple real problem, we will start by doing some basic analysis of a dataset. For the sake of transparency, data produced by government entities must be open, meaning that they can be freely used, reused and distributed by anyone. • Pandas is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like .csv, .tsv or .xlsx. Pandas makes it very convenient to load, process and analyze such tabular data using SQL-like queries. In conjunction with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of tabular data. • The main data structures in Pandas are implemented by the Series and DataFrame classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure, a table, where each column contains data of the same type. You can see it as a dictionary of Series instances. DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
      import numpy as np
      import pandas as pd
      pd.set_option("display.precision", 2)
  • 22. • We will demonstrate the main methods in action by analyzing a dataset on the churn rate of telecom operator clients. Let's read the data and take a look at the first 5 lines using the head method:
      df = pd.read_csv("../input/telecom_churn.csv")
      df.head()
    • Recall that each row corresponds to one client, an instance, and the columns are features of this instance.
      print(df.shape)
      (3333, 20)
    • From the output, we can see that the table contains 3333 rows and 20 columns. To print out the column names, we use columns:
      print(df.columns)
    • We can use the info() method to output some general information about the DataFrame. The method prints its summary itself, so it does not need to be wrapped in print:
      df.info()
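Since the telecom file itself is not included here, the inspection steps above can be tried on a tiny stand-in table. The column names and values in this sketch are assumptions chosen for illustration; only the method calls (head, shape, columns, info) come from the slide.

```python
import pandas as pd

# Hypothetical stand-in for the telecom data
# (the real file has 3333 rows and 20 columns).
df = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "OH"],
    "Total day minutes": [265.1, 161.6, 243.4, 299.4],
    "Churn": [False, False, False, True],
})

first_rows = df.head(2)          # first two rows only
n_rows, n_cols = df.shape        # shape is an attribute, not a method
column_names = list(df.columns)  # column labels as a plain list
df.info()                        # prints dtypes and non-null counts, returns None
```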
  • 23. • We see that one feature is logical (bool), 3 features are of type object, and 16 features are numeric. With this same method, we can easily see if there are any missing values. Here, there are none, because each column contains 3333 observations, the same number of rows we saw before with shape. • We can change the column type with the astype method. Let's apply this to the Churn feature to convert it into int64:
      df["Churn"] = df["Churn"].astype("int64")
    • The describe method shows basic statistical characteristics of each numeric feature (int64 and float64 types): number of non-missing values, mean, standard deviation, range (minimum and maximum), median, and the 0.25 and 0.75 quartiles.
      df.describe()
    • In order to see statistics on non-numerical features, one has to explicitly indicate the data types of interest in the include parameter.
      df.describe(include=["object", "bool"])
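A minimal sketch of astype and describe, again on made-up data rather than the real telecom file. The boolean summary is taken before the conversion, because afterwards the column counts as numeric.

```python
import pandas as pd

# Hypothetical miniature of the telecom table.
df = pd.DataFrame({
    "Total day minutes": [265.1, 161.6, 243.4, 299.4],
    "Churn": [False, False, False, True],
})

# Summary of the boolean feature: count, unique, top, freq.
bool_summary = df.describe(include=["bool"])

# Convert the boolean column to integers (False -> 0, True -> 1).
df["Churn"] = df["Churn"].astype("int64")

# With no arguments, describe() covers the numeric columns only.
numeric_summary = df.describe()
print(numeric_summary)
```

Once Churn holds 0 and 1, its mean in the numeric summary can be read directly as the churn rate of the sample.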
  • 24. • To delete columns or rows, use the drop method, passing the required indexes and the axis parameter (1 if you delete columns, and nothing or 0 if you delete rows). The inplace argument tells whether to change the original DataFrame. With inplace=False, the drop method doesn't change the existing DataFrame and returns a new one with the rows or columns dropped. With inplace=True, it alters the DataFrame.
      # get rid of two columns (assumes columns with these names were created earlier)
      df.drop(["Total charge", "Total calls"], axis=1, inplace=True)
      # and here is how you can delete rows
      df.drop([1, 2]).head()
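The difference between the two inplace settings is easy to check on a throwaway table. The frame and its column names below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with three columns and four rows.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [5, 6, 7, 8]})

# inplace defaults to False: a new DataFrame is returned, df is untouched.
without_b = df.drop(["b"], axis=1)

# axis defaults to 0: the labels refer to rows.
without_rows = df.drop([1, 2])

# inplace=True modifies df itself and returns None.
df.drop(["c"], axis=1, inplace=True)
```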
  • 25. Reading Data • To read the data that we downloaded, we first create a new notebook called Open Government Data Analysis and open it. Then, after ensuring that the educ_figdp_1_Data.csv file is stored where the path below expects it, relative to our notebook directory, we write the following code to read and show the content:
      edu = pd.read_csv('files/ch02/educ_figdp_1_Data.csv', na_values=':', usecols=["TIME", "GEO", "Value"])
      edu
    • Besides this, Pandas also has functions for reading files in formats such as Excel, HDF5, tabulated files, or even the content of the clipboard (read_excel(), read_hdf(), read_table(), read_clipboard()). • If we want to know the names of the columns or the names of the indexes, we can use the DataFrame attributes columns and index respectively. The names of the columns or indexes can be changed by assigning a new list of the same length to these attributes. The values of any DataFrame can be retrieved as a NumPy array through its values attribute. If we just want quick statistical information on all the numeric columns in a DataFrame, we can use the function describe(). The result shows the count, the mean, the standard deviation, the minimum and maximum, and the percentiles (by default the 25th, 50th and 75th) for all the values in each column or Series.
      edu.describe()
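The effect of na_values and usecols can be seen without the real file by reading a few lines of CSV text from memory. The rows and the extra Flag column in this sketch are invented; only the three column names and the ':' missing-value marker follow the slide.

```python
import io
import pandas as pd

# Hypothetical excerpt in the same layout; ':' marks a missing value.
csv_text = """TIME,GEO,Value,Flag
2000,Spain,4.28,
2001,Spain,4.24,
2000,France,:,
2001,France,5.95,
"""

edu = pd.read_csv(
    io.StringIO(csv_text),            # a file path would normally go here
    na_values=":",                    # treat ':' as NaN
    usecols=["TIME", "GEO", "Value"]  # skip every other column
)

stats = edu.describe()  # count, mean, std, min, quartiles, max per numeric column
print(edu)
```

Because ':' is turned into NaN at load time, the Value column is parsed as floating point instead of text, and describe() counts only the three non-missing entries.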
  • 26. Selecting Data • If we want to select a subset of data from a DataFrame, we indicate this subset using square brackets ([ ]) after the DataFrame. The subset can be specified in several ways. To select only one column, we just put its name between the square brackets. The result is a Series, not a DataFrame, because only one column is retrieved:
      edu['Value']
    • To select a subset of rows, we indicate a range of rows separated by a colon (:) inside the square brackets. This is commonly known as a slice of rows:
      edu[10:14]
    • This instruction returns the slice of rows from the 10th to the 13th position. In the example data, the last row returned is:
      13   2001   European Union (27 countries)   4.99
    Note that the slice does not use the index labels as references, but the position. In this case, the labels of the rows simply coincide with the position of the rows. If we want to select a subset of columns and rows using the labels as our references instead of the positions, we can use loc indexing (the older ix indexer that served this purpose has been removed from recent versions of Pandas):
      edu.loc[90:94, ['TIME', 'GEO']]
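The distinction between position-based slices and label-based selection only becomes visible when the row labels differ from the positions, so this sketch uses a made-up table whose index starts at 100.

```python
import pandas as pd

# Hypothetical data; the index labels (100..104) deliberately differ from positions (0..4).
edu = pd.DataFrame(
    {
        "TIME": [2000, 2001, 2002, 2003, 2004],
        "GEO": ["A", "B", "C", "D", "E"],
        "Value": [4.9, 5.0, 5.1, 5.2, 5.3],
    },
    index=[100, 101, 102, 103, 104],
)

column = edu["Value"]                          # one column -> Series
by_position = edu[1:3]                         # positions 1 and 2, end excluded
by_label = edu.loc[101:103, ["TIME", "GEO"]]   # labels 101..103, end included
```

Note that the label-based slice includes its end point, while the positional slice does not.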
  • 27. Filtering Data • Another way to select a subset of data is by applying Boolean indexing. This indexing is commonly known as a filter. For instance, if we want to keep only the values greater than 6.5 (filtering out those less than or equal to 6.5), we can do it like this:
      edu[edu['Value'] > 6.5].tail()
    • The Boolean operation edu['Value'] > 6.5 produces a Boolean mask. When an element in the "Value" column is greater than 6.5, the corresponding value in the mask is set to True, otherwise it is set to False. Then, when this mask is applied as an index in edu[edu['Value'] > 6.5], the result is a filtered DataFrame containing only rows with values higher than 6.5. Of course, any of the usual Boolean operators can be used for filtering: < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), == (equal to), and != (not equal to).
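The two steps of a filter, building the mask and then applying it, can be separated to see what each one produces. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical values around the 6.5 threshold used on the slide.
edu = pd.DataFrame({"GEO": ["A", "B", "C", "D", "E"],
                    "Value": [4.9, 6.6, 7.2, 6.5, 8.81]})

mask = edu["Value"] > 6.5  # Boolean Series, one True/False per row
high = edu[mask]           # keeps only the rows where the mask is True
print(high)
```

The row with exactly 6.5 is dropped because the comparison is strict; using >= would keep it.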
  • 28. Filtering Missing Values • Pandas uses the special value NaN (not a number) to represent missing values. In Python, NaN is a special floating-point value returned by certain operations when one of their results ends in an undefined value. A subtle feature of NaN values is that two NaN are never equal. Because of this, the only safe way to tell whether a value is missing in a DataFrame is by using the isnull() function. Indeed, this function can be used to filter rows with missing values:
      edu[edu["Value"].isnull()].head()
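The claim that two NaN values are never equal explains why an equality test cannot find missing entries. A short sketch with invented data shows both the failing comparison and the isnull() filter.

```python
import numpy as np
import pandas as pd

nan = float("nan")
nan_equals_itself = (nan == nan)  # False: NaN never compares equal, even to itself

# Hypothetical column with two missing entries.
edu = pd.DataFrame({"GEO": ["A", "B", "C"], "Value": [np.nan, 5.0, np.nan]})

found_by_equality = edu[edu["Value"] == np.nan]  # always empty
found_by_isnull = edu[edu["Value"].isnull()]     # the two rows with NaN
```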
  • 29. Manipulating Data • Once we know how to select the desired data, we can manipulate it. One of the most straightforward things we can do is to operate on columns or rows using aggregation functions. If a function is applied to a DataFrame or a selection of rows and columns, we can specify whether it should be applied to the rows for each column (setting the axis=0 keyword in the call) or to the columns for each row (setting the axis=1 keyword in the call).
      edu.max(axis=0)
    • Note that these are functions specific to Pandas, not the generic Python functions, and their implementations differ. In Python, NaN values propagate through all operations without raising an exception. In contrast, Pandas operations exclude NaN values representing missing data. For example, the Pandas max function excludes NaN values, treating them as missing, while the standard Python max function takes the mathematical interpretation of NaN and returns it as the maximum:
    Input:
      print("Pandas max function:", edu['Value'].max())
      print("Python max function:", max(edu['Value']))
    Output:
      Pandas max function: 8.81
      Python max function: nan
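Both points above, the meaning of axis and the different treatment of NaN, can be reproduced on small made-up inputs. One caveat worth stating as an assumption of this sketch: the built-in max returns NaN here because the NaN comes first in the sequence, as it does in the slide's data; with NaN in another position the built-in can return a different value.

```python
import math
import pandas as pd

# axis=0 aggregates down the rows (one result per column),
# axis=1 aggregates across the columns (one result per row).
frame = pd.DataFrame({"x": [1.0, 4.0], "y": [3.0, 2.0]})
per_column = frame.max(axis=0)
per_row = frame.max(axis=1)

# Hypothetical Series whose first entry is missing.
values = pd.Series([float("nan"), 5.0, 8.81, 4.95])
pandas_max = values.max()  # NaN is skipped as missing data
python_max = max(values)   # every comparison against the leading NaN is False, so NaN is kept

print("Pandas max function:", pandas_max)
print("Python max function:", python_max)
```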
  • 30. • Besides these aggregation functions, we can apply operations over all the values in rows, columns or a selection of both. The rule of thumb is that an operation between columns means that it is applied to each row in that column, and an operation between rows means that it is applied to each column in that row. For example, we can apply any binary arithmetic operation (+, -, *, /) to an entire column:
    Input:
      s = edu["Value"] / 100
      s.head()
    Output:
      0       NaN
      1       NaN
      2    0.0500
      3    0.0503
      4    0.0495
      Name: Value, dtype: float64
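A minimal sketch of element-wise arithmetic, using the five values visible in the slide's output as input. The second column and its values are invented to illustrate an operation between two columns.

```python
import numpy as np
import pandas as pd

edu = pd.DataFrame({
    "Value": [np.nan, np.nan, 5.00, 5.03, 4.95],
    "Other": [1.0, 2.0, 3.0, 4.0, 5.0],  # hypothetical second column
})

s = edu["Value"] / 100             # applied to every element; NaN stays NaN
total = edu["Value"] + edu["Other"]  # column-to-column operation, aligned row by row
print(s.head())
```

The result keeps the name and index of the original column, so it can be assigned back as a new column without further alignment.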
  • 31. Sorting • Another important functionality we will need when inspecting our data is sorting by columns. We can sort a DataFrame by any column using the sort_values function. If we want to see the first five rows of data sorted in descending order (i.e., from the largest to the smallest values) by the Value column, then we just need to do this:
      edu.sort_values(by='Value', ascending=False, inplace=True)
      edu.head()
    • Note that the inplace keyword means that the DataFrame will be overwritten, and hence no new DataFrame is returned. If instead of ascending=False we use ascending=True, the values are sorted in ascending order (i.e., from the smallest to the largest values). If we want to return to the original order, we can sort by the index using the sort_index function and specifying axis=0:
      edu.sort_index(axis=0, ascending=True, inplace=True)
      edu.head()
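The round trip of sorting by a column and then restoring the original order can be checked on a few invented rows. By default, rows with a missing sort key are placed last.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, one with a missing value.
edu = pd.DataFrame({"GEO": ["A", "B", "C", "D"],
                    "Value": [5.0, 8.81, np.nan, 4.95]})

# Largest first; the DataFrame is reordered in place.
edu.sort_values(by="Value", ascending=False, inplace=True)
order_after_sort = list(edu.index)

# Sorting by the index labels restores the original row order.
edu.sort_index(axis=0, ascending=True, inplace=True)
order_restored = list(edu.index)
```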
  • 32. Ranking Data • In statistics, "ranking" refers to the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively. • Now we can perform the ranking using the rank function. Note here that the parameter ascending=False makes the ranking go from the highest values to the lowest values. The Pandas rank function supports different tie-breaking methods, specified with the method parameter. In our case, we use the first method, in which ranks are assigned in the order they appear in the array, avoiding gaps between rankings. Here pivedu is assumed to be a table of the edu data with one row per country and one column per year:
      pivedu = pivedu.drop(['Euro area (13 countries)',
                            'Euro area (15 countries)',
                            'Euro area (17 countries)',
                            'Euro area (18 countries)',
                            'European Union (25 countries)',
                            'European Union (27 countries)',
                            'European Union (28 countries)'], axis=0)
      pivedu = pivedu.rename(index={'Germany (until 1990 former territory of the FRG)': 'Germany'})
      pivedu = pivedu.dropna()
      pivedu.rank(ascending=False, method='first').head()
    • If we want to make a global ranking taking into account all the years, we can sum up all the columns and rank the result. Then we can sort the resulting values to retrieve the top five countries for the last 6 years, in this way:
      totalSum = pivedu.sum(axis=1)
      totalSum.rank(ascending=False, method='dense').sort_values().head()
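The ranking steps can be sketched on toy data. The first Series uses the four numbers from the definition above; the country-by-year table is a hypothetical stand-in for pivedu with invented labels and values.

```python
import pandas as pd

# The example from the definition: 3.4, 5.1, 2.6, 7.3 -> ranks 2, 3, 1, 4.
ranks = pd.Series([3.4, 5.1, 2.6, 7.3]).rank()

# Tie handling: 'dense' gives tied values the same rank and leaves no gaps.
dense = pd.Series([10, 20, 20, 30]).rank(method="dense")

# Hypothetical stand-in for pivedu: rows are countries, columns are years.
pivedu = pd.DataFrame({2010: [5.0, 6.0, 4.5], 2011: [5.5, 6.5, 4.0]},
                      index=["A", "B", "C"])

per_year = pivedu.rank(ascending=False, method="first")  # rank within each year
total_sum = pivedu.sum(axis=1)                           # one total per country
overall = total_sum.rank(ascending=False, method="dense").sort_values()
print(overall.head())
```

Sorting the ranks in ascending order puts rank 1, the country with the largest total, at the top.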
  • 34. Plotting • Pandas DataFrames and Series can be plotted using the plot function, which uses the Matplotlib graphics library. For example, if we want to plot the accumulated values for each country over the last 6 years, we can take the Series obtained in the previous example and plot it directly by calling the plot function, as shown in the next cell:
      totalSum = pivedu.sum(axis=1).sort_values(ascending=False)
      totalSum.plot(kind='bar', style='b', alpha=0.4, title="Total Values for Country")
    • It is also possible to plot a DataFrame directly. In this case, each column is treated as a separate Series. For example, instead of plotting the accumulated value over the years, we can plot the value for each year.
      my_colors = ['b', 'r', 'g', 'y', 'm', 'c']
      ax = pivedu.plot(kind='barh', stacked=True, color=my_colors)
      ax.legend(loc='center left', bbox_to_anchor=(1, .5))
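A self-contained version of the two plots, drawn from a hypothetical country-by-year table. The Agg backend line is an assumption made so the sketch also runs outside a notebook, without a display; inside Jupyter it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for pivedu: rows are countries, columns are years.
pivedu = pd.DataFrame({2010: [5.0, 6.0, 4.5], 2011: [5.5, 6.5, 4.0]},
                      index=["A", "B", "C"])

# Plot a Series: one bar per country, largest total first.
total_sum = pivedu.sum(axis=1).sort_values(ascending=False)
ax_total = total_sum.plot(kind="bar", alpha=0.4, title="Total Values for Country")

# Plot a DataFrame: each column (year) becomes its own stacked segment.
fig, ax = plt.subplots()
ax_years = pivedu.plot(kind="barh", stacked=True, ax=ax)
ax_years.legend(loc="center left", bbox_to_anchor=(1, 0.5))  # legend outside the axes
```

Both calls return a Matplotlib Axes object, so any further styling (labels, limits, saving to a file) uses the ordinary Matplotlib API.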
  • 35. Why a Toolbox Is an Improved Version of a Sub-functional Language • A toolbox offers features beyond those of a bare programming language, and toolboxes can be seen as an updated and improved version of any kind of sub-functional language, because a toolbox bundles the features we need on top of the language itself. In a toolbox, every function is available whenever it is needed, whereas a bare programming language does not ship all of those features in a single package. We can call the built-in functions stored in a toolbox anywhere in a program without introducing complications or errors. A sub-functional programming language does not offer such built-in functions, so we have to declare each function ourselves before using it, and hand-declared functions sometimes produce errors such as missing arguments or faulty logic. Considering all of this, we can conclude that a toolbox is an improved version of a plain programming or sub-functional language.
  • 36. Conclusion • Data science is like the sea, and the tools that data scientists use are like the elements dissolved in the sea water. To handle such a massive task, we need a complete package that runs efficiently, and toolboxes are the smart way in which data scientists handle it: they help data scientists work more efficiently while keeping performance in mind. Python's ecosystem deserves special mention because it brings all of these things together. It offers a data scientist a complete package for carrying out tasks efficiently in any data science project.