2. Introduction to Data Science
• Data science is the field of study that combines domain expertise, programming
skills, and knowledge of mathematics and statistics to extract meaningful insights
from data. It draws on multiple fields, including statistics, scientific
methods, artificial intelligence (AI), and data analysis.
Those who practice data science are called data scientists. They combine a
range of skills to analyze data collected from the web, smartphones, customers,
sensors, and other sources and to derive actionable insights.
• Data science encompasses preparing data for analysis, including cleansing,
aggregating, and manipulating the data to perform advanced
data analysis.
3. Features of Data Science
• Responsive Construct
• Flexible
• Easily Trainable
• Feature Columns
• Open Source
• Parallel Network Training
• Visualizer
• Availability of Statistical Distributions
• Layered Components
• Event Logger
4. Sectors where data science is used
Financial industry
Travel industry
Manufacturing
Banking sector
Education
Gaming
5. Purpose of Python in data science
∙ Python has an elegant syntax, so programs are easier to read.
∙ It is an easy language to pick up, which makes it simple to get a
program working.
∙ It has a large standard library and strong community support.
∙ Python's interactive mode makes it simple to test code.
∙ Python is an expressive language.
7. Data Analysis
• Data Analysis is a process of collecting, transforming, cleaning, and
modeling data with the goal of discovering the required information.
• A simple example of data analysis: whenever we make a decision in our
day-to-day life, we think about what happened last time or what will
happen if we choose a particular option. This is nothing but analyzing
our past or future and making decisions based on it.
8. The Data Analysis Process consists of the following phases, which
are iterative in nature:
9. Data Analysis
Data Requirements Specification
❖ The data required for analysis is based on a question or an experiment. Based on the requirements
of those directing the analysis, the data necessary as inputs to the analysis is identified (e.g.,
a population of people).
Data Collection
❖ Data Collection is the process of gathering information on targeted variables identified as
data requirements.
Data Processing
❖ The data that is collected must be processed or organized for analysis.
10. Data Analysis
Data Cleaning
❖ The processed and organized data may be incomplete, contain
duplicates, or contain errors. Data Cleaning is the process of
preventing and correcting these errors.
Data Analysis
❖ Data that is processed, organized and cleaned would be ready for the
analysis. Various data analysis techniques are available to understand,
interpret, and derive conclusions based on the requirements.
Communication
❖ The results of the data analysis are to be reported in a format as
required by the users to support their decisions and further action.
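The cleaning, analysis, and communication phases described above can be sketched with pandas. This is a minimal illustration rather than a prescribed workflow: the records, the column names (`customer`, `age`, `spend`), and the choice of median imputation are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw records: one exact duplicate row and one missing age.
raw = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D"],
    "age": [34, 41, 41, None, 29],
    "spend": [120.0, 80.0, 80.0, 60.0, 200.0],
})

# Data cleaning: drop exact duplicate rows, then fill the missing age
# with the median of the observed ages (one possible strategy among many).
clean = raw.drop_duplicates().reset_index(drop=True)
clean["age"] = clean["age"].fillna(clean["age"].median())

# Data analysis: a simple summary statistic on the cleaned data.
average_spend = clean["spend"].mean()

# Communication: report the result in a readable form.
print(f"Rows after cleaning: {len(clean)}, average spend: {average_spend:.2f}")
```

Whether to drop, fill, or flag problem rows depends on the requirements of those directing the analysis; the median fill here is only one option.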
11. EDA (Exploratory Data Analysis)
• Exploratory data analysis (EDA) is a method of analyzing and investigating
data sets to summarize their main characteristics.
• EDA focuses more narrowly on checking the assumptions required for model fitting
and hypothesis testing. It also covers handling missing values and making
transformations of variables as needed.
• EDA builds a robust understanding of the data and of the issues associated with either the
data or the process. It is a scientific approach to getting the story of the data.
12. EDA Process
Step 1: Import the Python libraries.
Step 2: Read the data from a CSV file.
Step 3: head() - By default, it returns the first 5 rows of the DataFrame.
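Steps 1 to 3 might look like the sketch below. The CSV content and its column names are hypothetical and are embedded as a string so the example runs on its own; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd  # Step 1: import the library

# Hypothetical CSV content standing in for a file on disk.
csv_text = """name,age,city,salary
Asha,28,Pune,52000
Ben,35,Delhi,61000
Chen,42,Pune,75000
Dina,31,Mumbai,58000
Eli,25,Delhi,48000
Fay,38,Mumbai,69000
Gus,45,Pune,82000
"""

# Step 2: read the data (a file path would normally go here).
df = pd.read_csv(io.StringIO(csv_text))

# Step 3: head() returns the first 5 rows by default.
first_rows = df.head()
print(first_rows)
```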
13.
• Step 4: tail() - By default, it returns the last 5 rows of the DataFrame. Calling tail(n) returns the last n
rows of the object, based on position.
• Step 5: describe() - Returns a statistical summary of the numerical columns present in the dataset.
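A short sketch of `tail()` and `describe()` on a small, made-up DataFrame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina", "Eli", "Fay", "Gus"],
    "age": [28, 35, 42, 31, 25, 38, 45],
    "salary": [52000, 61000, 75000, 58000, 48000, 69000, 82000],
})

last_five = df.tail()    # last 5 rows by default
last_two = df.tail(2)    # last n rows, here n = 2

# describe() summarizes numeric columns only (count, mean, std, min,
# quartiles, max); the text column "name" is left out by default.
summary = df.describe()
print(summary)
```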
14.
• Step 6: shape - Returns a tuple giving the size of each dimension: (number of rows, number of columns).
• Step 7: columns - Returns the column labels of the DataFrame.
• Step 8: nunique() - Returns the number of unique elements in the object. It counts the number of unique
entries over columns or rows.
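The three inspections above can be tried on a small, hypothetical DataFrame; note that `shape` and `columns` are attributes, while `nunique()` is a method.

```python
import pandas as pd

# Hypothetical data used only to illustrate the calls.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina", "Eli", "Fay", "Gus"],
    "age": [28, 35, 42, 31, 25, 38, 45],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Mumbai", "Pune"],
    "salary": [52000, 61000, 75000, 58000, 48000, 69000, 82000],
})

n_rows, n_cols = df.shape          # (rows, columns) tuple
labels = list(df.columns)          # column labels
unique_per_column = df.nunique()   # unique values in each column

print(n_rows, n_cols)
print(labels)
print(unique_per_column)
```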
15.
• Step 9: isnull().sum() - Returns the number of missing values in each column.
• Step 10: drop() - Used for removing columns (it can also remove rows).
• Step 11: Correlation is a measurement that describes the strength and direction of the relationship between two variables.
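Steps 9 to 11 might be combined as in this sketch. The DataFrame is invented for the example, and `corr()` is used with its default Pearson method, which is one common choice rather than the only way to measure correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values and one non-numeric column.
df = pd.DataFrame({
    "age": [28, 35, np.nan, 31, 25],
    "experience": [3, 10, 17, 6, 1],
    "salary": [52000, 61000, 75000, np.nan, 48000],
    "notes": ["a", "b", "c", "d", "e"],
})

# Step 9: count missing values per column.
missing_per_column = df.isnull().sum()

# Step 10: remove a column that is not needed for the analysis.
numeric = df.drop(columns=["notes"])

# Step 11: pairwise correlation between the remaining numeric columns
# (Pearson by default; rows with missing values are skipped per pair).
corr = numeric.corr()

print(missing_per_column)
print(corr)
```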
16.
• Step 12: A correlation heatmap displays a 2D correlation matrix using colored cells, usually drawn from a
monochromatic or diverging color scale. The variables appear as both the rows and the columns of the table,
and the color of each cell represents the correlation coefficient between the corresponding pair of variables.
• Step 13: pairplot() is a function in the seaborn library for plotting multiple pairwise bivariate distributions
in a dataset. It shows the relationship for each pair of variables in a DataFrame as a matrix of plots; the
diagonal plots are the univariate plots.
17. TYPES OF EXPLORATORY DATA ANALYSIS (EDA)
❖ There are four types of EDA:
1. Univariate Non-graphical
2. Univariate graphical
3. Multivariate Non-graphical
4. Multivariate graphical
18. TYPES OF EXPLORATORY DATA ANALYSIS (EDA)
Univariate non-graphical:
❖ This is the simplest form of data analysis among the four options.
In this type of analysis, the data that is being analysed consists of
just a single variable.
Univariate graphical:
❖ Unlike the non-graphical method, the graphical method gives a
fuller picture of the data. The three main methods of analysis
under this type are the histogram, the stem-and-leaf plot, and the box plot.
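Both univariate approaches can be sketched on a single variable. The values below are invented, and the histogram and box plot are drawn with matplotlib as one possible tool.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical single variable, with one unusually large value.
ages = pd.Series([28, 35, 42, 31, 25, 38, 45, 33, 29, 90], name="age")

# Univariate non-graphical: summary statistics for the one variable.
summary = ages.describe()
print(summary)

# Univariate graphical: a histogram and a box plot side by side.
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(ages, bins=5)
ax_hist.set_title("Histogram of age")
ax_box.boxplot(ages)
ax_box.set_title("Box plot of age")
```

The large value is easy to miss in the summary table but stands out in both plots, which is the practical advantage of the graphical method.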
19. TYPES OF EXPLORATORY DATA ANALYSIS (EDA)
Multivariate non-graphical:
❖ The multivariate non-graphical EDA technique is generally used to show the relationship
between two or more variables in the form of either cross-tabulation or
statistics.
Multivariate graphical:
❖ This type of EDA displays the relationship between two or more sets of data. A common
example is a grouped bar chart, where each group represents a level of one of the variables
and each bar within the group represents a level of the other variable.
Other common types of multivariate graphics are:
• Scatterplot
• Run chart
• Heat map
• Multivariate chart
• Bubble chart
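For the multivariate non-graphical case, a cross-tabulation and a grouped statistic could be computed as below. The columns `city`, `segment`, and `salary` are hypothetical names chosen for the sketch.

```python
import pandas as pd

# Hypothetical data: two categorical variables and one numeric variable.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Mumbai", "Pune"],
    "segment": ["retail", "retail", "corporate", "retail",
                "corporate", "corporate", "retail"],
    "salary": [52000, 61000, 75000, 58000, 48000, 69000, 82000],
})

# Cross-tabulation: counts for each combination of the two categories.
table = pd.crosstab(df["city"], df["segment"])

# Statistics: mean of the numeric variable within each category.
mean_salary = df.groupby("city")["salary"].mean()

print(table)
print(mean_salary)
```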
20. EXPLORATORY DATA ANALYSIS (EDA) TOOLS
Python :
• EDA can be done using Python, for example to identify the missing values in a data set.
Other tasks that can be performed include describing the data, handling outliers,
and getting insights through plots. Python's high-level built-in data structures,
dynamic typing, and dynamic binding make it an attractive tool for EDA.
• Analyzing a dataset is a demanding task that takes a lot of time. Python provides
open-source modules that can automate much of the EDA process and help save time.
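As one illustration of the outlier handling mentioned above, the sketch below flags values outside 1.5 times the interquartile range (IQR). The 1.5 x IQR rule is a common convention assumed here, not something the slides prescribe, and the readings are made up.

```python
import pandas as pd

# Hypothetical readings with one missing value and one extreme value.
values = pd.Series([12, 15, 14, None, 13, 16, 95, 14], name="reading")

# Identify missing values, then work with the observed ones.
missing_count = int(values.isnull().sum())
observed = values.dropna()

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1 = observed.quantile(0.25)
q3 = observed.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = observed[(observed < lower) | (observed > upper)]

print(f"missing: {missing_count}, bounds: ({lower}, {upper})")
print(outliers)
```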
R:
• The R language is used widely by data scientists and statisticians for
developing statistical observations and data analysis.
• R is an open-source programming language that provides a free software
environment for statistical computing and graphics, supported by the R
Foundation for Statistical Computing.