UNIT 2 INTRODUCTION TO DATA SCIENCE
Introduction
• Data science deals with extracting useful knowledge from large volumes of
data to solve business problems by following a defined process
• Data science includes data analysis as an important component
Components of data Science
• Statistics: Collecting and analyzing large amounts of numerical data to
find meaningful insights.
• Visualization: Representing data in a visual context to make it easier to
understand.
• Data Engineering: This includes acquiring, storing, retrieving and
transforming data.
• Advanced computing: This includes designing, writing, debugging and
maintaining the source code of computer programs.
• Machine learning: Training machines to learn from data.
Advantages of Data Science
• Faster and better decision making
• Improves marketing and sales
• Screening of CVs, making the recruitment process easier
• Reaching customers
Disadvantage of Data Science
• Information can be misused.
• Tools used for data science and analysis are expensive.
• Tools are complex to understand
Application of Data Science
• Fraud and risk detection
• Health care
• Virtual assistance for patients and customer support
• Internet search
• Website recommendation
• Advanced image recognition
• Speech recognition
• Airline route planning
• Gaming
• Augmented reality
Data Science Process
Step 1: Frame the problem
Step 2: Collect the raw data needed for your problem
Step 3: Process the data for analysis
Step 4: Explore the data
Step 5: Perform in-depth analysis
Step 6: Communicate results of the analysis
Basics of Data Analysis
• Data analytics is the science of examining raw data with the purpose
of drawing conclusions about that information.
• Data Analytics is a process of inspecting, cleansing, transforming and
modeling data with the goal of discovering useful information, and
supporting decision making.
What is Analytics
• Data: raw, unorganized values.
• “data are a set of values of qualitative or quantitative variables about
one or more persons or objects, while a datum (singular of data) is a
single value of a single variable.”
• Information: when we analyze raw data, the understanding it provides is
called information.
• Data analytics is the science of analyzing raw data in order to make
conclusions about that information. Many of the techniques and
processes of data analytics have been automated into mechanical
processes and algorithms that work over raw data for human
consumption.
• Data analysis involves several different steps:
• The first step is to determine the data requirements or how the data is grouped. Data
may be separated by age, demographic, income, or gender. Data values may be
numerical or be divided by category.
• The second step is collecting the data. This can be done
through a variety of sources such as computers, online sources, cameras, environmental
sources, or personnel.
• Once the data is collected, it must be organized so it can be analyzed. Organization
may take place on a spreadsheet or other form of software that can take statistical data.
• The data is then cleaned up before analysis. This means it is scrubbed and checked to
ensure there is no duplication or error, and that it is not incomplete. This step helps
correct any errors before it goes on to a data analyst to be analyzed.
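As a minimal sketch of the organize-and-clean steps above (assuming pandas and a hypothetical file survey.csv with age, gender and income columns):

```python
import pandas as pd

# Organize: load the collected data into a tabular structure
df = pd.read_csv("survey.csv")            # hypothetical raw data file

# Clean: remove duplicates and incomplete records before analysis
df = df.drop_duplicates()
df = df.dropna(subset=["age", "income"])  # drop rows missing key values

# Group: separate the data by a demographic attribute and summarize
print(df.groupby("gender")["income"].mean())
```

The broader data analytics lifecycle that these steps feed into is commonly described in six phases: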
Phase 1: Data Discovery and Formation
Phase 2: Data Preparation and Processing
Phase 3: Design a Model
Phase 4: Model Building
Phase 5: Result Communication and Publication
Phase 6: Measuring of Effectiveness
Phase 1: Data Discovery and Formation
• Everything begins with a defined goal. In this phase, you’ll define your
data’s purpose and how to achieve it by the time you reach the end of
the data analytics lifecycle.
• Essential activities in this phase include structuring the business
problem in the form of an analytics challenge and formulating the
initial hypotheses (IHs) to test and start learning the data. The
subsequent phases are then based on achieving the goal that is
drawn in this stage.
Phase 2: Data Preparation and Processing
• This stage consists of everything that has anything to do with data. In
phase 2, the attention of experts moves from business requirements
to information requirements.
• The data preparation and processing step involves collecting,
processing, and cleansing the accumulated data.
Data is collected using the following methods:
• Data Acquisition: Accumulating information from external sources.
• Data Entry: Creating new data points within the enterprise using digital
systems or manual data entry techniques.
• Signal Reception: Capturing information from digital devices, such as
control systems and the Internet of Things.
Phase 3: Design a Model
• After mapping out your business goals and collecting a glut of data (structured,
unstructured, or semi-structured), it is time to build a model that utilizes the data
to achieve the goal.
• There are several techniques available to load data into the system and start
studying it:
• ETL (Extract, Transform, and Load) transforms the data first using a set of
business rules, before loading it into a sandbox.
• ELT (Extract, Load, and Transform) first loads raw data into the sandbox and
then transforms it.
• ETLT (Extract, Transform, Load, Transform) is a mixture; it has two
transformation levels.
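A minimal sketch of the ETL vs. ELT difference, assuming pandas, a hypothetical orders.csv extract, and a SQLite database standing in for the sandbox:

```python
import sqlite3
import pandas as pd

sandbox = sqlite3.connect("sandbox.db")   # stand-in for the analytics sandbox

# ETL: apply the business rule first, then load the transformed data
raw = pd.read_csv("orders.csv")           # hypothetical extract
valid = raw[raw["amount"] > 0]            # business rule: keep only valid orders
valid.to_sql("orders_etl", sandbox, if_exists="replace", index=False)

# ELT: load the raw extract as-is, then transform it inside the sandbox
raw.to_sql("orders_raw", sandbox, if_exists="replace", index=False)
transformed = pd.read_sql("SELECT * FROM orders_raw WHERE amount > 0", sandbox)
```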
Phase 4: Model Building
• This step of data analytics architecture comprises developing data
sets for testing, training, and production purposes. The data analytics
experts meticulously build and operate the model that they had
designed in the previous step.
• They rely on tools and several techniques like decision trees,
regression techniques and neural networks for building and executing
the model. The experts also perform a trial run of the model to
observe if the model corresponds to the datasets.
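As an illustration of such a trial run, here is a hedged sketch using scikit-learn's bundled iris dataset and a decision tree (any of the techniques listed above could be substituted):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Data sets for training and testing purposes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build and execute the designed model (a decision tree here)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Trial run: observe whether the model corresponds to the datasets
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```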
Phase 5: Result Communication and Publication
• Now is the time to check whether the success criteria defined at the start of
the project are met by the tests you have run in the previous phase.
• The communication step starts with a collaboration with major
stakeholders to determine if the project results are a success or
failure. The project team is required to identify the key findings of the
analysis, measure the business value associated with the result, and
produce a narrative to summarise and convey the results to the
stakeholders.
Phase 6: Measuring of Effectiveness
• The final step is to provide a detailed report with key findings, code,
briefings, and technical papers/documents to the stakeholders.
• Additionally, to measure the analysis’s effectiveness, the data is
moved to a live environment from the sandbox and monitored to
observe if the results match the expected business goal. If the
findings are as per the objective, the reports and the results are
finalized. However, if the outcome deviates from the intent set
out in Phase 1, you can move backward in the data analytics
lifecycle to any of the previous phases to change your input and get a
different output.
TYPES OF ANALYTICS
Descriptive analytics
• What happened?
• What is happening?
• Descriptive analytics answers the question of what happened.
• Descriptive analytics juggles raw data from multiple data sources to
give valuable insights into the past. However, these findings simply
signal that something is wrong or right, without explaining why. For
this reason, highly data-driven companies should not settle for
descriptive analytics only; they should combine it with other types of
data analytics.
• An example of this could be a monthly profit and loss statement
Diagnostic analytics
• At this stage, historical data can be measured against other data to
answer the question of why something happened.
• For example, if you’re conducting a social media marketing campaign,
you may be interested in assessing the number of likes, reviews,
mentions, followers or fans. Diagnostic analytics can help you distill
thousands of mentions into a single view so that you can make
progress with your campaign.
• Diagnostic analytics gives in-depth insights into a particular problem.
Predictive analytics
• Predictive analytics tells what is likely to happen
• Predictive analytics is the use of data, machine learning techniques,
and statistical algorithms to determine the likelihood of future results
based on historical data. The primary goal of predictive analytics is to
help you go beyond just what has happened and provide the best
possible assessment of what is likely to happen in future.
• Predictive analytics can be used in banking systems to detect fraud
cases, measure the levels of credit risks, and maximize the cross-sell
and up-sell opportunities in an organization. This helps to retain
valuable clients for your business.
Prescriptive analytics
• The purpose of prescriptive analytics is to literally prescribe what
action to take to eliminate a future problem or take full advantage of
a promising trend.
Statistical Inference
Statistical inference is the process of using data
analysis to deduce properties of an underlying
distribution of probability.
Inferential statistical analysis infers properties of a
population, for example by testing hypotheses
and deriving estimates. It is assumed that the
observed data set is sampled from a larger
population.
Statistical Estimation
• An estimator is a statistic computed from sample data that provides an
estimate of a population parameter.
• The sample mean, x̄, is a point estimator for the population mean, μ.
• Example: The mean age of men attending a show is 32 years.
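A small illustration of point estimation with NumPy; the population of ages is synthetic, generated only to make the idea concrete:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.normal(loc=35, scale=8, size=100_000)   # ages of all attendees (synthetic)

sample = rng.choice(population, size=50, replace=False)  # the attendees we actually surveyed
print("sample mean (point estimate):", round(sample.mean(), 1))
print("population mean (unknown in practice):", round(population.mean(), 1))
```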
Statistical Hypothesis testing
• Hypothesis testing is an act in statistics whereby an analyst tests an
assumption regarding a population parameter. The methodology
employed by the analyst depends on the nature of the data used and
the reason for the analysis. Hypothesis testing is used to assess the
plausibility of a hypothesis by using sample data.
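A minimal sketch of a one-sample t-test with SciPy, testing the assumption that the population mean age equals 30; the sample values are invented for illustration:

```python
from scipy import stats

ages = [29, 34, 31, 36, 28, 33, 35, 30, 32, 37]   # hypothetical sample data

# H0: population mean age = 30,  H1: population mean age != 30
t_stat, p_value = stats.ttest_1samp(ages, popmean=30)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A small p-value (e.g. < 0.05) would lead the analyst to reject H0
```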
Population and sample
• A population is the entire group that you want to draw conclusions
about.
• A sample is the specific group that you will collect data from. The size
of the sample is always less than the total size of the population.
• In research, a population doesn’t always refer to people. It can mean
a group containing elements of anything you want to study, such as
objects, events, organizations, countries, species, organisms, etc.
Reasons for sampling
Necessity: Sometimes it’s simply not possible to study the
whole population due to its size or inaccessibility.
Practicality: It’s easier and more efficient to collect data from
a sample.
Cost-effectiveness: There are fewer participant, laboratory,
equipment, and researcher costs involved.
Manageability: Storing and running statistical analyses on
smaller datasets is easier and more reliable.
Statistical modeling
• Statistical modeling is the process of applying statistical analysis to a
dataset. A statistical model is a mathematical representation (or
mathematical model) of observed data.
• When data analysts apply various statistical models to the data they
are investigating, they are able to understand and interpret the
information more strategically.
• “When you analyze data, you are looking for patterns.”
Steps of the Statistical Model Building Process
Model Selection
• Based on the defined goal(s) (supervised or unsupervised), we have to select one modeling technique, or a
combination of techniques, such as:
• General linear model
• Non-Linear Regression
• Linear Regression
• Ridge Regression
• Non-Negative Garrotte Regression
• Percentage Regression
• Quantile Regression
• Non-parametric regression
• Logistic Regression
• Probit Regression
• Classification/Decision Trees
• Random Forest
• Support Vector Machine (SVM)
• Distance metric learning
• Bayesian methods
• Graphical Models
• Neural Networks
• Genetic Algorithm
• The Hazard and Survival Functions
• Time Series Models
• Signal Processing
• Clustering Techniques
• Market Basket Analysis
• Frequent Itemset Mining
• Association Rule Mining etc.
Build/Develop/Train Models/Model fitting
• Validate the assumptions of the chosen algorithm
• Check for redundancies of the independent variables (features).
Sometimes in machine learning we are mainly concerned with model
accuracy, and hence we may not perform these checks!
• Develop/Train the model on a training sample, which is typically
80%/70%/60%/50% of the available data (population)
• Check Model performance - Error, Accuracy
• Validate/Test Models
• Score and Predict using Test Sample
• Check for the robustness and stability of the model
• Check Model Performance: Accuracy, ROC, AUC, KS, GINI etc.
• AUC (Area Under The Curve)
• ROC (Receiver Operating Characteristics) curve
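A hedged sketch of the validate/test step with scikit-learn (breast-cancer demo data and logistic regression are used only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)

# 70% training sample / 30% test sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Score and predict using the test sample, then check performance
pred = model.predict(X_test)
prob = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, pred))
print("AUC (area under the ROC curve):", roc_auc_score(y_test, prob))
```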
Probability
• Probability theory is a branch of mathematics concerned with the
analysis of random phenomena. The outcome of a random event
cannot be determined before it occurs, but it may be any one of
several possible outcomes. The actual outcome is considered to be
determined by chance.
• The set of all possible outcomes of an experiment is called a “sample
space.”
• The experiment of tossing a coin once results in a sample space with
two possible outcomes, “heads” and “tails.”
• Tossing two dice has a sample space with 36 possible outcomes
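The two-dice sample space is easy to verify by enumeration in Python:

```python
from itertools import product

sample_space = list(product(range(1, 7), repeat=2))   # all ordered (die 1, die 2) outcomes
print(len(sample_space))                               # 36
print(sum(1 for a, b in sample_space if a + b == 7) / len(sample_space))  # P(sum = 7) = 6/36
```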
Probability and data science
• Randomness and uncertainty are everywhere in the world, and thus it
can prove to be immensely helpful to understand and know the
chances of various events. Learning probability helps you in
making informed decisions about the likelihood of events, based on
patterns in collected data.
• In the context of data science, statistical inferences are often used to
analyze or predict trends from data, and these inferences
use probability distributions of data. Thus, your effectiveness in working
on data science problems depends to a good extent on probability and its
applications.
Probability distribution
• A probability distribution is a function that describes the likelihood of each possible
value that a random variable can take within a given range. For a
continuous random variable, the probability distribution is described by the
probability density function; for a discrete random variable, it is the probability
mass function that defines the probability distribution.
• Probability distributions are categorized into different classifications like binomial
distribution, chi-square distribution, normal distribution, Poisson distribution etc.
Different probability distributions represent different data generation processes
and cater to different purposes. For instance, the binomial distribution gives
the probability of a particular event occurring a certain number of times over a given number of
trials, given the probability of the event in each trial. The normal
distribution is symmetric about the mean, meaning that data closer to
the mean occur more frequently than data far from the
mean.
Discrete Probability Distributions
• Binomial Distribution
A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE
outcome in an experiment or survey that is repeated multiple times. The binomial is a type of
distribution that has two possible outcomes (the prefix “bi” means two, or twice). For example, a coin
toss has only two possible outcomes, heads or tails, and taking a test could have two possible
outcomes: pass or fail.
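For example, the probability of getting exactly 3 heads in 10 fair coin tosses can be computed with SciPy:

```python
from scipy.stats import binom

n, p = 10, 0.5                  # 10 coin tosses, P(heads) = 0.5 on each toss
print(binom.pmf(3, n, p))       # P(exactly 3 heads) ≈ 0.117
print(binom.cdf(3, n, p))       # P(at most 3 heads) ≈ 0.172
```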
• Geometric Distribution
• The geometric distribution is the probability distribution of the number X of trials needed to get one
success, supported on the set { 1, 2, 3, ... }.
• For example, suppose an ordinary die is thrown repeatedly until the
first time a "1" appears. The probability distribution of the number of
times it is thrown is supported on the infinite set { 1, 2, 3, ... } and is a
geometric distribution with p = 1/6.
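The die example translates directly into SciPy; note that scipy.stats.geom counts the trial on which the first success occurs, matching the support { 1, 2, 3, ... }:

```python
from scipy.stats import geom

p = 1 / 6                        # probability of rolling a "1" on any single throw
print(geom.pmf(1, p))            # P(first "1" on throw 1) ≈ 0.167
print(geom.pmf(3, p))            # P(first "1" on throw 3) = (5/6)**2 * (1/6) ≈ 0.116
print(geom.mean(p))              # expected number of throws = 1/p = 6
```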
• Poisson Distribution
• The Poisson distribution is a discrete probability distribution that expresses the probability of a
given number of events occurring in a fixed interval of time or space if
these events occur with a known constant mean rate and
independently of the time since the last event. The Poisson
distribution can also be used for the number of events in other
specified intervals such as distance, area or volume.
• Examples that may follow a Poisson distribution include the number
of phone calls received by a call center per hour and the number of
decay events per second from a radioactive source.
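For instance, if a call center receives on average 12 calls per hour (an assumed rate), the Poisson distribution gives:

```python
from scipy.stats import poisson

rate = 12                          # assumed mean number of calls per hour
print(poisson.pmf(15, rate))       # P(exactly 15 calls in an hour) ≈ 0.07
print(1 - poisson.cdf(20, rate))   # P(more than 20 calls) — a tail probability
```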
Continuous Probability Distributions
• Uniform Distribution
• Normal Distribution
Uniform Distribution
• In statistics, uniform distribution is a term used to describe a form of
probability distribution where every possible outcome has an equal
likelihood of happening. The probability is constant since each
value has an equal chance of being the outcome.
Normal Distribution
• The normal distribution is the most important probability distribution
in statistics because it fits many natural phenomena.
• For example, heights, blood pressure and IQ scores approximately follow the normal
distribution. It is also known as the Gaussian distribution and the bell
curve.
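A short sketch with SciPy; the mean of 170 cm and standard deviation of 10 cm for heights are assumed values for illustration:

```python
from scipy.stats import norm, uniform

heights = norm(loc=170, scale=10)           # assumed: mean 170 cm, sd 10 cm
print(heights.cdf(180) - heights.cdf(160))  # P(160 < height < 180) ≈ 0.68 (within one sd)

u = uniform(loc=0, scale=1)                 # continuous uniform on [0, 1]
print(u.cdf(0.25))                          # equal-length intervals are equally likely: 0.25
```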
Correlation
• Correlation is Positive when the values increase together, and
• Correlation is Negative when one value decreases as the other
increases
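A small illustration with NumPy; the x and y values are made up:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y_pos = np.array([2, 4, 5, 8, 10])    # increases with x -> positive correlation
y_neg = np.array([10, 8, 6, 3, 1])    # decreases as x increases -> negative correlation

print(np.corrcoef(x, y_pos)[0, 1])    # close to +1
print(np.corrcoef(x, y_neg)[0, 1])    # close to -1
```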
Regression
• Simple Regression
• Multiple Regression
Regression formula
• y = a + bx, where a is the intercept and b is the slope (regression coefficient)
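Fitting y = a + bx to data can be sketched with NumPy's least-squares polynomial fit; the data points below are invented:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 0 + 2x with noise

b, a = np.polyfit(x, y, deg=1)            # degree-1 fit returns [slope, intercept]
print(f"fitted model: y = {a:.2f} + {b:.2f}x")
print("prediction at x = 6:", a + b * 6)
```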
Regression Analysis
• Statement of the problem under consideration
• Choice of relevant variables
• Collection of data on relevant variables
• Specification of model
• Choice of method for fitting the data
• Fitting of model
• Model validation and criticism
• Using the chosen model for the solution
Application of Regression Analysis
• Predictive Analytics
• Operational efficiency
• Supporting decisions
• Correcting errors
• New insights
