Statistics for
Data Scientists
Agenda
Revision
Data
Statistics: Descriptive, Central Tendency, Variation, Distributions
Data Mining
Basics of Data Science
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
The culture of academia, which does not reward researchers for understanding technology.
Danger zone: this overlap of skills gives people the ability to create what appears to be
a legitimate analysis without any understanding of how they got there or
what they have created.
Being able to manipulate text files at the command-line,
understanding vectorized operations, thinking algorithmically;
these are the hacking skills that make for a successful data hacker.
Data plus math and statistics only gets you machine learning,
which is great if that is what you are interested in, but not if you are doing data science.
What is Business Analytics?
Definition: the study of business data using statistical techniques and
programming to create decision support and insights for achieving
business goals.
Predictive: to predict the future.
Descriptive: to describe the past.
Data
Data is a set of values of qualitative or quantitative variables. An example of qualitative
data would be an anthropologist's handwritten notes about her interviews. Data is
collected by a huge range of organizations and institutions, including businesses (e.g.,
sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment
rates, literacy rates) and non-governmental organizations (e.g., censuses of the number
of homeless people by non-profit organizations). Data is measured, collected and
reported, and analyzed, whereupon it can be visualized using graphs, images or other
analysis tools.
https://en.wikipedia.org/wiki/Data
Data is distinct pieces of information, usually formatted in a special way. All software is
divided into two general categories: data and programs. Programs are collections of
instructions for manipulating data. Data can exist in a variety of forms -- as numbers or
text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored
in a person's mind.
http://www.webopedia.com/TERM/D/data.html
Data
https://en.oxforddictionaries.com/definition/data
Definition of data in English:
data (noun)
[mass noun] Facts and statistics collected together for reference or analysis:
‘there is very little data available’
The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted
in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Philosophy: Things known or assumed as facts, making the basis of reasoning or calculation.
Variable
Something that varies
Variable
Nominal variables have two or more categories with no intrinsic order.
Dichotomous variables are nominal variables with only two categories or levels.
Ordinal variables have two or more categories, like nominal variables, but the categories can also be ordered or
ranked (e.g. from Excellent to Horrible).
Interval variables are measured along a continuum and have a numerical value (for example, temperature measured in
degrees Celsius or Fahrenheit).
Ratio variables are interval variables with the added condition that 0 (zero) of the measurement indicates that there is none of that
variable. For example, a distance of ten metres is twice the distance of five metres.
https://statistics.laerd.com/statistical-guides/types-of-variable.php
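A minimal sketch of how these variable types might be represented in R. The variable names and sample values are made up for illustration.

```r
# Nominal: categories with no intrinsic order (values are illustrative)
colour <- factor(c("red", "blue", "green", "blue"))

# Dichotomous: a nominal variable with exactly two levels
smoker <- factor(c("yes", "no", "no", "yes"))

# Ordinal: categories with a meaningful order
rating <- factor(c("Good", "Horrible", "Excellent", "Good"),
                 levels = c("Horrible", "Good", "Excellent"),
                 ordered = TRUE)

# Interval: differences are meaningful, but zero is arbitrary (degrees Celsius)
temp_c <- c(10, 20, 30)

# Ratio: a true zero, so ratios are meaningful (metres)
dist_m <- c(5, 10)

is.ordered(colour)       # FALSE
is.ordered(rating)       # TRUE
rating < "Excellent"     # order comparisons work only for ordered factors
dist_m[2] / dist_m[1]    # 2: ten metres is twice five metres
```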
Central Tendency
Mean
Arithmetic mean: the sum of the values divided by the number of values.
Geometric mean: an average that is useful for sets of positive numbers that are interpreted according to their product and
not their sum (as is the case with the arithmetic mean), e.g. rates of growth.
Median
The median is the number separating the higher half of a data sample, a population, or a probability distribution from the lower
half.
Mode
The mode is the value that occurs most often.
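These measures can be sketched in R as follows. The sample is illustrative, and `stat_mode` is a hypothetical helper: base R has no built-in function for the statistical mode.

```r
x <- c(2, 4, 4, 8, 16)     # illustrative sample of positive values

arith_mean <- mean(x)               # sum(x) / length(x) = 34 / 5
geo_mean   <- exp(mean(log(x)))     # n-th root of the product; needs x > 0
med        <- median(x)             # middle value of the sorted data

# mode() in base R reports the storage type, not the statistical mode,
# so count occurrences and take the most frequent value.
# If several values tie, this helper returns only the first of them.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

arith_mean      # 6.8
geo_mean        # about 5.28
med             # 4
stat_mode(x)    # 4
```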
Dispersion
Range
The range of a set of data is the difference between the largest and smallest values.
Variance
The mean of the squared differences between each value and the mean.
Standard Deviation
The square root of the variance.
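A short R sketch of these measures on an illustrative sample. Note that the definition above divides by n (population variance), whereas R's `var()` and `sd()` divide by n - 1 (sample versions).

```r
x <- c(2, 4, 4, 8, 16)        # illustrative sample

data_range <- max(x) - min(x)         # range(x) returns c(min, max) instead

# Population variance: the mean of squared deviations from the mean (divides by n)
pop_var <- mean((x - mean(x))^2)
pop_sd  <- sqrt(pop_var)

# R's var() and sd() compute the sample versions (divide by n - 1)
n <- length(x)
sample_var <- var(x)
sample_sd  <- sd(x)

data_range                                     # 14
pop_var                                        # 24.96
sample_var                                     # 31.2
all.equal(pop_var, sample_var * (n - 1) / n)   # TRUE
```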
Frequency
A frequency distribution is a table that displays the frequency of various outcomes in a sample.
Distribution
The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of
the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of
individuals in each group.
http://www.dummies.com/education/math/statistics/what-the-distribution-tells-you-about-a-statistical-data-set/
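As a sketch, a frequency distribution can be built in R with `table()`, and the percentages per group with `prop.table()`. The data below are invented for illustration.

```r
# Illustrative categorical sample
grade <- c("A", "B", "B", "C", "A", "B", "C", "B")

freq     <- table(grade)        # counts per category
rel_freq <- prop.table(freq)    # proportions that sum to 1

freq
rel_freq * 100                  # percentages per category

# For numeric data, group values into intervals before counting
scores   <- c(52, 67, 71, 78, 85, 90, 94)
interval <- cut(scores, breaks = c(50, 70, 90, 100))
table(interval)
```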
Distributions
Normal
The simplest case of a normal distribution is known as the standard normal distribution. This is a special case where μ = 0 and σ = 1.
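Any normal variable can be mapped onto the standard normal by subtracting the mean and dividing by the standard deviation. A minimal sketch, with illustrative values for the mean, standard deviation and observation:

```r
mu    <- 72      # illustrative mean
sigma <- 15.2    # illustrative standard deviation
x     <- 84      # illustrative observation

z <- (x - mu) / sigma     # standardised value on the N(0, 1) scale

# Probabilities agree whether computed on the original or the standard scale
p_original <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)    # defaults are mean = 0, sd = 1

all.equal(p_original, p_standard)    # TRUE
dnorm(0)                             # peak density of N(0, 1): 1 / sqrt(2 * pi)
```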
Skewed Distribution
Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The
skewness value can be positive or negative, or even undefined.
Image
https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg
Skewed Distribution
Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Kurtosis
is a descriptor of the shape of a probability distribution.
Image
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Skewed Distribution
In the R package moments, skewness returns the value of skewness and kurtosis returns the value of kurtosis.
https://cran.r-project.org/web/packages/moments/moments.pdf
Image
http://www.janzengroup.net/stats/lessons/descriptive.html
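The moment-based definitions can be written out in base R as a sketch. The helper names `central_moment`, `skew` and `kurt` are hypothetical, and the data are illustrative. The optional cross-check at the end runs only if the moments package is installed.

```r
# Moment-based definitions written out in base R
central_moment <- function(x, k) mean((x - mean(x))^k)

skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2   # 3 for a normal

sym   <- c(1, 2, 3, 4, 5)        # symmetric sample
right <- c(1, 1, 2, 2, 3, 10)    # long right tail

skew(sym)      # 0
skew(right)    # positive
kurt(sym)      # 1.7

# Optional cross-check against the moments package, if it is installed
if (requireNamespace("moments", quietly = TRUE)) {
  print(all.equal(skew(right), moments::skewness(right)))
  print(all.equal(kurt(right), moments::kurtosis(right)))
}
```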
Distributions
Bernoulli
The distribution of a random variable which takes value 1 with success probability p and value 0 with failure probability q = 1 - p. It
can be used, for example, to represent the toss of a coin.
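A coin toss can be sketched in R by treating the Bernoulli distribution as a binomial with a single trial. The seed and the number of tosses are arbitrary choices.

```r
p <- 0.5                          # success probability; 0.5 models a fair coin
q <- 1 - p                        # failure probability

# A Bernoulli variable is a binomial with a single trial (size = 1)
dbinom(1, size = 1, prob = p)     # P(X = 1) = p
dbinom(0, size = 1, prob = p)     # P(X = 0) = q

set.seed(42)                      # arbitrary seed, for reproducibility only
tosses <- rbinom(1000, size = 1, prob = p)   # 1000 simulated coin tosses
mean(tosses)                      # close to p
p * q                             # theoretical variance, 0.25
```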
Distributions
Chi Square
The distribution of a sum of the squares of k independent standard normal random variables, where k is the number of degrees of freedom.
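This definition can be checked by simulation. The sketch below assumes k = 7 and 10,000 simulated sums; both numbers and the seed are arbitrary.

```r
set.seed(1)                  # arbitrary seed, for reproducibility only
k     <- 7                   # degrees of freedom (illustrative)
n_sim <- 10000               # number of simulated sums

# Each row holds k independent standard normal draws; square and sum them
z      <- matrix(rnorm(n_sim * k), nrow = n_sim, ncol = k)
sum_sq <- rowSums(z^2)

mean(sum_sq)                 # close to k, the mean of a chi-squared with k df
var(sum_sq)                  # close to 2 * k

# Share of simulated sums below the theoretical 95th percentile
share_below <- mean(sum_sq <= qchisq(0.95, df = k))
share_below                  # close to 0.95
```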
Distributions
Poisson
A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time
and/or space if these events occur with a known average rate and independently of the time since the last event.
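A brief R sketch of Poisson probabilities, assuming an average rate of 3 events per interval (an illustrative value):

```r
lambda <- 3                      # assumed average number of events per interval

dpois(0, lambda)                 # P(no events) = exp(-lambda)
dpois(2, lambda)                 # P(exactly 2 events)
ppois(2, lambda)                 # P(at most 2 events)
1 - ppois(4, lambda)             # P(more than 4 events)

# The probability mass function written out from its formula
k      <- 2
manual <- lambda^k * exp(-lambda) / factorial(k)
all.equal(manual, dpois(k, lambda))    # TRUE
```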
Probability
Probability Distribution
The figure shows the probability density function (pdf) of the normal distribution, also called the Gaussian or "bell curve", the most important
continuous probability distribution. As notated on the figure, the probabilities of intervals of values correspond to the area
under the curve.
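The link between probability and area can be illustrated numerically. This sketch integrates the standard normal density over the interval from -1 to 1, chosen only as an example, and compares the result with the cumulative distribution function.

```r
# Area under the standard normal density between -1 and 1
area <- integrate(dnorm, lower = -1, upper = 1)$value

# The same probability from the cumulative distribution function
prob <- pnorm(1) - pnorm(-1)

area    # about 0.6827
prob    # about 0.6827

# The total area under the density is 1
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value
total
```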
Refresher in Statistics
Using RCmdr for Statistics
Central Limit Theorem
In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently
large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will
be approximately normally distributed, regardless of the underlying distribution.
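A simulation sketch of the theorem, assuming samples of size 50 drawn from an exponential distribution. The distribution, sample size, number of repetitions and seed are all illustrative choices.

```r
set.seed(123)                 # arbitrary seed, for reproducibility only
n      <- 50                  # observations per sample
n_reps <- 5000                # number of samples

# Exponential(rate = 1) is strongly right-skewed, with mean 1 and variance 1
sample_means <- replicate(n_reps, mean(rexp(n, rate = 1)))

mean(sample_means)            # close to the population mean, 1
sd(sample_means)              # close to 1 / sqrt(n), about 0.141

# About 95% of sample means should fall within 1.96 standard errors of 1
se       <- 1 / sqrt(n)
coverage <- mean(abs(sample_means - 1) <= 1.96 * se)
coverage

# hist(sample_means) would show a roughly bell-shaped histogram
```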
Hypothesis testing
Hypothesis testing is the use of statistics to assess whether the observed data are consistent with a given hypothesis. The
usual process of hypothesis testing consists of four steps.
1. Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the
alternative hypothesis (commonly, that the observations show a real effect combined with a component of
chance variation).
2. Identify a test statistic that can be used to assess the truth of the null hypothesis.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed
would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the
evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance value α (sometimes called an alpha value). If P ≤ α, the
observed effect is statistically significant, and the null hypothesis is rejected in favor of the alternative
hypothesis.
http://mathworld.wolfram.com/HypothesisTesting.html
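The four steps can be sketched in R with a one-sample t-test worked by hand. The measurements, the hypothesised mean of 50 and the 0.05 significance level are illustrative assumptions.

```r
# Illustrative data: 8 measurements; is the population mean different from 50?
x     <- c(51.2, 49.8, 52.5, 50.9, 53.1, 48.7, 52.0, 51.6)
mu0   <- 50        # step 1: H0: mean = 50, H1: mean != 50
alpha <- 0.05      # significance level, chosen in advance

# Step 2: test statistic for a one-sample t-test
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 3: two-sided P-value from the t distribution with n - 1 df
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Step 4: compare the P-value with alpha
reject_h0 <- p_value <= alpha

# Cross-check against R's built-in test
res <- t.test(x, mu = mu0)
all.equal(unname(res$statistic), t_stat)   # TRUE
all.equal(res$p.value, p_value)            # TRUE
```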
Hypothesis testing
http://cmapskm.ihmc.us/rid=1052458963987_678930513_8647/Hypothesis%20testing.cmap
T test
http://statistics.berkeley.edu/computing/r-t-tests
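> x = rnorm(10)          # 10 draws from a standard normal
> y = rnorm(10)          # a second, independent sample
> t.test(x, y)           # Welch two-sample t-test (R's default)
> ttest = t.test(x, y)   # store the result object
> names(ttest)           # list the components of the result
> ttest$statistic        # extract the t statistic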
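As a sketch of what `t.test(x, y)` computes by default, the Welch statistic can be written out by hand and compared with the stored result. The seed is an arbitrary choice to make the random draws repeatable.

```r
set.seed(2024)               # arbitrary seed so the random draws are repeatable
x <- rnorm(10)
y <- rnorm(10)
ttest <- t.test(x, y)        # Welch test: variances are not assumed equal

# Welch statistic written out: difference in means over its standard error
se     <- sqrt(var(x) / length(x) + var(y) / length(y))
t_hand <- (mean(x) - mean(y)) / se

all.equal(unname(ttest$statistic), t_hand)   # TRUE
ttest$p.value                # two-sided P-value
ttest$conf.int               # 95% confidence interval for the mean difference
ttest$parameter              # Welch-Satterthwaite degrees of freedom
```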
Chi Square Distribution
Problem
Find the 95th percentile of the Chi-Squared distribution with 7 degrees of freedom.
Solution
We apply the quantile function qchisq of the Chi-Squared distribution against the decimal value 0.95.
> qchisq(.95, df=7) # 7 degrees of freedom
[1] 14.067
http://www.r-tutor.com/elementary-statistics/probability-distributions/chi-squared-distribution
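The quantile function `qchisq` is the inverse of the cumulative distribution function `pchisq`. A short sketch of that relationship; the extra degrees of freedom in the last line are illustrative.

```r
q95 <- qchisq(0.95, df = 7)               # 95th percentile, about 14.067

pchisq(q95, df = 7)                       # cumulative probability below it: 0.95
pchisq(q95, df = 7, lower.tail = FALSE)   # upper-tail probability: 0.05

# Critical values grow with the degrees of freedom
crit <- qchisq(0.95, df = c(1, 7, 20))
crit
```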
Normal Distribution
We are looking for the percentage of students scoring
higher than 84, so we apply the function pnorm of the normal
distribution with mean 72 and standard deviation 15.2. We
are interested in the upper tail of the normal distribution.
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
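The same upper-tail probability can be reached through the complement of the lower tail, and `qnorm` recovers the score from the probability. A minimal sketch using the numbers from the example:

```r
upper <- pnorm(84, mean = 72, sd = 15.2, lower.tail = FALSE)   # about 0.2149

# Same value via the complement of the lower tail
lower <- pnorm(84, mean = 72, sd = 15.2)
all.equal(upper, 1 - lower)          # TRUE

# qnorm inverts pnorm: recover the score from the upper-tail probability
score <- qnorm(upper, mean = 72, sd = 15.2, lower.tail = FALSE)
score                                # 84

round(100 * upper, 1)                # 21.5 (percent)
```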
Student T Distribution
Problem
Find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom.
Solution
We apply the quantile function qt of the Student t distribution against the decimal values 0.025 and 0.975.
> qt(c(.025, .975), df=5) # 5 degrees of freedom
[1] -2.5706 2.5706
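A short sketch of two properties behind this result: the t distribution is symmetric about zero, and its quantiles approach those of the standard normal as the degrees of freedom grow. The larger df values are illustrative.

```r
q <- qt(c(0.025, 0.975), df = 5)     # about -2.5706 and 2.5706

# Symmetry about zero: the two quantiles differ only in sign
all.equal(q[1], -q[2])               # TRUE

# The interval between them holds the central 95% of the distribution
central <- pt(q[2], df = 5) - pt(q[1], df = 5)
central                              # 0.95

# As df grows, t quantiles approach the standard normal value
t_crit <- qt(0.975, df = c(5, 30, 1000))
t_crit                               # decreasing toward qnorm(0.975)
qnorm(0.975)                         # about 1.96
```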
Some code
http://rpubs.com/newajay/stats1
Some code
http://rpubs.com/newajay/stats4
Bayes Theorem
https://artax.karlin.mff.cuni.cz/r-help/library/LaplacesDemon/html/BayesTheorem.html
Bayes Theorem
https://en.wikipedia.org/wiki/Bayes'_theorem
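A minimal worked example of Bayes' theorem, P(A | B) = P(B | A) P(A) / P(B), in R. All numbers below are illustrative assumptions (a diagnostic test scenario), not data from the slides.

```r
# All numbers below are illustrative assumptions
prior       <- 0.01     # P(A): prevalence of a condition
sensitivity <- 0.95     # P(B | A): positive test given the condition
false_pos   <- 0.05     # P(B | not A): positive test without the condition

# Law of total probability: P(B)
p_positive <- sensitivity * prior + false_pos * (1 - prior)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
posterior <- sensitivity * prior / p_positive

posterior               # about 0.161
```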

Weitere ähnliche Inhalte

Was ist angesagt?

3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysismlong24
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reductionKrish_ver2
 
3 pillars of big data : structured data, semi structured data and unstructure...
3 pillars of big data : structured data, semi structured data and unstructure...3 pillars of big data : structured data, semi structured data and unstructure...
3 pillars of big data : structured data, semi structured data and unstructure...PROWEBSCRAPER
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)Learnbay Datascience
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysisGramener
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science ProcessVishal Patel
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceSrishti44
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data MiningDHIVYADEVAKI
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.Megha Sharma
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics toolsNascenia IT
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualizationDr. Hamdan Al-Sabri
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with RDr Nisha Arora
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis Peter Reimann
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With REdureka!
 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Mustafa Sherazi
 

Was ist angesagt? (20)

Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
3 pillars of big data : structured data, semi structured data and unstructure...
3 pillars of big data : structured data, semi structured data and unstructure...3 pillars of big data : structured data, semi structured data and unstructure...
3 pillars of big data : structured data, semi structured data and unstructure...
 
PCA (Principal component analysis)
PCA (Principal component analysis)PCA (Principal component analysis)
PCA (Principal component analysis)
 
Exploratory data analysis
Exploratory data analysisExploratory data analysis
Exploratory data analysis
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
The Data Science Process
The Data Science ProcessThe Data Science Process
The Data Science Process
 
Pca
PcaPca
Pca
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data preprocessing in Data Mining
Data preprocessing in Data MiningData preprocessing in Data Mining
Data preprocessing in Data Mining
 
Classification Algorithm.
Classification Algorithm.Classification Algorithm.
Classification Algorithm.
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Introduction to basic data analytics tools
Introduction to basic data analytics toolsIntroduction to basic data analytics tools
Introduction to basic data analytics tools
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
4 Descriptive Statistics with R
4 Descriptive Statistics with R4 Descriptive Statistics with R
4 Descriptive Statistics with R
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Exploratory data analysis
Exploratory data analysis Exploratory data analysis
Exploratory data analysis
 
Linear Regression With R
Linear Regression With RLinear Regression With R
Linear Regression With R
 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)
 

Ähnlich wie Statistics for data scientists

B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2marshalkalra
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docxscet315
 
Review of Basic Statistics and Terminology
Review of Basic Statistics and TerminologyReview of Basic Statistics and Terminology
Review of Basic Statistics and Terminologyaswhite
 
UNIT1-2.pptx
UNIT1-2.pptxUNIT1-2.pptx
UNIT1-2.pptxcsecem
 
Data What Type Of Data Do You Have V2.1
Data   What Type Of Data Do You Have V2.1Data   What Type Of Data Do You Have V2.1
Data What Type Of Data Do You Have V2.1TimKasse
 
Data Mining StepsProblem Definition Market AnalysisC
Data Mining StepsProblem Definition Market AnalysisCData Mining StepsProblem Definition Market AnalysisC
Data Mining StepsProblem Definition Market AnalysisCsharondabriggs
 
Research methodology-Research Report
Research methodology-Research ReportResearch methodology-Research Report
Research methodology-Research ReportDrMAlagupriyasafiq
 
Research Methodology-Data Processing
Research Methodology-Data ProcessingResearch Methodology-Data Processing
Research Methodology-Data ProcessingDrMAlagupriyasafiq
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONNandakumar P
 
MMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptxMMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptxPETTIROSETALISIC
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methodssonangrai
 

Ähnlich wie Statistics for data scientists (20)

Data science
Data scienceData science
Data science
 
Data Science 1.pdf
Data Science 1.pdfData Science 1.pdf
Data Science 1.pdf
 
B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2B409 W11 Sas Collaborative Stats Guide V4.2
B409 W11 Sas Collaborative Stats Guide V4.2
 
UNIT-4.docx
UNIT-4.docxUNIT-4.docx
UNIT-4.docx
 
Review of Basic Statistics and Terminology
Review of Basic Statistics and TerminologyReview of Basic Statistics and Terminology
Review of Basic Statistics and Terminology
 
UNIT1-2.pptx
UNIT1-2.pptxUNIT1-2.pptx
UNIT1-2.pptx
 
Data What Type Of Data Do You Have V2.1
Data   What Type Of Data Do You Have V2.1Data   What Type Of Data Do You Have V2.1
Data What Type Of Data Do You Have V2.1
 
Statistics
StatisticsStatistics
Statistics
 
Statistics
StatisticsStatistics
Statistics
 
Statistics
StatisticsStatistics
Statistics
 
Statistics
StatisticsStatistics
Statistics
 
Data Mining StepsProblem Definition Market AnalysisC
Data Mining StepsProblem Definition Market AnalysisCData Mining StepsProblem Definition Market AnalysisC
Data Mining StepsProblem Definition Market AnalysisC
 
Research methodology-Research Report
Research methodology-Research ReportResearch methodology-Research Report
Research methodology-Research Report
 
Research Methodology-Data Processing
Research Methodology-Data ProcessingResearch Methodology-Data Processing
Research Methodology-Data Processing
 
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHONUNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
UNIT - 5 : 20ACS04 – PROBLEM SOLVING AND PROGRAMMING USING PYTHON
 
Descriptive Analysis.pptx
Descriptive Analysis.pptxDescriptive Analysis.pptx
Descriptive Analysis.pptx
 
MMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptxMMW (Data Management)-Part 1 for ULO 2 (1).pptx
MMW (Data Management)-Part 1 for ULO 2 (1).pptx
 
Presentation of BRM.pptx
Presentation of BRM.pptxPresentation of BRM.pptx
Presentation of BRM.pptx
 
Descriptive Analytics: Data Reduction
 Descriptive Analytics: Data Reduction Descriptive Analytics: Data Reduction
Descriptive Analytics: Data Reduction
 
Data mining approaches and methods
Data mining approaches and methodsData mining approaches and methods
Data mining approaches and methods
 

Mehr von Ajay Ohri

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay OhriAjay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to RAjay Ohri
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionAjay Ohri
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for freeAjay Ohri
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10Ajay Ohri
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri ResumeAjay Ohri
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...Ajay Ohri
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessAjay Ohri
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data ScientistsAjay Ohri
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in PythonAjay Ohri
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen OomsAjay Ohri
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsAjay Ohri
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha Ajay Ohri
 
Analyze this
Analyze thisAnalyze this
Analyze thisAjay Ohri
 
Summer school python in spanish
Summer school python in spanishSummer school python in spanish
Summer school python in spanishAjay Ohri
 

Mehr von Ajay Ohri (20)

Introduction to R ajay Ohri
Introduction to R ajay OhriIntroduction to R ajay Ohri
Introduction to R ajay Ohri
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Social Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 ElectionSocial Media and Fake News in the 2016 Election
Social Media and Fake News in the 2016 Election
 
Pyspark
PysparkPyspark
Pyspark
 
Download Python for R Users pdf for free
Download Python for R Users pdf for freeDownload Python for R Users pdf for free
Download Python for R Users pdf for free
 
Install spark on_windows10
Install spark on_windows10Install spark on_windows10
Install spark on_windows10
 
Ajay ohri Resume
Ajay ohri ResumeAjay ohri Resume
Ajay ohri Resume
 
National seminar on emergence of internet of things (io t) trends and challe...
National seminar on emergence of internet of things (io t)  trends and challe...National seminar on emergence of internet of things (io t)  trends and challe...
National seminar on emergence of internet of things (io t) trends and challe...
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
How Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help businessHow Big Data ,Cloud Computing ,Data Science can help business
How Big Data ,Cloud Computing ,Data Science can help business
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data Science
 
Tradecraft
Tradecraft   Tradecraft
Tradecraft
 
Software Testing for Data Scientists
Software Testing for Data ScientistsSoftware Testing for Data Scientists
Software Testing for Data Scientists
 
Craps
CrapsCraps
Craps
 
A Data Science Tutorial in Python
A Data Science Tutorial in PythonA Data Science Tutorial in Python
A Data Science Tutorial in Python
 
How does cryptography work? by Jeroen Ooms
How does cryptography work?  by Jeroen OomsHow does cryptography work?  by Jeroen Ooms
How does cryptography work? by Jeroen Ooms
 
Using R for Social Media and Sports Analytics
Using R for Social Media and Sports AnalyticsUsing R for Social Media and Sports Analytics
Using R for Social Media and Sports Analytics
 
Kush stats alpha
Kush stats alpha Kush stats alpha
Kush stats alpha
 
Analyze this
Analyze thisAnalyze this
Analyze this
 
Summer school python in spanish
Summer school python in spanishSummer school python in spanish
Summer school python in spanish
 

Kürzlich hochgeladen

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 

Kürzlich hochgeladen (20)

How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree

Statistics for data scientists

  • 2. Agenda: Revision; Data; Statistics (Descriptive, Central Tendency, Variation, Distributions); Data Mining
  • 3. Basics of Data Science (http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). Notes from the data science Venn diagram: The culture of academia does not reward researchers for understanding technology. Danger zone: this overlap of skills gives people the ability to create what appears to be a legitimate analysis without any understanding of how they got there or what they have created. Being able to manipulate text files at the command-line, understanding vectorized operations, and thinking algorithmically are the hacking skills that make for a successful data hacker. Data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science.
  • 4. What is Business Analytics? Definition: the study of business data using statistical techniques and programming to create decision support and insights for achieving business goals. Predictive: to predict the future. Descriptive: to describe the past.
  • 5. Data. Data is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist's handwritten notes about her interviews. Data is collected by a huge range of organizations and institutions, including businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations). Data is measured, collected, reported, and analyzed, whereupon it can be visualized using graphs, images or other analysis tools (https://en.wikipedia.org/wiki/Data). Data is distinct pieces of information, usually formatted in a special way. All software is divided into two general categories: data and programs. Programs are collections of instructions for manipulating data. Data can exist in a variety of forms: as numbers or text on pieces of paper, as bits and bytes stored in electronic memory, or as facts stored in a person's mind (http://www.webopedia.com/TERM/D/data.html).
  • 6. Data (https://en.oxforddictionaries.com/definition/data). Definition of data in English: data, noun [mass noun]. Facts and statistics collected together for reference or analysis: ‘there is very little data available’. The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. Philosophy: things known or assumed as facts, making the basis of reasoning or calculation.
  • 8. Variable. Nominal variables have two or more categories with no intrinsic order. Dichotomous variables are nominal variables with only two categories or levels. Ordinal variables have two or more categories, like nominal variables, but the categories can also be ordered or ranked (e.g., from Excellent to Horrible). Interval variables can be measured along a continuum and have a numerical value (for example, temperature measured in degrees Celsius or Fahrenheit). Ratio variables are interval variables with the added condition that 0 (zero) of the measurement indicates that there is none of that variable; for example, a distance of ten metres is twice the distance of five metres. Source: https://statistics.laerd.com/statistical-guides/types-of-variable.php
  • 9. Central Tendency. Mean: the arithmetic mean is the sum of the values divided by the number of values. The geometric mean is an average that is useful for sets of positive numbers that are interpreted according to their product rather than their sum (as is the case with the arithmetic mean), e.g. rates of growth. Median: the number separating the higher half of a data sample, a population, or a probability distribution from the lower half. Mode: the value that occurs most often.
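The measures on slide 9 can be sketched in R as follows. The growth factors are made-up illustration data, and `stat_mode` is a hypothetical helper written for this sketch, since base R has no built-in function for the statistical mode (the built-in `mode()` reports an object's storage type).

```r
# Hypothetical yearly growth factors (assumed data, for illustration only)
x <- c(1.10, 1.20, 1.05, 1.20, 1.15)

# Arithmetic mean: sum of the values divided by the number of values
arith_mean <- mean(x)

# Geometric mean: n-th root of the product, computed on the log scale;
# only meaningful for strictly positive values
geo_mean <- exp(mean(log(x)))

# Median: middle value of the sorted sample
med <- median(x)

# Mode: most frequent value (hypothetical helper, not part of base R)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
mode_val <- stat_mode(x)

print(c(arithmetic = arith_mean, geometric = geo_mean,
        median = med, mode = mode_val))
```

For growth rates the geometric mean is the appropriate average, and it never exceeds the arithmetic mean of the same positive values.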
  • 10. Dispersion. Range: the difference between the largest and smallest values in a set of data. Variance: the mean of the squared differences of the values from their mean. Standard deviation: the square root of the variance. Frequency: a frequency distribution is a table that displays the frequency of various outcomes in a sample.
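A minimal R sketch of the dispersion measures on slide 10, using a made-up sample. One point worth keeping in mind: the slide defines variance as the mean of squared deviations (dividing by n), whereas R's `var()` and `sd()` compute the sample versions that divide by n - 1.

```r
# Hypothetical sample (assumed data, for illustration only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range: difference between the largest and smallest values
rng <- max(x) - min(x)

# Variance as defined on the slide: mean of squared deviations (divides by n)
pop_var <- mean((x - mean(x))^2)
pop_sd  <- sqrt(pop_var)

# R's built-ins use the sample definition (divides by n - 1)
samp_var <- var(x)
samp_sd  <- sd(x)

# Frequency distribution: how often each value occurs
freq <- table(x)

print(c(range = rng, pop_var = pop_var, pop_sd = pop_sd,
        samp_var = samp_var, samp_sd = samp_sd))
print(freq)
```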
  • 11. Distribution. The distribution of a statistical data set (or a population) is a listing or function showing all the possible values (or intervals) of the data and how often they occur. When a distribution of categorical data is organized, you see the number or percentage of individuals in each group. Source: http://www.dummies.com/education/math/statistics/what-the-distribution-tells-you-about-a-statistical-data-set/
  • 12. Distributions: Normal. The simplest case of a normal distribution is the standard normal distribution, the special case where μ=0 and σ=1.
  • 14. Skewed Distribution. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. Image: https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg
  • 15. Skewed Distribution. Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable; it is a descriptor of the shape of a probability distribution. Image: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
  • 16. Skewed Distribution. In the R package moments, skewness() returns the skewness of a sample and kurtosis() returns its kurtosis (https://cran.r-project.org/web/packages/moments/moments.pdf). Image: http://www.janzengroup.net/stats/lessons/descriptive.html
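To make the two shape measures concrete without installing the moments package, here is a base R sketch of the standard moment-based definitions: skewness as the third central moment divided by the 1.5 power of the second, and kurtosis as the fourth central moment divided by the square of the second. The helper `central_moment` and the sample are assumptions made for illustration.

```r
# Hypothetical right-skewed sample (assumed data, for illustration only)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 10)

# k-th central moment: mean of the k-th power of deviations from the mean
central_moment <- function(v, k) mean((v - mean(v))^k)

# Moment-based skewness: positive when the right tail is longer
skew <- central_moment(x, 3) / central_moment(x, 2)^(3 / 2)

# Moment-based kurtosis (not excess kurtosis): about 3 for normal data
kurt <- central_moment(x, 4) / central_moment(x, 2)^2

print(c(skewness = skew, kurtosis = kurt))
```

The single large value 10 pulls the right tail out, so the skewness of this sample is positive; a perfectly symmetric sample such as 1, 2, 3, 4, 5 has skewness 0.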
  • 17. Distributions: Bernoulli. The distribution of a random variable that takes the value 1 with success probability p and the value 0 with failure probability q = 1 − p. It can be used, for example, to represent the toss of a coin.
  • 18. Distributions: Chi Square. The chi-squared distribution with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
  • 19. Distributions: Poisson. A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space, if these events occur with a known average rate and independently of the time since the last event.
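As a small illustration of the Poisson definition, the R sketch below evaluates the probability of seeing exactly k events, and at most k events, for an assumed average rate. The values of `lambda` and `k` are hypothetical; `dpois` and `ppois` are the Poisson mass and cumulative distribution functions in base R.

```r
lambda <- 3   # assumed average number of events per interval
k <- 2        # number of events of interest

# Probability of exactly k events, from the built-in mass function
p_exact <- dpois(k, lambda)

# The same value from the formula exp(-lambda) * lambda^k / k!
p_formula <- exp(-lambda) * lambda^k / factorial(k)

# Probability of at most k events (cumulative distribution function)
p_at_most <- ppois(k, lambda)

print(c(exact = p_exact, formula = p_formula, at_most = p_at_most))
```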
  • 20. Probability: Probability Distribution. Figure: the probability density function (pdf) of the normal distribution, also called Gaussian or "bell curve", the most important continuous random distribution. As notated on the figure, the probabilities of intervals of values correspond to the area under the curve.
  • 22. Using RCmdr for Statistics
  • 23. Using RCmdr for Statistics
  • 24. Using RCmdr for Statistics
  • 26. Central Limit Theorem. In probability theory, the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the underlying distribution.
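The theorem can be checked informally by simulation. The R sketch below draws many samples from a strongly skewed exponential distribution and looks at the distribution of their means; the sample size, number of repetitions, and seed are arbitrary choices made for this illustration.

```r
set.seed(42)   # arbitrary seed so the simulation is repeatable

n    <- 50     # size of each sample (assumed)
reps <- 5000   # number of samples drawn (assumed)

# Each replicate: draw n exponential(rate = 1) values and take their mean.
# The exponential distribution has mean 1 and standard deviation 1.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The CLT suggests the means are roughly normal with
# mean 1 and standard deviation 1 / sqrt(n)
print(c(mean_of_means = mean(sample_means),
        sd_of_means   = sd(sample_means),
        clt_sd        = 1 / sqrt(n)))

# Uncomment to see the bell shape emerge from a skewed source distribution:
# hist(sample_means, breaks = 40)
```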
  • 27. Hypothesis testing. Hypothesis testing is the use of statistics to determine the probability that a given hypothesis is true. The usual process of hypothesis testing consists of four steps. (1) Formulate the null hypothesis (commonly, that the observations are the result of pure chance) and the alternative hypothesis (commonly, that the observations show a real effect combined with a component of chance variation). (2) Identify a test statistic that can be used to assess the truth of the null hypothesis. (3) Compute the P-value, which is the probability that a test statistic at least as significant as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis. (4) Compare the P-value to an acceptable significance value α (sometimes called an alpha value). If P ≤ α, the observed effect is statistically significant, the null hypothesis is ruled out, and the alternative hypothesis is valid. Source: http://mathworld.wolfram.com/HypothesisTesting.html
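The four steps above can be sketched in R with a one-sample t-test computed by hand. The data, the null value `mu0`, and the significance level `alpha` are all assumptions chosen for illustration; the choice of a two-sided t-test is likewise just one possible test statistic.

```r
set.seed(1)                          # arbitrary seed for repeatability
x <- rnorm(30, mean = 0.5, sd = 1)   # hypothetical sample of 30 observations

# Step 1: null hypothesis H0: true mean = mu0; alternative: true mean != mu0
mu0 <- 0

# Step 2: test statistic, here the one-sample t statistic
n      <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Step 3: two-sided P-value under H0, from the t distribution with n - 1 df
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Step 4: compare the P-value with the chosen significance level
alpha     <- 0.05
reject_h0 <- p_value <= alpha

print(c(t = t_stat, p = p_value))
print(reject_h0)
```

The built-in call `t.test(x, mu = mu0)` reports the same statistic and P-value, which is a convenient way to check the manual calculation.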
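  • 33. T test (http://statistics.berkeley.edu/computing/r-t-tests)
      > x = rnorm(10)          # first sample: 10 draws from a standard normal
      > y = rnorm(10)          # second sample: 10 more draws
      > t.test(x, y)           # two-sample t-test (Welch version by default)
      > ttest = t.test(x, y)   # store the result object
      > names(ttest)           # list the components of the result
      > ttest$statistic        # extract the t statistic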
  • 34. Chi Square Distribution. Problem: find the 95th percentile of the Chi-Squared distribution with 7 degrees of freedom. Solution: we apply the quantile function qchisq of the Chi-Squared distribution to the decimal value 0.95.
      > qchisq(.95, df=7)   # 7 degrees of freedom
      [1] 14.067
    Source: http://www.r-tutor.com/elementary-statistics/probability-distributions/chi-squared-distribution
  • 35. Normal Distribution. To find the percentage of students scoring higher than 84, we apply the function pnorm of the normal distribution with mean 72 and standard deviation 15.2. Because we are interested in the upper tail of the normal distribution, we set lower.tail=FALSE. The result is about 21.5%.
      > pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
      [1] 0.21492
  • 36. Student T Distribution. Problem: find the 2.5th and 97.5th percentiles of the Student t distribution with 5 degrees of freedom. Solution: we apply the quantile function qt of the Student t distribution to the decimal values 0.025 and 0.975.
      > qt(c(.025, .975), df=5)   # 5 degrees of freedom
      [1] -2.5706 2.5706