Statistical Analysis with R - III
2. PART 3
3. Probability distributions
3.1. Normal distribution
3.2. Chi-square distribution
3.3. Student’s t-distribution
3.4. Summary of applications of different distributions
3.5. Central Limit Theorem
© akhila prabhakaran
3. Probability Distributions
Recap
When the value of a variable is the outcome of a statistical experiment, that variable is
a random variable.
Sample Space = set of all possible outcomes of an experiment.
Event = subset of the Sample Space. (example coin toss)
S = sample space = {e1, e2, e3, ..., en} (all outcomes of the experiment)
Probability Distribution = {p1 = P(e1), p2 = P(e2), ..., pn = P(en)}
4. Population vs Sample
A population is a group of phenomena that have something in common. The term
often refers to a group of people, as in the following examples:
All registered voters in Bangalore
All members of the IEEE
All cricketers who played at least one league match in the past year
Populations can refer to things as well as people:
All sensors installed in a high security location.
All daily maximum temperatures in July for major Indian cities
All basal ganglia cells from a particular rhesus monkey
5. Sample vs Population
A sample is a smaller group of members of a population selected to represent the population.
PARAMETER => Population characteristic, such as the population mean.
STATISTIC => Sample characteristic
6. Probability Distribution
Experiment: Flip a coin two times.
All possible outcomes: HH, HT, TH, and TT.
Random variable X : Number of Heads that result from this experiment.
All possible values of X : 0, 1, or 2.
A probability distribution is a table or an equation that links each outcome of a statistical experiment
with its probability of occurrence.
Number of Heads (X)    Probability P(X = x)
0                      0.25
1                      0.50
2                      0.25
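The same probabilities can be reproduced in R. This is a minimal sketch that assumes a fair coin, so the number of heads in two flips is binomial with size = 2 and prob = 0.5; the variable names are illustrative.

```r
# Possible values of X (number of heads in two flips of a fair coin)
x <- 0:2
# P(X = x) from the binomial probability mass function
p <- dbinom(x, size = 2, prob = 0.5)
coin_table <- data.frame(heads = x, probability = p)
print(coin_table)
```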
7. Cumulative Probability Distribution
Refers to the probability that the value of a random variable falls within a specified range.
Experiment: Flip a coin two times.
All possible outcomes: HH, HT, TH, and TT.
What is the probability that the coin flips would result in one or fewer heads?
P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.25 + 0.50 = 0.75
Number of Heads (x)    Probability P(X = x)    Cumulative Probability P(X <= x)
0                      0.25                    0.25
1                      0.50                    0.75
2                      0.25                    1
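A cumulative distribution is a running total of the individual probabilities. The sketch below recomputes the cumulative column with cumsum() and cross-checks the "one or fewer heads" probability with pbinom(), assuming a fair coin; the variable names are illustrative.

```r
# P(X = 0), P(X = 1), P(X = 2) for two flips of a fair coin
p <- c(0.25, 0.50, 0.25)
# Running totals give P(X <= x)
cum_p <- cumsum(p)
print(cum_p)
# Same probability of one or fewer heads from the binomial CDF
p_le_1 <- pbinom(1, size = 2, prob = 0.5)
print(p_le_1)
```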
8. UNIFORM Distribution
All values of a random variable occur with equal probability.
Suppose the random variable X can assume k different values, x1, x2, ..., xk.
Suppose also that P(X = xi) is the same for every value xi.
Then P(X = xi) = 1/k.
Example: Suppose a die is tossed. What is the probability that the die will land on 5?
There are 6 possible outcomes, represented by S = {1, 2, 3, 4, 5, 6}.
The random variable X is the number shown on the die, and each outcome is equally likely to occur, so
P(X = 5) = 1/6.
What is the probability that the die will land on a number that is smaller than 5?
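One way to check both die probabilities is to write out the discrete uniform distribution explicitly. This is a small sketch assuming a fair six-sided die; the variable names are illustrative.

```r
faces <- 1:6
# Discrete uniform: every face has probability 1/k with k = 6
p <- rep(1 / length(faces), length(faces))
p_equal_5 <- p[faces == 5]
# "Smaller than 5" covers the faces 1, 2, 3 and 4
p_less_5 <- sum(p[faces < 5])
print(c(p_equal_5 = p_equal_5, p_less_5 = p_less_5))
```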
9. Probability Distributions: Discrete or Continuous
Depends on whether it is associated with Discrete variables or Continuous variables
Discrete data
When the values in the batch are whole numbers (counts), the data set is called discrete.
Examples of discrete measurements are:
Continuous data
When the data are not constrained to be whole numbers, the data set is called continuous.
Examples are:
the maximum temperatures each day in January in your local city,
10. Discrete Probability Distributions
If a random variable is a discrete variable, its probability distribution is called a discrete probability
distribution.
Examples: the earlier experiments of flipping a coin and rolling a die.
Binomial probability distribution
A binomial experiment is a statistical experiment that consists of n repeated trials. Each trial can
result in just two possible outcomes (success or failure). The probability of success, denoted by p,
is the same on every trial. The trials are independent; that is, the outcome on one trial does not
affect the outcome on other trials.
A binomial random variable is the number of successes x in n repeated trials of a binomial
experiment.
The probability distribution of a binomial random variable is called a binomial distribution.
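For a binomial random variable, the probability of exactly x successes in n trials is choose(n, x) * p^x * (1 - p)^(n - x). The sketch below codes that formula in a hypothetical helper, binom_pmf, and compares it with R's built-in dbinom(). The values n = 5 and p = 0.3 are arbitrary illustrations, not taken from the slides.

```r
# Hypothetical helper: binomial probability mass function written out by hand
binom_pmf <- function(x, n, p) {
  choose(n, x) * p^x * (1 - p)^(n - x)
}

n <- 5    # number of trials (illustrative)
p <- 0.3  # probability of success on each trial (illustrative)
by_hand <- binom_pmf(0:n, n, p)
built_in <- dbinom(0:n, size = n, prob = p)
print(rbind(by_hand, built_in))
```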
14. Applications of Binomial distribution
The binomial distribution is used in modeling driver behavior and intersection turning movements, and in
speed studies.
For example, if the probability of a vehicle turning left at an intersection is 0.15, then the
probability that 3 out of 10 vehicles turn left equals
10C3 × (0.15)^3 × (0.85)^7 = 0.130
In the above example, a specific vehicle turning left or not is a Bernoulli trial and it is assumed
that the arrivals of individual vehicles at the junction are independent events.
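The turning-vehicle calculation can be checked with dbinom(). This sketch uses only the numbers from the example (10 vehicles, probability 0.15, 3 left turns); the variable name is illustrative.

```r
# P(exactly 3 of 10 independent vehicles turn left), with p = 0.15
p_three_left <- dbinom(3, size = 10, prob = 0.15)
print(round(p_three_left, 3))
```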
15. Applications of Binomial distribution
A Biological Application of the Binomial Distribution
Suppose that 1% of the population is infected with a virus. There are no obvious symptoms that
can be used to recognise carriers, thus individuals must be selected at random and tested. A
decision is made to obtain a sample of 20 individuals.
Is this sample size adequate? Will any infected individuals be found?
If 1% of the population is infected then p = 0.01 (1% infected) and q = 0.99 (99% non-infected).
Picking an individual at random has only a 1% chance of an infection, but surely at least 1
infected person should be found in 20 individuals? In order to answer this question lateral
thinking is needed.
16. Applications of Binomial distribution
A Biological Application of the Binomial Distribution
To find the probability of finding some infected individuals (i.e. 1 or more), the easiest way is to calculate the
probability of no cases (i.e. P(0)) and then subtract it from 1.
Set the number of successes, r, to 0 and the number of trials, n, to 20. This gives the probability
of taking a sample of 20 individuals and finding no infected individuals.
P(0) = 20C0 × p^0 × q^20
P(0) = 20!/(0! × (20-0)!) × 0.01^0 × 0.99^20 = 0.82
Thus, if 1% of the population is infected, there is an 82% chance that a sample of 20 individuals
will fail to find any infections.
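The complement calculation is short in R. This sketch assumes that individuals are sampled independently, so the count of infected individuals in the sample is binomial; the variable names are illustrative.

```r
# P(no infected individuals in a sample of 20) when 1% are infected
p_none <- dbinom(0, size = 20, prob = 0.01)
# P(at least one infected individual) by subtraction
p_some <- 1 - p_none
print(round(c(none = p_none, at_least_one = p_some), 2))
```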
17. Poisson Distribution
Probability distribution that results from a Poisson experiment.
Attributes of a Poisson Experiment
• Outcomes that can be classified as successes or failures.
• Average number of successes (μ) that occurs in a specified region is known.
• Probability that a success will occur is proportional to the size of the region.
• The probability that a success will occur in an extremely small region is virtually zero.
• The specified region could take many forms. For instance, it could be a length, an
area, a volume, a period of time, etc.
20. Poisson Distribution Examples
Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists
will see fewer than four lions on the next 1-day safari?
This is a Poisson experiment in which we know the following:
μ = 5; since 5 lions are seen per safari, on average.
x = 0, 1, 2, or 3;
Find the likelihood that tourists will see fewer than 4 lions; we want the probability that they will see 0,
1, 2, or 3 lions.
e = 2.71828; since e is a constant equal to approximately 2.71828.
We need to calculate the sum of four probabilities: P(0; 5) + P(1; 5) + P(2; 5) + P(3; 5).
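The sum of the four Poisson probabilities can be computed term by term with dpois(), or in one step with the cumulative function ppois(). This is a sketch using μ = 5 from the example; the variable names are illustrative.

```r
mu <- 5  # average number of lions seen per 1-day safari
# P(0; 5), P(1; 5), P(2; 5), P(3; 5)
terms <- dpois(0:3, lambda = mu)
p_fewer_than_4 <- sum(terms)
print(p_fewer_than_4)
# The Poisson CDF gives the same value directly: P(X <= 3)
print(ppois(3, lambda = mu))
```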
22. Poisson Distribution
From past experience it is known that, on average, 3 cyclones hit the coastal
area of Andhra Pradesh and Orissa states every two years. If cyclones hitting
the coastal area are assumed to follow a Poisson distribution, what is the
probability of two cyclones crossing the coastal area of Andhra Pradesh and
Orissa in the next two years?
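A possible way to answer the cyclone question in R is shown below. The sketch assumes the question means exactly two cyclones in the two-year period, so that μ = 3 and x = 2; the variable names are illustrative.

```r
mu <- 3  # average number of cyclones per two-year period
# P(exactly 2 cyclones in the next two years)
p_two <- dpois(2, lambda = mu)
# Same value from the Poisson formula e^(-mu) * mu^x / x!
by_formula <- exp(-mu) * mu^2 / factorial(2)
print(c(dpois = p_two, formula = by_formula))
```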
23. Poisson Distribution
The most widely used application is the arrival pattern of vehicles. In this
case μ is the average number of vehicles per stated time interval.
Queueing systems use the Poisson distribution, or variations of it,
extensively to understand and optimize queueing patterns and workflow.
24. Probability Density Function
There are three basic differences between a continuous and a discrete probability distribution:
1. The probability that a continuous variable will take a specific value is equal to zero.
2. Because of this, we can never express continuous probability distribution in a tabular form.
3. Thus we require an equation or a formula to describe such a distribution. This equation
is called the probability density function.
33. Normal Distribution
Normal distributions are symmetric around their mean.
The mean, median, and mode of a normal distribution are equal.
The area under the normal curve is equal to 1.0.
Normal distributions are denser in the center and less dense in the tails.
Normal distributions are defined by two parameters, the mean (μ) and the
standard deviation (σ).
68% of the area of a normal distribution is within one standard deviation of the
mean.
Approximately 95% of the area of a normal distribution is within two standard
deviations of the mean.
35. Normal Distribution
One of the first applications of the normal distribution was to the analysis of errors of measurement
made in astronomical observations, errors that occurred because of imperfect instruments and
imperfect observers.
Galileo in the 17th century noted that these errors were symmetric and that small errors occurred more
frequently than large errors.
This led to several hypothesized distributions of errors, but it was not until the early 19th century that it
was discovered that these errors followed a normal distribution.
Independently, the mathematicians Adrain in 1808 and Gauss in 1809 developed the formula for the
normal distribution and showed that errors were fit well by this distribution.
This same distribution had been discovered by Laplace in 1778 when he derived the extremely
important central limit theorem.
Laplace showed that even if a distribution is not normally distributed, the means of repeated samples
from the distribution would be very nearly normally distributed, and that the larger the sample size, the
closer the distribution of means would be to a normal distribution.
Most statistical procedures for testing differences between means assume normal distributions. These
tests work well even if the original distribution is only roughly normal.
Quételet was the first to apply the normal distribution to human characteristics. He noted that
characteristics such as height, weight, and strength were normally distributed.
36. Normal Distribution – Area under the curve
http://onlinestatbook.com/2/calculators/normal_dist.html
> pnorm(1, mean=0, sd=1)
[1] 0.8413447
> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-4,1,length=200)
> y=dnorm(x)
> polygon(c(-4,x,1),c(0,y,0),col="gray")
Interpretation of area as a probability
This result indicates that if we draw a number at
random from the standard normal distribution, the
probability that we draw a number that is less than or
equal to 1 is 0.8413447.
37. Normal Distribution: Area under the curve
The probability that a randomly selected number from the standard normal distribution occurs
within one standard deviation of the mean.
This probability is represented by the area under the standard normal curve between x = -1
and x = 1
> pnorm(1, mean=0, sd=1)-pnorm(-1, mean=0, sd=1)
[1] 0.6826895
> x=seq(-4,4,length=200)
> y=dnorm(x)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(-1,1,length=100)
> y=dnorm(x)
> polygon(c(-1,x,1),c(0,y,0),col="gray")
38. Normal Distribution: Quantiles
Given the probability (or area under the curve) find the x value.
What is the 95th percentile of a standard normal distribution?
> qnorm(0.95,mean=0,sd=1)
[1] 1.644854
Find all quantiles of the standard normal distribution.
Display pdfs of normal distributions with mean of 50 and with
standard deviations of 10 and 5 respectively.
Display pdfs of normal distributions with mean of 50 and 70
& standard deviations of 10 and 15 respectively
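The exercises above could be approached along the following lines. This is only a sketch: it reads "all quantiles" as the three quartiles (an assumption), and the plotting range 0 to 120 and the colors are arbitrary choices.

```r
# Quartiles (25th, 50th, 75th percentiles) of the standard normal distribution
quartiles <- qnorm(c(0.25, 0.50, 0.75), mean = 0, sd = 1)
print(quartiles)

x <- seq(0, 120, length.out = 400)

# Same mean (50), standard deviations 5 and 10
plot(x, dnorm(x, mean = 50, sd = 5), type = "l", lwd = 2, col = "blue",
     ylab = "density", main = "Mean 50, sd 5 (blue) and sd 10 (red)")
lines(x, dnorm(x, mean = 50, sd = 10), lwd = 2, col = "red")

# Means 50 and 70, standard deviations 10 and 15
plot(x, dnorm(x, mean = 50, sd = 10), type = "l", lwd = 2, col = "blue",
     ylab = "density", main = "N(50, sd 10) (blue) and N(70, sd 15) (red)")
lines(x, dnorm(x, mean = 70, sd = 15), lwd = 2, col = "red")
```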
39. Sum of Normal Random Variables
X and Y are independent, normally distributed random variables.
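A standard result for this setting is that the sum of two independent normal random variables is again normal, with mean equal to the sum of the means and variance equal to the sum of the variances. The simulation below is a sketch that checks the mean and variance numerically; the parameter values (means 2 and -1, standard deviations 3 and 4) are arbitrary and not taken from the slides.

```r
set.seed(1)
n <- 100000
x <- rnorm(n, mean = 2, sd = 3)    # X ~ N(2, 3^2)
y <- rnorm(n, mean = -1, sd = 4)   # Y ~ N(-1, 4^2), drawn independently of X
s <- x + y
# Expected: mean close to 2 + (-1) = 1, variance close to 9 + 16 = 25
print(c(mean = mean(s), variance = var(s)))
```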
41. Degrees of Freedom
The degrees of freedom (df) of an estimate is the number of
independent pieces of information on which the estimate is
based.
For example, an estimate of the variance based on a sample
size of 100 is based on more information than an estimate of
the variance based on a sample size of 5.
Suppose we know that the mean height of Martians is 6 and wish to
estimate the variance of their heights. We randomly sample
one Martian and find that its height is 8.
Variance estimate = (8-6)^2 = 4, with 1 degree of freedom.
If we have the height of another Martian, say 9, the new
variance estimate is [(8-6)^2 + (9-6)^2] × 1/2 = 6.5, with 2 degrees of
freedom.
If we do not know the population mean and must estimate it from the sample,
the degrees of freedom are reduced by 1.
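The Martian example can be reproduced in R to show where the lost degree of freedom appears. This sketch contrasts the estimate that uses the known mean (dividing by n) with R's var(), which estimates the mean from the sample and divides by n - 1; the variable names are illustrative.

```r
heights <- c(8, 9)
known_mean <- 6

# Mean known: average squared deviation from 6, with 2 degrees of freedom
var_known_mean <- mean((heights - known_mean)^2)

# Mean unknown: var() uses the sample mean (8.5) and divides by n - 1 = 1
var_unknown_mean <- var(heights)

print(c(known = var_known_mean, unknown = var_unknown_mean))
```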
43. What is inferential statistics?
Generalizing from sample to population
A critical part of inferential statistics involves determining how far
sample statistics are likely to vary from each other and from the
population parameter.
These are determined based on Sampling Distributions.
44. What is a sampling distribution?
A sampling distribution is the distribution of a statistic computed over repeated samples drawn from the same population.
In principle, any statistic can be used. Some common ones are:
• Mean
• Mean absolute value of the deviation from the mean
• Range
• Standard deviation of the sample
• Unbiased estimate of variance
• Variance of the sample
45. Sampling distributions
• A set of three pool balls, each with a number on it.
• Two of the balls are selected randomly (with replacement) and the average of their
numbers is computed.
• Tabulate each outcome and its mean.
• Tabulate the frequencies of the mean of each outcome
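The tabulation described above can be done by enumerating all ordered pairs. This is a sketch that assumes the three balls are numbered 1, 2 and 3; the column and variable names are illustrative.

```r
# All 9 equally likely ordered outcomes of drawing two balls with replacement
outcomes <- expand.grid(first = 1:3, second = 1:3)
outcomes$mean <- (outcomes$first + outcomes$second) / 2
print(outcomes)

# Frequency and relative frequency of each possible mean
mean_freq <- table(outcomes$mean)
print(mean_freq)
print(mean_freq / nrow(outcomes))
```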
48. EXERCISE: SAMPLING DISTRIBUTION OF RANGE
# Draw 10 samples of size 2, with replacement, from the values 1, 2, 3
for (i in 1:10) {
  print(sample(c(1, 2, 3), 2, replace = TRUE))
}
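For comparison with the simulated samples, the exact sampling distribution of the range can be enumerated. This sketch assumes the same three values 1, 2, 3 drawn twice with replacement and defines the range as the absolute difference between the two draws; the names are illustrative.

```r
# All 9 equally likely ordered samples of size 2
outcomes <- expand.grid(first = 1:3, second = 1:3)
# Range of a sample of size 2 = largest value minus smallest value
outcomes$range <- abs(outcomes$first - outcomes$second)
range_freq <- table(outcomes$range)
print(range_freq / nrow(outcomes))
```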
49. Sampling distributions and inferential statistics
library(ggplot2)
# SachinNoNAs is assumed to be an already loaded data frame with a numeric Runs column

# Sampling distribution of the mean: 20 samples of size 2
s <- replicate(20, mean(sample(SachinNoNAs$Runs, 2, replace = TRUE)))
ggplot() + geom_histogram(aes(x = s),
  bins = 100, color = "white", fill = "blue")
#########################################
# Sampling distribution of the mean: 100 samples of size 50
s <- replicate(100, mean(sample(SachinNoNAs$Runs, 50, replace = TRUE)))
ggplot() + geom_histogram(aes(x = s),
  bins = 100, color = "white", fill = "blue")
50. Normal Approximation to Binomial
Assume you have a fair coin and wish to know the probability that you would get 8 heads out of 10 flips.
Using dbinom:
dbinom(8, 10, 0.5)
#[1] 0.04394531
# Binomial probabilities for 0 to 100 heads in 100 flips
plot(0:100, dbinom(0:100, 100, 0.5), col="red", pch=19)
51. Normal Approximation to Binomial
The binomial distribution has a mean of μ = Np = (10)(0.5) = 5
and a variance of σ^2 = Np(1-p) = (10)(0.5)(0.5) = 2.5.
The standard deviation is therefore 1.5811.
A total of 8 heads is (8 - 5)/1.5811 = 1.897 standard deviations above
the mean of the distribution.
Problem: the normal distribution is continuous, so the probability of exactly 8 is zero.
Solution: treat any value from 7.5 to 8.5 as representing an outcome of 8 heads
(a continuity correction), and compute the area under the normal curve from 7.5 to 8.5.
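The area from 7.5 to 8.5 can be computed with pnorm() and compared with the exact binomial probability. This is a sketch using the mean and standard deviation derived above; the variable names are illustrative.

```r
n <- 10
p <- 0.5
mu <- n * p                      # 5
sigma <- sqrt(n * p * (1 - p))   # about 1.5811

# Normal approximation: area under N(mu, sigma) between 7.5 and 8.5
approx <- pnorm(8.5, mean = mu, sd = sigma) - pnorm(7.5, mean = mu, sd = sigma)
# Exact binomial probability of 8 heads in 10 flips
exact <- dbinom(8, size = n, prob = p)
print(c(normal_approx = approx, exact = exact))
```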
52. Central limit theorem
Given a population with a finite mean μ and a finite non-zero variance σ^2,
the sampling distribution of the mean approaches a normal distribution
with a mean of μ and a variance of σ^2/N as N, the sample size, increases.
If a population has a mean μ, then the mean of the sampling
distribution of the mean is also μ:
μ_M = μ
The variance of the sampling distribution of the mean is
σ_M^2 = σ^2/N
54. EXERCISE
1. X = sum of two 6-faced dice. What is the sample space of X? Can you
simulate this using R? The experiment is performed N (= 10, 20, 30) times.
What is the distribution of X? Plot a histogram.
2. Find the sampling distribution of the means of X.
3. What is the mean and variance of the sampling distribution?
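One possible way to start the exercise is sketched below. It assumes fair dice, uses N = 30 for the experiment, and builds the sampling distribution from 5000 repeated experiments; the helper roll_sum and the repetition count are illustrative choices, not part of the exercise.

```r
set.seed(42)

# Hypothetical helper: n independent rolls of the sum of two fair dice
roll_sum <- function(n) {
  sample(1:6, n, replace = TRUE) + sample(1:6, n, replace = TRUE)
}

# 1. One experiment of N = 30 rolls; the sample space of X is 2, 3, ..., 12
x <- roll_sum(30)
hist(x, breaks = seq(1.5, 12.5, by = 1), main = "Sum of two dice, N = 30")

# 2. Sampling distribution of the mean of X over repeated experiments
means <- replicate(5000, mean(roll_sum(30)))
hist(means, breaks = 40, main = "Sampling distribution of the mean, N = 30")

# 3. Compare with theory: mean 7 and variance (35/6)/30
print(c(mean = mean(means), variance = var(means)))
print(c(theory_mean = 7, theory_variance = (35 / 6) / 30))
```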
56. Central limit theorem - Usage
Three central limit theorem examples:
Find the probability that the mean is greater than a certain number
Find the probability that the mean is less than a certain number
Find the probability that the mean is between a certain set of numbers either
side of the mean
57. Central limit theorem - Usage
Problem: A certain group of welfare recipients receives SNAP benefits of $110
per week with a standard deviation of $20. If a random sample of 25 people is
taken, what is the probability their mean benefit will be greater than $120 per
week?
The mean (average or μ)
The standard deviation (σ)
Sample size (n)
In other words, the problem is asking you: “What is the probability that a
sample mean of n items will be greater than a given number?”
58. Central limit theorem - Usage
The mean (average or μ)
The standard deviation (σ)
Population size
Sample size (n)
In other words, the problem is asking you: “What is the probability that a
sample mean of n items will be greater than a given number?”
59. Central limit theorem - Usage
Problem: A certain group of welfare recipients receives SNAP benefits of $110
per week with a standard deviation of $20. If a random sample of 25 people is
taken, what is the probability their mean benefit will be greater than $120 per
week?
X = mean of the random sample
To find: P(X > $120)
By the central limit theorem, X ~ N(110, sd = 20/sqrt(25) = 4), approximately
(X – 110)/4 ~ N(0,1)
The problem translates to P[(X-110)/4 > (120-110)/4], or P(Y > 2.5) where
Y ~ N(0,1)
1 - pnorm(2.5)
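The standardization steps for the SNAP problem can be written out in R. This is a sketch using the numbers from the problem; the variable names are illustrative.

```r
mu <- 110     # population mean benefit
sigma <- 20   # population standard deviation
n <- 25       # sample size

se <- sigma / sqrt(n)   # standard deviation of the sample mean: 4
z <- (120 - mu) / se    # standardized cutoff: 2.5
# Upper-tail probability P(Y > z) for Y ~ N(0, 1)
p_greater <- pnorm(z, lower.tail = FALSE)
print(p_greater)
```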
60. Central limit theorem - Usage
Problem: A population of 29 year-old males has a mean salary of $29,321 with
a standard deviation of $2,120. If a sample of 100 men is taken, what is the
probability their mean salaries will be less than $29,000?
The mean (average or μ) = 29321
The standard deviation (σ) = 2120
Sample size (n) = 100
In other words, the problem is asking you: “What is the probability that a
sample mean of 100 items will be less than a given number?”
X = sample mean
Y = [(X – μ)/(σ/sqrt(n))] ~ N(0,1)
P(Y < [(29000 – μ)/(σ/sqrt(n))]) = P(Y < -321/212) = pnorm(-1.51)
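The salary problem follows the same pattern with a lower-tail probability. This is a sketch using the numbers from the problem; the variable names are illustrative, and the z value is left unrounded, so the result differs slightly from pnorm(-1.51).

```r
mu <- 29321    # population mean salary
sigma <- 2120  # population standard deviation
n <- 100       # sample size

z <- (29000 - mu) / (sigma / sqrt(n))   # about -1.51
# Lower-tail probability P(Y < z) for Y ~ N(0, 1)
p_less <- pnorm(z)
print(c(z = z, probability = p_less))
```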
61. Central limit theorem - Usage
Problem: There are 250 dogs at a dog show who weigh an average of 12
pounds, with a standard deviation of 8 pounds. If 4 dogs are chosen at
random, what is the probability they have an average weight of greater than 8
pounds and less than 25 pounds?
The mean (average or μ) = 12
The standard deviation (σ) = 8
Sample size (n) = 4
In other words, the problem is asking you: “What is the probability that a
sample mean of 4 items will be less than 25 and more than 8?”
X = sample mean
Y = [(X – μ)/(σ/sqrt(n))] ~ N(0,1)
P([(8 – μ)/(σ/sqrt(n))] < Y < [(25 – μ)/(σ/sqrt(n))])
62. Central limit theorem - Usage
The mean (average or μ) = 12
The standard deviation (σ) = 8
Sample size (n) = 4
X ~ sample mean
Y = [(X – μ)/(σ/sqrt(n))] ~ N(0,1)
P([(8 – μ)/(σ/sqrt(n))] < Y < [(25 – μ)/(σ/sqrt(n))])
= P(-4/4 < Y < 13/4)
= pnorm(3.25) - pnorm(-1)
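For an interval, the probability is the difference of two values of the normal CDF. This is a sketch for the dog show problem using the numbers given; the variable names are illustrative.

```r
mu <- 12    # mean weight in pounds
sigma <- 8  # standard deviation in pounds
n <- 4      # sample size

se <- sigma / sqrt(n)      # 4
z_low <- (8 - mu) / se     # -1
z_high <- (25 - mu) / se   # 3.25
# P(z_low < Y < z_high) for Y ~ N(0, 1)
p_between <- pnorm(z_high) - pnorm(z_low)
print(p_between)
```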
63. Chi-square distribution
If X is a standard normal random variable (mean 0 and variance 1), then X^2 has a
Chi-square distribution with 1 degree of freedom.
If X1, X2, X3, ..., Xn are independent standard normal random variables, then
Y = X1^2 + X2^2 + X3^2 + ... + Xn^2 has a Chi-square distribution with
n degrees of freedom.
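The definition can be checked by simulation: sum the squares of n independent standard normal draws and compare the result with the chi-square distribution. This sketch uses n = 3 and 50000 replications, both arbitrary choices; the cutoff 7.81 is only an illustrative point at which to compare probabilities.

```r
set.seed(123)
df <- 3  # number of squared standard normals being summed

# Each row holds df independent N(0, 1) draws
z <- matrix(rnorm(50000 * df), ncol = df)
y <- rowSums(z^2)

# A chi-square variable with df degrees of freedom has mean df and variance 2 * df
print(c(mean = mean(y), variance = var(y)))

# Empirical versus theoretical P(Y <= 7.81)
print(c(empirical = mean(y <= 7.81), theoretical = pchisq(7.81, df = df)))
```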
66. Chi-square distribution
?Chisquare
dchisq(x, df, ncp = 0, log = FALSE)
pchisq(q, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
qchisq(p, df, ncp = 0, lower.tail = TRUE, log.p = FALSE)
rchisq(n, df, ncp = 0)
# Plot the density against x for 1 to 4 degrees of freedom
x <- seq(from = 0, to = 10, by = 0.005)
plot(x, dchisq(x, df=1), type="l")
plot(x, dchisq(x, df=2), type="l")
plot(x, dchisq(x, df=3), type="l")
plot(x, dchisq(x, df=4), type="l")
67. Chi-square distribution
Let X1 and X2 be two independent normal random variables having mean μ =0
and variance σ2 =16. Compute the following probability:
Let X be a chi-square random variable with 3 degrees of freedom.
Compute the following probability: P(0.35 < X < 7.81)
pchisq(7.81, df = 3) - pchisq(0.35, df = 3)
68. Student’s T - Distribution
X1, ..., Xn are independent and identically distributed as N(μ, σ2), i.e. this is a sample
of size n from a normally distributed population with expected mean value μ and
variance σ2.
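Sample mean: M = (X1 + ... + Xn)/n
Sample variance: S^2 = [(X1 - M)^2 + ... + (Xn - M)^2]/(n - 1)
(M - μ)/(σ/sqrt(n)) has a standard normal distribution.
(M - μ)/(S/sqrt(n)) has a Student's t distribution with n-1 degrees of
freedom.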
69. Student’s T - Distribution
Properties of the t Distribution
The mean of the distribution is equal to 0 .
The variance is equal to v / (v - 2), where v is the degrees of
freedom and v > 2.
The variance is always greater than 1, although it is close to 1 when
there are many degrees of freedom.
With infinite degrees of freedom, the t distribution is the same as the
standard normal distribution.
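These properties can be illustrated numerically. The sketch below tabulates the variance v / (v - 2) and the 97.5th percentile of the t distribution for a few illustrative degrees of freedom, and compares the percentile with the standard normal value; the chosen degrees of freedom are arbitrary.

```r
dfs <- c(3, 5, 10, 30, 100)  # illustrative degrees of freedom (all > 2)

# Variance of the t distribution: always above 1, approaching 1 as df grows
t_variance <- dfs / (dfs - 2)
# 97.5th percentile: shrinks toward the standard normal value as df grows
t_q975 <- qt(0.975, df = dfs)

print(data.frame(df = dfs, variance = t_variance, q975 = t_q975))
print(qnorm(0.975))
```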
70. Student’s T - Distribution
?TDist
dt(x, df, ncp, log = FALSE)
pt(q, df, ncp, lower.tail = TRUE, log.p = FALSE)
qt(p, df, ncp, lower.tail = TRUE, log.p = FALSE)
rt(n, df, ncp)
Exercise: Plot the probability density function of Student's t distribution for 1 to 10
degrees of freedom.
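One possible approach to the exercise is sketched below. It evaluates dt() on a grid for each of the 10 degrees of freedom and overlays the curves with matplot(); the grid range (-4 to 4), the colors, and the dashed standard normal reference curve are illustrative choices.

```r
x <- seq(-4, 4, length.out = 200)
dfs <- 1:10

# One column of density values per degrees-of-freedom setting (200 x 10 matrix)
dens <- sapply(dfs, function(v) dt(x, df = v))

cols <- rainbow(length(dfs))
matplot(x, dens, type = "l", lty = 1, col = cols,
        ylab = "density", main = "Student's t density, df = 1 to 10")
# Standard normal density for reference (dashed black line)
lines(x, dnorm(x), lwd = 2, lty = 2)
legend("topright", legend = paste("df =", dfs), col = cols, lty = 1, cex = 0.7)
```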
Editor's notes
All probability distributions can be classified as discrete probability distributions or as continuous probability distributions, depending on whether they define probabilities associated with discrete variables or continuous variables.
Examples of discrete measurements:
the number of admissions in a hospital's accident and emergency unit each day over a period of two months,
the number of people in each household in a survey of 10,000 households.
http://stattrek.com/probability-distributions/binomial.aspx
This has several applications in other fields of civil engineering, such as the probability of occurrence of peak floods greater than the design peak flood in a particular time period, or the probability of peak ground acceleration exceeding a certain design value in a given time interval.
The standard normal curve has mean 0 and standard deviation 1. If a dataset follows a normal distribution, then about 68% of the observations will fall within 1 standard deviation of the mean, which in this case is the interval (-1,1). About 95% of the observations will fall within 2 standard deviations of the mean, which is the interval (-2,2) for the standard normal, and about 99.7% of the observations will fall within 3 standard deviations of the mean, which corresponds to the interval (-3,3) in this case. Although it may appear as if a normal distribution does not include any values beyond a certain interval, the density is actually positive for all values. Data from any normal distribution may be transformed into data following the standard normal distribution by subtracting the mean and dividing by the standard deviation.
You can use the calculator to find the proportion of a normal distribution with a mean of 90 and a standard deviation of 12 that is above 110. Set the mean to 90 and the standard deviation to 12. Then enter "110" in the box to the right of the radio button "Above." At the bottom of the display you will see that the shaded area is 0.0478. See if you can use the calculator to find that the area between 115 and 120 is 0.0124.
Tail risk can be evaluated by assuming a normal distribution and computing the probability of such an event. Is that how "tail risk" should be evaluated?
http://onlinestatbook.com/2/normal_distribution/ch6_exercises.html
http://rpubs.com/Lionel/11497
http://stattrek.com/probability-distributions/t-distribution.aspx