STATISTICS BASICS
BBA CENTERED
yash sadrani
RK UNIVERSITY RAJKOT
CONTENTS
1. Hypothesis
2. Null hypothesis
3. Regression
4. Correlation
5. Exponential Distribution
6. Type I and type II errors
7. Alternative hypothesis
8. Central tendency
9. Bayes' theorem
10. Chebyshev’s Theorem
11. Simple random sampling
12. Descriptive statistics
13. Statistical inference
14. Characteristics of good estimator
15. Properties of the test for independence
16. Utility of regression studies
17. Advantages of Sample Surveys
18. The hypergeometric distribution
19. Leptokurtic distribution
20. The interquartile range
Hypothesis
When a possible correlation or similar relation between phenomena is investigated, such as, for
example, whether a proposed remedy is effective in treating a disease, that is, at least to some extent
and for some patients, the hypothesis that a relation exists cannot be examined the same way one
might examine a proposed new law of nature: in such an investigation a few cases in which the tested
remedy shows no effect do not falsify the hypothesis. Instead, statistical tests are used to determine
how likely it is that the overall effect would be observed if no real relation as hypothesized exists. If that
likelihood is sufficiently small (e.g., less than 1%), the existence of a relation may be assumed.
Otherwise, any observed effect may as well be due to pure chance.
In statistical hypothesis testing two hypotheses are compared, which are called the null hypothesis and
the alternative hypothesis. The null hypothesis is the hypothesis that states that there is no relation
between the phenomena whose relation is under investigation, or at least not of the form given by the
alternative hypothesis. The alternative hypothesis, as the name suggests, is the alternative to the null
hypothesis: it states that there is some kind of relation. The alternative hypothesis may take several
forms, depending on the nature of the hypothesized relation; in particular, it can be two-sided (for
example: there is some effect, in a yet unknown direction) or one-sided (the direction of the
hypothesized relation, positive or negative, is fixed in advance).
Conventional significance levels for testing hypotheses are .10, .05, and .01. The criteria for deciding
whether the null hypothesis is rejected and the alternative hypothesis is accepted must all be determined
in advance, before the observations are collected or inspected. If these criteria are determined later,
when the data to be tested are already known, the test is invalid.
It is important to mention that the above procedure depends on the number of participants (units, or
sample size) included in the study. For instance, the sample size may be too small to reject a null
hypothesis; it is therefore recommended to specify the sample size from the beginning. It is advisable to
define a small, medium and large effect size for each of the important statistical tests used to test the
hypotheses.
A statistical hypothesis test is a method of statistical inference using data from a scientific study. In
statistics, a result is called statistically significant if it is unlikely to have occurred by
chance alone, according to a pre-determined threshold probability, the significance level. The phrase
"test of significance" was coined by statistician Ronald Fisher.[1] These tests are used in determining
what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of
significance; this can help to decide whether results contain enough information to cast doubt
on conventional wisdom, given that conventional wisdom has been used to establish the null hypothesis.
The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be
rejected in favor of the alternative hypothesis. Statistical hypothesis testing is sometimes called
confirmatory data analysis, in contrast to exploratory data analysis, which may not have pre-specified
hypotheses.
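To make the decision rule concrete, here is a minimal Python sketch of a significance test on invented data: a hypothetical coin flipped 100 times, with the null hypothesis that the coin is fair. Doubling the one-tail probability is a common approximation for the two-sided binomial p-value.

```python
# Hypothetical study: is a coin fair? H0: p = 0.5, tested at alpha = 0.05.
from scipy.stats import binom

n, heads = 100, 61        # invented data: 61 heads in 100 flips
alpha = 0.05              # significance level, fixed before seeing the data

# binom.sf(k, n, p) = P(X > k), so this is P(X >= 61) under H0, doubled
p_value = 2 * binom.sf(heads - 1, n, 0.5)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: do not reject H0")
```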
Example 1 – Philosopher's beans
The following example was produced by a philosopher describing scientific methods generations before
hypothesis testing was formalized and popularized.
Few beans of this handful are white.
Most beans in this bag are white.
Therefore: Probably, these beans were taken from another bag.
This is a hypothetical inference.
The beans in the bag are the population. The handful are the sample. The null hypothesis is that the
sample originated from the population. The criterion for rejecting the null-hypothesis is the "obvious"
difference in appearance (an informal difference in the mean). The interesting result is that
consideration of a real population and a real sample produced an imaginary bag. The philosopher was
considering logic rather than probability. To be a real statistical hypothesis test, this example requires
the formalities of a probability calculation and a comparison of that probability to a standard.
A simple generalization of the example considers a mixed bag of beans and a handful that contain either
very few or very many white beans. The generalization considers both extremes. It requires more
calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged:
if the composition of the handful is greatly different from that of the bag, then the sample probably
originated from another bag. The original example is termed a one-sided or a one-tailed test while the
generalization is termed a two-sided or two-tailed test.
Null hypothesis
In statistical inference of observed data of a scientific experiment, the null hypothesis refers to a general
or default position: that there is no relationship between two measured phenomena,[1] or that a
potential medical treatment has no effect.[2] Rejecting or disproving the null hypothesis – and thus
concluding that there are grounds for believing that there is a relationship between two phenomena or
that a potential treatment has a measurable effect – is a central task in the modern practice of science,
and gives a precise sense in which a claim is capable of being proven false.
In statistical significance testing, the null hypothesis, often denoted H0 (read “H-naught”), is generally
assumed true until evidence indicates otherwise (e.g., H0: μ = 500 hours). The concept of a null
hypothesis is used differently in two approaches to statistical inference, though, problematically, the
same term is used. In the significance testing approach of Ronald Fisher, a null hypothesis is potentially
rejected or disproved on the basis of data that are significantly unlikely under its assumption, but never accepted
or proved. In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is
contrasted with an alternative hypothesis, and these are decided between on the basis of data, with
certain error rates. These two approaches criticized each other, though today a hybrid approach is
widely practiced and presented in textbooks. This hybrid is in turn criticized as incorrect and incoherent
– see statistical hypothesis testing. Statistical significance plays a pivotal role in statistical hypothesis
testing where it is used to determine if a null hypothesis can be rejected or retained.
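As a hedged illustration of the H0: μ = 500 hours example mentioned above, the sketch below runs a two-sided z-test; the sample size, sample mean, and known population standard deviation are all invented for the example.

```python
# Testing H0: mu = 500 hours against a two-sided alternative.
import math
from scipy.stats import norm

mu0 = 500.0                          # value claimed by the null hypothesis
n, xbar, sigma = 36, 485.0, 42.0     # assumed sample figures and known sd

z = (xbar - mu0) / (sigma / math.sqrt(n))   # standardized test statistic
p_value = 2 * norm.sf(abs(z))               # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")    # small p is evidence against H0
```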
Regression
In statistics, regression analysis is a statistical process for estimating the relationships among variables. It
includes many techniques for modeling and analyzing several variables, when the focus is on the
relationship between a dependent variable and one or more independent variables. More specifically,
regression analysis helps one understand how the typical value of the dependent variable (or 'Criterion
Variable') changes when any one of the independent variables is varied, while the other independent
variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of
the dependent variable given the independent variables – that is, the average value of the dependent
variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other
location parameter of the conditional distribution of the dependent variable given the independent
variables. In all cases, the estimation target is a function of the independent variables called the
regression function. In regression analysis, it is also of interest to characterize the variation of the
dependent variable around the regression function which can be described by a probability distribution.
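A minimal sketch, on invented data, of estimating such a regression function by ordinary least squares:

```python
# Fit E[Y | X = x] = intercept + slope * x by least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # dependent variable

slope, intercept = np.polyfit(x, y, 1)    # degree-1 polynomial fit
print(f"estimated regression function: E[Y|X=x] = {intercept:.2f} + {slope:.2f} x")
```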
1. (Psychology) psychol the adoption by an adult or adolescent of behaviour more appropriate to a child,
esp as a defence mechanism to avoid anxiety
2. (Statistics) statistics
a. the analysis or measure of the association between one variable (the dependent variable) and one or
more other variables (the independent variables), usually formulated in an equation in which the
independent variables have parametric coefficients, which may enable future values of the dependent
variable to be predicted
b. (as modifier): regression curve.
3. (Astronomy) astronomy the slow movement around the ecliptic of the two points at which the
moon's orbit intersects the ecliptic. One complete revolution occurs about every 19 years
4. (Geological Science) geology the retreat of the sea from the land
5. (Statistics) the act of regressing
6. (Logic) the act of regressing
Correlation
In statistics, dependence is any statistical relationship between two random variables or two sets of
data. Correlation refers to any of a broad class of statistical relationships involving dependence.
Familiar examples of dependent phenomena include the correlation between the physical statures of
parents and their offspring, and the correlation between the demand for a product and its price.
Correlations are useful because they can indicate a predictive relationship that can be exploited in
practice. For example, an electrical utility may produce less power on a mild day based on the
correlation between electricity demand and weather. In this example there is a causal relationship,
because extreme weather causes people to use more electricity for heating or cooling; however,
statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e.,
correlation does not imply causation).
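A minimal sketch of measuring such a dependence, using invented temperature and electricity-demand figures in the spirit of the utility example above:

```python
# Pearson correlation between two invented variables.
import numpy as np

temperature = np.array([10, 15, 20, 25, 30, 35])  # hypothetical daily highs
demand      = np.array([60, 55, 50, 52, 65, 80])  # hypothetical power demand

r = np.corrcoef(temperature, demand)[0, 1]        # Pearson's r
print(f"correlation r = {r:.3f}")                 # dependence, not causation
```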
1. A causal, complementary, parallel, or reciprocal relationship, especially a structural, functional, or
qualitative correspondence between two comparable entities: a correlation between drug abuse and
crime.
2. Statistics The simultaneous change in value of two numerically valued random variables: the positive
correlation between cigarette smoking and the incidence of lung cancer; the negative correlation
between age and normal vision.
3. An act of correlating or the condition of being correlated.
Exponential Distribution
The Exponential distribution is used to describe survival times.
Suppose that some device has the same hazard rate λ at each moment. The survival time is therefore
1/λ on average.
Let the random variable X denote the time of failure. X then follows the exponential distribution with
parameter λ. The probability density function of X is

f_X(x) = λ exp(−λx) if x ≥ 0, and f_X(x) = 0 otherwise.

The expected value of X is 1/λ and the variance is 1/λ².
Example:
A man enters a bank at 4pm. There is one person in front of him in the queue. Suppose that the length
of time an individual spends with a teller is an exponential random variable with mean 7 minutes.
Let X be the length of time the man in front spends with the teller; λ = 1/7, so X ~ Exp(1/7). The
probability that the man who entered the bank at 4pm has to wait more than 10 minutes to be served is
P(X > 10) = 1 − F_X(10), where F_X(10) is the cumulative distribution function of the exponential
distribution evaluated at t = 10. The cumulative distribution function of the exponential distribution is

F_X(t) = ∫₀ᵗ λ exp(−λx) dx = [−exp(−λx)]₀ᵗ = 1 − exp(−λt).

The probability that the man has to wait more than 10 minutes is therefore

1 − (1 − exp(−λt)) = exp(−10/7) ≈ 0.240.
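As a quick check, this minimal Python sketch reproduces the calculation above; the figures come from the worked example, nothing else is assumed.

```python
# Numerical check of the worked example: P(X > 10) for rate lambda = 1/7.
import math

lam = 1 / 7                          # rate parameter (mean service time 7 min)
p_wait = math.exp(-lam * 10)         # survival function 1 - F_X(10)
print(f"P(X > 10) = {p_wait:.3f}")   # about 0.240, matching the hand calculation
```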
Type I and type II errors
In statistics, a null hypothesis is a statement that the thing being studied produces no effect or makes no
difference. An example of a null hypothesis is the statement "This diet has no effect on people's weight."
Usually an experimenter frames a null hypothesis with the intent of rejecting it: that is, intending to run
an experiment which produces data that shows that the thing under study does make a difference.
A type I error (or error of the first kind) is the incorrect rejection of a true null hypothesis. It is a false
positive. Usually a type I error leads one to conclude that a supposed effect or relationship exists when
in fact it doesn't. Examples of type I errors include a test that shows a patient to have a disease when in
fact the patient does not have the disease, a fire alarm going off indicating a fire when in fact there is no
fire, or an experiment indicating that a medical treatment should cure a disease when in fact it does not.
A type II error (or error of the second kind) is the failure to reject a false null hypothesis. It is a false
negative. Examples of type II errors would be a blood test failing to detect the disease it was designed to
detect, in a patient who really has the disease; a fire breaking out and the fire alarm not ringing; or a
clinical trial of a medical treatment failing to show that the treatment works when really it does.
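The following simulation sketch illustrates the type I error rate: when the null hypothesis really is true, a test at the .05 level should produce a false positive in roughly 5% of trials. The normal data and the one-sample t-test are assumptions made for the illustration.

```python
# Estimate the type I error rate of a t-test when H0 (mean = 0) is true.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha, trials, false_positives = 0.05, 2000, 0

for _ in range(trials):
    sample = rng.normal(loc=0.0, scale=1.0, size=30)  # H0 is true here
    _, p = ttest_1samp(sample, popmean=0.0)
    if p < alpha:
        false_positives += 1                          # a type I error

print(f"type I error rate ~ {false_positives / trials:.3f}")  # near 0.05
```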
Alternative hypothesis
In statistical hypothesis testing, the alternative hypothesis (or maintained hypothesis or research
hypothesis) and the null hypothesis are the two rival hypotheses which are compared by a statistical
hypothesis test. An example might be where water quality in a stream has been observed over many
years and a test is made of the null hypothesis that there is no change in quality between the first and
second halves of the data against the alternative hypothesis that the quality is poorer in the second half
of the record.
Central tendency
In statistics, a central tendency (or, more commonly, a measure of central tendency) is a central value or
a typical value for a probability distribution.[1] It is occasionally called an average or just the center of
the distribution. The most common measures of central tendency are the arithmetic mean, the median
and the mode. A central tendency can be calculated for either a finite set of values or for a theoretical
distribution, such as the normal distribution. Occasionally authors use central tendency (or centrality) to
mean "the tendency of quantitative data to cluster around some central value". [2][3] This meaning
might be expected from the usual dictionary definitions of the words tendency and centrality. Those
authors may judge whether data has a strong or a weak central tendency based on the statistical
dispersion, as measured by the standard deviation or something similar.
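A minimal sketch computing the three common measures on a small invented data set:

```python
# Mean, median and mode of an invented sample.
import statistics

data = [2, 3, 3, 5, 7, 10, 3]

print("mean:  ", statistics.mean(data))    # arithmetic mean
print("median:", statistics.median(data))  # middle value when sorted
print("mode:  ", statistics.mode(data))    # most frequent value (3)
```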
Bayes' theorem
In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) is a result
that is of importance in the mathematical manipulation of conditional probabilities. It is a result that
derives from the more basic axioms of probability.
When applied, the probabilities involved in Bayes' theorem may have any of a number of probability
interpretations. In one of these interpretations, the theorem is used directly as part of a particular
approach to statistical inference. In particular, with the Bayesian interpretation of probability, the
theorem expresses how a subjective degree of belief should rationally change to account for evidence:
this is Bayesian inference, which is fundamental to Bayesian statistics. However, Bayes' theorem has
applications in a wide range of calculations involving probabilities, not just in Bayesian inference.
An Introduction to Bayes' Theorem
Bayes' Theorem is a theorem of probability theory originally stated by the Reverend Thomas Bayes. It
can be seen as a way of understanding how the probability that a theory is true is affected by a new
piece of evidence. It has been used in a wide variety of contexts, ranging from marine biology to the
development of "Bayesian" spam blockers for email systems. In the philosophy of science, it has been
used to try to clarify the relationship between theory and evidence. Many insights in the philosophy of
science involving confirmation, falsification, the relation between science and pseudoscience, and other
topics can be made more precise, and sometimes extended or corrected, by using Bayes' Theorem.
These pages will introduce the theorem and its use in the philosophy of science.
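The theorem itself states that P(A|B) = P(B|A) P(A) / P(B). A worked sketch with invented disease-screening figures shows how a prior degree of belief is updated by new evidence:

```python
# Bayes' theorem on an invented screening test.
p_disease = 0.01             # prior: 1% of the population has the disease
p_pos_given_disease = 0.95   # test sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive result (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # about 0.161
```

Even with a fairly accurate test, the low prior keeps the posterior modest; this is the kind of belief revision the Bayesian interpretation describes.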
Chebyshev’s Theorem:
For any set of data (either population or sample) and for any constant k greater than 1, the proportion
of the data that must lie within k standard deviations on either side of the mean is at least

1 − 1/k².

In ordinary words, Chebyshev’s Theorem says the following about sample or population data:
1) Start at the mean.
2) Back off k standard deviations below the mean and then advance k standard deviations
above the mean.
3) The fractional part of the data in the interval described will be at least 1 − 1/k² (we assume k > 1).
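A minimal sketch evaluating the bound for a few values of k:

```python
# Chebyshev's bound: at least 1 - 1/k^2 of the data lies within k sd of the mean.
for k in [1.5, 2, 3]:
    bound = 1 - 1 / k**2
    print(f"k = {k}: at least {bound:.1%} of the data within {k} sd of the mean")
```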
Simple random sampling
In a simple random sample (SRS) of a given size, all such subsets of the frame are given an equal
probability. Furthermore, any given pair of elements has the same chance of selection as any other such
pair (and similarly for triples, and so on). This minimises bias and simplifies analysis of results. In
particular, the variance between individual results within the sample is a good indicator of variance in
the overall population, which makes it relatively easy to estimate the accuracy of results.
However, SRS can be vulnerable to sampling error because the randomness of the selection may result
in a sample that doesn't reflect the makeup of the population. For instance, a simple random sample of
ten people from a given country will on average produce five men and five women, but any given trial is
likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques
attempt to overcome this problem by "using information about the population" to choose a more
"representative" sample.
One of the best ways to achieve unbiased results in a study is through random sampling. Random
sampling includes choosing subjects from a population through unpredictable means. In its simplest
form, subjects all have an equal chance of being selected out of the population being researched.
Simple random sampling is a method of selecting a sample (a random sample) from a statistical
population in such a way that every possible sample that could be selected has a predetermined
probability of being selected.
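A minimal sketch of drawing a simple random sample; the frame of 100 numbered units is invented for the example:

```python
# Simple random sample without replacement: every size-10 subset of the
# frame has the same probability of being chosen.
import random

frame = list(range(1, 101))        # hypothetical sampling frame of 100 units
random.seed(42)                    # reproducibility of the sketch
sample = random.sample(frame, 10)  # SRS of size n = 10
print(sample)
```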
What is the difference between the coefficient of determination and the coefficient of correlation?
The coefficient of correlation is the “R” value given in the summary table in the Regression output. R
square, also called the coefficient of determination, is R multiplied by R. In other words, the coefficient
of determination is the square of the coefficient of correlation.
R square, or the coefficient of determination, shows the percentage of variation in y that is explained by
all the x variables together. The higher, the better. It is always between 0 and 1; it can never be negative,
since it is a squared value.
It is easy to explain R square in terms of regression. It is not so easy to explain R in terms of regression.
For example, if the coefficient of correlation R is .850 (or 85%), then the coefficient of determination
R square is .723 (or 72.3%). R square is simply the square of R, i.e. R times R.
Coefficient of correlation: the degree of relationship between two variables, say x and y. It can range
between -1 and 1. A value of 1 indicates that the two variables move in unison: they rise and fall together
and have perfect correlation. A value of -1 means the two variables are perfect opposites: one goes up
and the other goes down, in a perfectly negative way. Any two variables in this universe can be argued to
have a correlation value. If they are not correlated, the correlation value can still be computed: it would
be 0. The correlation value always lies between -1 and 1 (passing through 0, which means no correlation
at all). Correlation can rightfully be explained for simple linear regression, because there is only one x
and one y variable. For multiple linear regression, R is computed, but it is then difficult to explain because
multiple variables are involved. That is why R square is the better term: you can explain R square for both
simple and multiple linear regressions.
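A minimal sketch, on invented data, making the relationship concrete: for simple linear regression, R square is just R times R.

```python
# Coefficient of correlation (r) and coefficient of determination (r^2).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.0, 2.8, 4.1, 4.9, 6.2, 6.8])

r = np.corrcoef(x, y)[0, 1]   # coefficient of correlation
print(f"r = {r:.3f}, R^2 = {r**2:.3f}")
```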
Descriptive statistics
Descriptive statistics is the discipline of quantitatively describing the main features of a collection of
information,[1] or the quantitative description itself. Descriptive statistics are distinguished from
inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a sample,
rather than use the data to learn about the population that the sample of data is thought to represent.
This generally means that descriptive statistics, unlike inferential statistics, are not developed on the
basis of probability theory.[2] Even when a data analysis draws its main conclusions using inferential
statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study
involving human subjects, there typically appears a table giving the overall sample size, sample sizes in
important subgroups (e.g., for each treatment or exposure group), and demographic or clinical
characteristics such as the average age, the proportion of subjects of each sex, and the proportion of
subjects with related comorbidities.
Some measures that are commonly used to describe a data set are measures of central tendency and
measures of variability or dispersion. Measures of central tendency include the mean, median and
mode, while measures of variability include the standard deviation (or variance), the minimum and
maximum values of the variables, kurtosis and skewness.[3]
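A minimal sketch computing the measures listed above for one invented variable, assuming scipy is available for skewness and kurtosis:

```python
# Common descriptive statistics for a single variable.
import numpy as np
from scipy.stats import skew, kurtosis

data = np.array([23, 25, 31, 34, 35, 38, 41, 45, 52, 60])  # invented ages

print("mean:    ", np.mean(data))
print("median:  ", np.median(data))
print("std:     ", np.std(data, ddof=1))    # sample standard deviation
print("min/max: ", data.min(), data.max())
print("skewness:", skew(data))              # asymmetry of the distribution
print("kurtosis:", kurtosis(data))          # excess kurtosis (normal = 0)
```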
Statistical inference
In statistics, statistical inference is the process of drawing conclusions from data that are subject to random
variation, for example, observational errors or sampling variation.[1] More substantially, the terms statistical
inference, statistical induction and inferential statistics are used to describe systems of procedures that can be
used to draw conclusions from datasets arising from systems affected by random variation,[2] such as
observational errors, random sampling, or random experimentation.[1] Initial requirements of such a system of
procedures for inference and induction are that the system should produce reasonable answers when applied to
well-defined situations and that it should be general enough to be applied across a range of situations. Inferential
statistics are used to test hypotheses and make estimations using sample data.
The outcome of statistical inference may be an answer to the question "what should be done next?", where this
might be a decision about making further experiments or surveys, or about drawing a conclusion before
implementing some organizational or governmental policy.
Characteristics of a good estimator
There are four main properties associated with a "good" estimator. These are:
1) Unbiasedness: the expected value of the estimator (or the mean of the estimator) is simply the figure
being estimated. In statistical terms, E(estimate of Y) = Y.
2) Consistency: the estimator converges in probability to the estimated figure. In other words, as the
sample size approaches the population size, the estimator gets closer and closer to the estimated value.
3) Efficiency: the estimator has a low variance, usually relative to other estimators (which is called
relative efficiency); otherwise, the variance of the estimator is minimized.
4) Robustness: the mean-squared error of the estimator is minimized relative to other estimators. The
estimator should also be unbiased and consistent.
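A simulation sketch of the first two properties: averaging the sample mean over many repeated samples illustrates unbiasedness, and the shrinking spread as n grows illustrates consistency. The normal population is an assumption made for the illustration.

```python
# The sample mean as an estimator of the population mean.
import numpy as np

rng = np.random.default_rng(1)
true_mean = 10.0

for n in [10, 100, 1000]:
    # 5000 repeated samples of size n, each reduced to its sample mean
    estimates = rng.normal(true_mean, 5.0, size=(5000, n)).mean(axis=1)
    print(f"n={n:5d}: mean of estimates = {estimates.mean():.3f}, "
          f"sd of estimates = {estimates.std():.3f}")
# The mean of the estimates stays near 10 (unbiasedness) while their
# spread shrinks as n grows (consistency).
```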
Stats: Test for Independence
In the test for independence, the claim is that the row and column variables are independent of each other. This is
the null hypothesis.
The multiplication rule said that if two events were independent, then the probability of both occurring was the
product of the probabilities of each occurring. This is key to working the test for independence. If you end up
rejecting the null hypothesis, then the assumption must have been wrong and the row and column
variables are dependent. Remember, all hypothesis testing is done under the assumption that the null
hypothesis is true.
The test statistic used is the same as the chi-square goodness-of-fit test. The principle behind the test for
independence is the same as the principle behind the goodness-of-fit test. The test for independence is always a
right tail test.
In fact, you can think of the test for independence as a goodness-of-fit test where the data is arranged into table
form. This table is called a contingency table.
The test statistic has a chi-square distribution when the following assumptions are met:
The data are obtained from a random sample.
The expected frequency of each category must be at least 5.
The following are properties of the test for independence:
The data are the observed frequencies.
The data are arranged into a contingency table.
The degrees of freedom are the degrees of freedom for the row variable times the degrees of freedom
for the column variable. It is not one less than the sample size; it is the product of the two degrees of
freedom.
It is always a right tail test.
It has a chi-square distribution.
The expected value is computed by taking the row total times the column total and dividing by the grand
total.
The value of the test statistic doesn't change if the order of the rows or columns is switched.
The value of the test statistic doesn't change if the rows and columns are interchanged (transpose of
the matrix).
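A minimal sketch implementing the expected-frequency and degrees-of-freedom rules above on an invented 2×3 contingency table:

```python
# Chi-square test for independence, computed from first principles.
import numpy as np
from scipy.stats import chi2

observed = np.array([[20, 30, 25],
                     [30, 20, 25]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand = observed.sum()

expected = row_totals @ col_totals / grand   # row total * column total / grand total
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)  # (rows-1)*(cols-1)
p_value = chi2.sf(stat, dof)                 # right-tail test

print(f"chi-square = {stat:.3f}, dof = {dof}, p = {p_value:.4f}")
```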
Utility of regression studies
Regression models can be used to help understand and explain relationships among variables; they can also be
used to predict actual outcomes. In this course you will learn how multiple linear regression models are derived,
use software to implement them, learn what assumptions underlie the models, learn how to test whether your
data meet those assumptions and what can be done when those assumptions are not met, and develop strategies
for building and understanding useful models.
Advantages of Sample Surveys
Cost Reduction
In most cases, conducting a sample survey costs less than a census survey. If fewer people are surveyed,
fewer surveys need to be produced, printed, shipped, administered, and analyzed. Further, fewer data
reports are often required, thus reducing the amount of time and expense needed to analyze and
distribute the results.
Generalizability of Results
If conducted properly, the results of a sample survey can still be generalized to the entire population, meaning
that the sample results can be considered representative of the views of the entire target population. Sampling
strategies should be firmly aligned with the overarching survey goals to ensure the utilization of a proper sample
frame and sample size.
Timeliness
Sample surveys can typically be printed, distributed, administered, and analyzed more quickly than census
surveys. As a result, a shorter turnaround time for results is often achieved.
Identification of Strengths & Opportunities
As with census surveys, results from a properly conducted sample survey can also be used to identify strengths
and opportunities and develop plans for meaningful change.
Cost: By comparison with a complete enumeration of the same population, a sample may be based on
data for only a small number of the units comprising that population. A sample survey may thus be very
much less expensive to conduct than a comparable complete enumeration.
Time: Being small in scale, a sample survey is not only less expensive than a census; the desired information is
obtained in much less time.
Scope: The smaller scale is likely to permit the collection of a wider range of survey data and allow a
wider choice of methods of observation, measurement or questioning than is usually feasible with a
complete enumeration.
Respondents' Convenience: The sample survey considerably reduces the overall burden on the
respondents, in that only a few, not all, of the individuals in the population are put to the trouble of
having to answer questions or provide information.
Labor: Sampling saves labor. A smaller staff is required, both for fieldwork and for tabulating and processing the data.
Flexibility: In certain types of investigation, highly skilled and trained personnel or even specialized
equipment are needed to collect data. A complete enumeration in such cases is impracticable, and hence
sample surveys, being more flexible and having greater scope, are more appropriate for this type of
inquiry.
Data Processing: The data-processing requirement for a sample survey is likely to be much less than for a complete
count. Whereas a complete count may well require a computer to process the data, a sample survey can often be
processed manually with fewer people and less logistical support.
Accuracy: A sample survey can employ personnel of higher quality, with intensive training, and more
careful supervision of the fieldwork is possible. As a result, observations, measurements, or questioning
for a sample survey can often be carried out more carefully, and thus yield results subject to smaller
non-sampling error than is generally practicable in a more extensive complete enumeration, usually at a
much lower cost.
Feasibility: There are situations where complete enumeration is not feasible and thus a sample survey is
necessary. There are also instances where it is not practicable to enumerate all the units due to their
perishable or fragile nature. The alternative in these situations is to take only a few of the units. For
example, consider the problem of checking the quality of mango juice produced by a local company. One
way to test the quality is to drink the entire lot, which is impracticable. Testing of electric bulbs, screws,
glass, and medicine are all examples of this type, where sampling is necessary.
The hypergeometric distribution applies to sampling without replacement from a
finite population whose elements can be classified into two mutually exclusive categories like Pass/Fail,
Male/Female or Employed/Unemployed. As random selections are made from the population, each subsequent
draw decreases the population causing the probability of success to change with each draw.
The following conditions characterise the hypergeometric distribution:
The result of each draw can be classified into one of two categories.
The probability of a success changes on each draw.
A random variable X follows the hypergeometric distribution if its probability mass function (pmf) is given by:[1]

P(X = k) = C(K, k) · C(N − K, n − k) / C(N, n)

where:
N is the population size,
K is the number of success states in the population,
n is the number of draws,
k is the number of successes, and
C(a, b) is a binomial coefficient.

The pmf is positive when max(0, n + K − N) ≤ k ≤ min(K, n).
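A minimal sketch of the pmf, using math.comb for the binomial coefficients; the batch figures in the example are invented:

```python
# Hypergeometric pmf: P(X = k) = C(K, k) C(N-K, n-k) / C(N, n).
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Probability of k successes in n draws without replacement from a
    population of N units containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Example: exactly 2 defective items when drawing 5 from a batch of 20
# that contains 6 defectives.
print(f"{hypergeom_pmf(2, N=20, K=6, n=5):.4f}")   # about 0.3522
```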
Leptokurtic distribution
In probability theory and statistics, kurtosis (from the Greek word κυρτός, kyrtos or kurtos, meaning curved,
arching) is any measure of the "peakedness" of the probability distribution of a real-valued random variable.[1] In
a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just
as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of
estimating it from a sample from a population. There are various interpretations of kurtosis, and of how particular
measures should be interpreted; these are primarily peakedness (width of peak), tail weight, and lack of shoulders
(distribution primarily peak and tails, not in between).
One common measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment
of the data or population, but it has been argued that this really measures heavy tails, and not peakedness.[2] For
this measure, higher kurtosis means more of the variance is the result of infrequent extreme deviations, as
opposed to frequent modestly sized deviations. It is common practice to use an adjusted version of Pearson's
kurtosis, the excess kurtosis, to provide a comparison of the shape of a given distribution to that of the normal
distribution. Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic
distributions, respectively.
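A simulation sketch contrasting excess kurtosis for a normal sample and a heavy-tailed one; the Student t distribution with 5 degrees of freedom is chosen here as a standard leptokurtic example:

```python
# Excess kurtosis: near 0 for normal data, positive for leptokurtic data.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)            # mesokurtic baseline
heavy_tailed = rng.standard_t(df=5, size=100_000)   # leptokurtic

print("normal:", kurtosis(normal_sample))   # close to 0
print("t(5):  ", kurtosis(heavy_tailed))    # clearly above 0
```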
The interquartile range
The interquartile range (IQR), also called the midspread or middle fifty, is a measure of statistical
dispersion, being equal to the difference between the upper and lower quartiles:[1][2] IQR = Q3 − Q1. In
other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly
seen on a box plot of the data. It is a trimmed estimator, defined as the 25% trimmed mid-range, and is
the most significant basic robust measure of scale.
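A minimal sketch of the computation on invented data:

```python
# IQR = Q3 - Q1, the third quartile minus the first quartile.
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])   # invented ordered data
q1, q3 = np.percentile(data, [25, 75])
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
```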

Weitere ähnliche Inhalte

Was ist angesagt?

Statistical hypothesis
Statistical hypothesisStatistical hypothesis
Statistical hypothesis
Hasnain Baber
 
Test of hypothesis
Test of hypothesisTest of hypothesis
Test of hypothesis
vikramlawand
 
S5 w1 hypothesis testing & t test
S5 w1 hypothesis testing & t testS5 w1 hypothesis testing & t test
S5 w1 hypothesis testing & t test
Rachel Chung
 
Formulating hypotheses
Formulating hypothesesFormulating hypotheses
Formulating hypotheses
Aniket Verma
 

Was ist angesagt? (20)

Hypothesis
HypothesisHypothesis
Hypothesis
 
Hypothesis
HypothesisHypothesis
Hypothesis
 
Testing of hypothesis
Testing of hypothesisTesting of hypothesis
Testing of hypothesis
 
Statistical hypothesis
Statistical hypothesisStatistical hypothesis
Statistical hypothesis
 
Test of hypothesis
Test of hypothesisTest of hypothesis
Test of hypothesis
 
Types of Hypothesis-Advance Research Methodology
Types of Hypothesis-Advance Research MethodologyTypes of Hypothesis-Advance Research Methodology
Types of Hypothesis-Advance Research Methodology
 
Hypothesis types, formulation, and testing
Hypothesis types, formulation, and testingHypothesis types, formulation, and testing
Hypothesis types, formulation, and testing
 
Hypothesis Testing. Inferential Statistics pt. 2
Hypothesis Testing. Inferential Statistics pt. 2Hypothesis Testing. Inferential Statistics pt. 2
Hypothesis Testing. Inferential Statistics pt. 2
 
T‑tests
T‑testsT‑tests
T‑tests
 
Hypothesis
Hypothesis Hypothesis
Hypothesis
 
S5 w1 hypothesis testing & t test
S5 w1 hypothesis testing & t testS5 w1 hypothesis testing & t test
S5 w1 hypothesis testing & t test
 
Research
ResearchResearch
Research
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively Statistical skepticism: How to use significance tests effectively
Statistical skepticism: How to use significance tests effectively
 
Basics of Educational Statistics (Hypothesis and types)
Basics of Educational Statistics (Hypothesis and types)Basics of Educational Statistics (Hypothesis and types)
Basics of Educational Statistics (Hypothesis and types)
 
Formulating hypotheses
Formulating hypothesesFormulating hypotheses
Formulating hypotheses
 
Hypothesis
HypothesisHypothesis
Hypothesis
 
Hypothesis Testing
Hypothesis TestingHypothesis Testing
Hypothesis Testing
 
P-Value: a true test of significance in agricultural research
P-Value: a true test of significance in agricultural researchP-Value: a true test of significance in agricultural research
P-Value: a true test of significance in agricultural research
 
types of hypothesis
types of hypothesistypes of hypothesis
types of hypothesis
 

Ähnlich wie Statistics basics

What do you think will likely happen when a cell containing 1 suc.docx
What do you think will likely happen when a cell containing 1 suc.docxWhat do you think will likely happen when a cell containing 1 suc.docx
What do you think will likely happen when a cell containing 1 suc.docx
alanfhall8953
 
OBSERVATIONAL STUDIES PPT.pptx
OBSERVATIONAL STUDIES PPT.pptxOBSERVATIONAL STUDIES PPT.pptx
OBSERVATIONAL STUDIES PPT.pptx
KrishnaveniManubolu
 
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docxPage 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
karlhennesey
 

Ähnlich wie Statistics basics (20)

20 OCT-Hypothesis Testing.ppt
20 OCT-Hypothesis Testing.ppt20 OCT-Hypothesis Testing.ppt
20 OCT-Hypothesis Testing.ppt
 
HYPOTHESIS
HYPOTHESISHYPOTHESIS
HYPOTHESIS
 
Hypotheses
HypothesesHypotheses
Hypotheses
 
Hypothesis testing
Hypothesis testingHypothesis testing
Hypothesis testing
 
Chapter 9-10.James Dean Brown.Farajnezhad
Chapter 9-10.James Dean Brown.FarajnezhadChapter 9-10.James Dean Brown.Farajnezhad
Chapter 9-10.James Dean Brown.Farajnezhad
 
What do you think will likely happen when a cell containing 1 suc.docx
What do you think will likely happen when a cell containing 1 suc.docxWhat do you think will likely happen when a cell containing 1 suc.docx
What do you think will likely happen when a cell containing 1 suc.docx
 
ch 2 hypothesis
ch 2 hypothesisch 2 hypothesis
ch 2 hypothesis
 
Hypothesis Formulation
Hypothesis Formulation Hypothesis Formulation
Hypothesis Formulation
 
NULL AND ALTERNATIVE HYPOTHESIS.pptx
NULL AND ALTERNATIVE HYPOTHESIS.pptxNULL AND ALTERNATIVE HYPOTHESIS.pptx
NULL AND ALTERNATIVE HYPOTHESIS.pptx
 
Testing of hypothesis and tests of significance
Testing of hypothesis and tests of significanceTesting of hypothesis and tests of significance
Testing of hypothesis and tests of significance
 
Parmetric and non parametric statistical test in clinical trails
Parmetric and non parametric statistical test in clinical trailsParmetric and non parametric statistical test in clinical trails
Parmetric and non parametric statistical test in clinical trails
 
OBSERVATIONAL STUDIES PPT.pptx
OBSERVATIONAL STUDIES PPT.pptxOBSERVATIONAL STUDIES PPT.pptx
OBSERVATIONAL STUDIES PPT.pptx
 
Types of hypothesis
Types of hypothesisTypes of hypothesis
Types of hypothesis
 
Statistics
StatisticsStatistics
Statistics
 
research (hypothesis).pdf
research (hypothesis).pdfresearch (hypothesis).pdf
research (hypothesis).pdf
 
hypothesis testing overview
hypothesis testing overviewhypothesis testing overview
hypothesis testing overview
 
Population_ sample and hypothesis.pdf
Population_ sample and hypothesis.pdfPopulation_ sample and hypothesis.pdf
Population_ sample and hypothesis.pdf
 
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docxPage 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
Page 266LEARNING OBJECTIVES· Explain how researchers use inf.docx
 
pengajian malaysia
pengajian malaysiapengajian malaysia
pengajian malaysia
 
RURAL RESEARCH METHOD AND METHODOLOGY ,,, Hypothesis In Research
RURAL RESEARCH METHOD AND METHODOLOGY ,,,  Hypothesis In ResearchRURAL RESEARCH METHOD AND METHODOLOGY ,,,  Hypothesis In Research
RURAL RESEARCH METHOD AND METHODOLOGY ,,, Hypothesis In Research
 

Mehr von Sadrani Yash

Mehr von Sadrani Yash (12)

Why social media marketing is inevitable for educational institutes Business ...
Why social media marketing is inevitable for educational institutes Business ...Why social media marketing is inevitable for educational institutes Business ...
Why social media marketing is inevitable for educational institutes Business ...
 
Budget 2014 effects,impacts and benifits
Budget 2014 effects,impacts and benifitsBudget 2014 effects,impacts and benifits
Budget 2014 effects,impacts and benifits
 
Facebook
FacebookFacebook
Facebook
 
Mission vision statements of top companies
Mission vision statements of top companiesMission vision statements of top companies
Mission vision statements of top companies
 
Cost Accounting
Cost AccountingCost Accounting
Cost Accounting
 
BREAK EVEN ANALYSIS
BREAK EVEN ANALYSISBREAK EVEN ANALYSIS
BREAK EVEN ANALYSIS
 
Entreprenure case report
Entreprenure case reportEntreprenure case report
Entreprenure case report
 
E commerce
E commerce E commerce
E commerce
 
HDFC Bank Industrial Report
HDFC Bank Industrial Report  HDFC Bank Industrial Report
HDFC Bank Industrial Report
 
Presentation on economic case
Presentation on economic casePresentation on economic case
Presentation on economic case
 
CSR of ITC
CSR of ITCCSR of ITC
CSR of ITC
 
Cadbury products, history and takeovers
Cadbury products, history and takeoversCadbury products, history and takeovers
Cadbury products, history and takeovers
 

Kürzlich hochgeladen

Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
AnaAcapella
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 

Kürzlich hochgeladen (20)

Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Spellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please PractiseSpellings Wk 3 English CAPS CARES Please Practise
Spellings Wk 3 English CAPS CARES Please Practise
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 

Statistics basics

  • 1. STATISTICS BASICS BBA CENTERED yash sadrani RK UNIVERSITY RAJKOT
  • 2. 1 CONTENTS 1. Hypothesis 2. Null hypothesis 3. Regression 4. cor•re•la•tion 5. Exponential Distribution 6. Alternative hypothesis 7. Central tendency 8. Central tendency 9. Bayes' theorem 10. Chebyshev’s Theorem 11. Simple random sampling 12. Descriptive statistics 13. Statistical inference 14. Characteristics of good estimator 15. properties of the test for independence
  • 3. 2 16. Utility of regression studies 17. Advantages of Sample Surveys 18. The hypergeometric distribution 19. leptokurtic distribution 20. the interquartile range
  • 4. 3 Hypothsys When a possible correlation or similar relation between phenomena is investigated, such as, for example, whether a proposed remedy is effective in treating a disease, that is, at least to some extent and for some patients, the hypothesis that a relation exists cannot be examined the same way one might examine a proposed new law of nature: in such an investigation a few cases in which the tested remedy shows no effect do not falsify the hypothesis. Instead, statistical tests are used to determine how likely it is that the overall effect would be observed if no real relation as hypothesized exists. If that likelihood is sufficiently small (e.g., less than 1%), the existence of a relation may be assumed. Otherwise, any observed effect may as well be due to pure chance. In statistical hypothesis testing two hypotheses are compared, which are called the null hypothesis and the alternative hypothesis. The null hypothesis is the hypothesis that states that there is no relation between the phenomena whose relation is under investigation, or at least not of the form given by the alternative hypothesis. The alternative hypothesis, as the name suggests, is the alternative to the null hypothesis: it states that there is some kind of relation. The alternative hypothesis may take several forms, depending on the nature of the hypothesized relation; in particular, it can be two-sided (for example: there is some effect, in a yet unknown direction) or one-sided (the direction of the hypothesized relation, positive or negative, is fixed in advance). Conventional significance levels for testing the hypotheses are .10, .05, and .01. Whether the null hypothesis is rejected and the alternative hypothesis is accepted, all must be determined in advance, before the observations are collected or inspected. If these criteria are determined later, when the data to be tested is already known, the test is invalid. It is important to mention that the above procedure is actually dependent on the number of the participants (units or sample size) that is included in the study. For instance, the sample size may be too small to reject a null hypothesis and, therefore, is recommended to specify the sample size from the beginning. It is advisable to define a small, medium and large effect size for each of a number of the important statistical tests which are used to test the hypotheses. A statistical hypothesis test is a method of statistical inference using data from a scientific study. In statistics, a result is called statistically significant if it has been predicted as unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level. The phrase "test of significance" was coined by statistician Ronald Fisher.[1] These tests are used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance; this can help to decide whether results contain enough information to cast doubt
  • 5. 4 onconventional wisdom, given that conventional wisdom has been used to establish the null hypothesis. The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis. Statistical hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis, which may not have pre-specified hypotheses. Example 1 – Philosopher's beans The following example was produced by a philosopher describing scientific methods generations before hypothesis testing was formalized and popularized. Few beans of this handful are white. Most beans in this bag are white. Therefore: Probably, these beans were taken from another bag. This is an hypothetical inference. The beans in the bag are the population. The handful are the sample. The null hypothesis is that the sample originated from the population. The criterion for rejecting the null-hypothesis is the "obvious" difference in appearance (an informal difference in the mean). The interesting result is that consideration of a real population and a real sample produced an imaginary bag. The philosopher was considering logic rather than probability. To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard. A simple generalization of the example considers a mixed bag of beans and a handful that contain either very few or very many white beans. The generalization considers both extremes. It requires more calculations and more comparisons to arrive at a formal answer, but the core philosophy is unchanged; If the composition of the handful is greatly different that of the bag, then the sample probably originated from another bag. The original example is termed a one-sided or a one-tailed test while the generalization is termed a two-sided or two-tailed test.
  • 6. 5 Null hypothesis In statistical inference of observed data of a scientific experiment, the null hypothesis refers to a general or default position: that there is no relationship between two measured phenomena,[1] or that a potential medical treatment has no effect.[2] Rejecting or disproving the null hypothesis – and thus concluding that there are grounds for believing that there is a relationship between two phenomena or that a potential treatment has a measurable effect – is a central task in the modern practice of science, and gives a precise sense in which a claim is capable of being proven false. In statistical significance, the null hypothesis is often denoted H0 (read “H-naught”), is generally assumed true until evidence indicates otherwise (e.g., H0: μ = 500 hours). The concept of a null hypothesis is used differently in two approaches to statistical inference, though, problematically, the same term is used. In the significance testing approach of Ronald Fisher, a null hypothesis is potentially rejected or disproved on the basis of data that is significantly under its assumption, but never accepted or proved. In the hypothesis testing approach of Jerzy Neyman and Egon Pearson, a null hypothesis is contrasted with an alternative hypothesis, and these are decided between on the basis of data, with certain error rates. These two approaches criticized each other, though today a hybrid approach is widely practiced and presented in textbooks. This hybrid is in turn criticized as incorrect and incoherent – see statistical hypothesis testing. Statistical significance plays a pivotal role in statistical hypothesis testing where it is used to determine if a null hypothesis can be rejected or retained. regression In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'Criterion Variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a probability distribution. 1. (Psychology) psychol the adoption by an adult or adolescent of behaviour more appropriate to a child, esp as a defence mechanism to avoid anxiety 2. (Statistics) statistics
  • 7. 6 a. the analysis or measure of the association between one variable (the dependent variable) and one or more other variables (the independent variables), usually formulated in an equation in which the independent variables have parametric coefficients, which may enable future values of the dependent variable to be predicted b. (as modifer): regression curve. 3. (Astronomy) astronomy the slow movement around the ecliptic of the two points at which the moon's orbit intersects the ecliptic. One complete revolution occurs about every 19 years 4. (Geological Science) geology the retreat of the sea from the land 5. (Statistics) the act of regressing 6. (Logic) the act of regressing cor·re·la·tion In statistics, dependence is any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation). 1. A causal, complementary, parallel, or reciprocal relationship, especially a structural, functional, or qualitative correspondence between two comparable entities: a correlation between drug abuse and crime. 2. Statistics The simultaneous change in value of two numerically valued random variables: the positive correlation between cigarette smoking and the incidence of lung cancer; the negative correlation between age and normal vision. 3. An act of correlating or the condition of being correlated.
Exponential Distribution

The exponential distribution is used to describe survival times. Suppose that some device has the same hazard rate λ at each moment; the survival time is therefore 1/λ on average. Let the random variable X denote the time of failure. X then follows the exponential distribution with parameter λ. The probability density function of X is

f_X(x) = \lambda e^{-\lambda x} \text{ for } x \ge 0, \qquad f_X(x) = 0 \text{ otherwise.}

The expected value of X is 1/λ and the variance is 1/λ².

Example: A man enters a bank at 4 pm. There is one person in front of him in the queue. Suppose that the length of time an individual spends with a teller is an exponential random variable with mean 7 minutes. Let X be the length of time the man in front spends with the teller; then λ = 1/7 and X ~ Exp(1/7). The probability that the man who entered the bank at 4 pm has to wait more than 10 minutes to be served is P(X > 10) = 1 − F_X(10), where F_X is the cumulative distribution function of the exponential distribution:

F_X(t) = \int_0^t \lambda e^{-\lambda x}\, dx = \left[ -e^{-\lambda x} \right]_0^t = 1 - e^{-\lambda t}.

The probability that the man has to wait more than 10 minutes is therefore

1 - \left(1 - e^{-\lambda t}\right) = e^{-10/7} \approx 0.240.
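As a sanity check on the arithmetic above, here is a minimal sketch using only the standard library; the Monte Carlo part is just an illustrative cross-check, not part of the original example.

```python
# Check of the bank-queue example: P(X > 10) for an exponential
# random variable with mean 7 (i.e. rate lambda = 1/7).
import math
import random

rate = 1 / 7

# Closed form: the survival function of the exponential distribution.
exact = math.exp(-rate * 10)
print(f"exact     P(X > 10) = {exact:.3f}")   # ~ 0.240

# Monte Carlo cross-check with the standard library.
random.seed(0)
n = 100_000
hits = sum(random.expovariate(rate) > 10 for _ in range(n))
print(f"simulated P(X > 10) = {hits / n:.3f}")
```

Type I and type II errors

In statistics, a null hypothesis is a statement that the thing being studied produces no effect or makes no difference. An example of a null hypothesis is the statement "This diet has no effect on people's weight." Usually an experimenter frames a null hypothesis with the intent of rejecting it: that is, intending to run an experiment which produces data showing that the thing under study does make a difference.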
A type I error (or error of the first kind) is the incorrect rejection of a true null hypothesis. It is a false positive. Usually a type I error leads one to conclude that a supposed effect or relationship exists when in fact it doesn't. Examples of type I errors include a test that shows a patient to have a disease when in fact the patient does not have the disease, a fire alarm going off when in fact there is no fire, or an experiment indicating that a medical treatment should cure a disease when in fact it does not.

A type II error (or error of the second kind) is the failure to reject a false null hypothesis. It is a false negative. Examples of type II errors would be a blood test failing to detect the disease it was designed to detect in a patient who really has the disease, a fire alarm failing to ring when a fire breaks out, or a clinical trial of a medical treatment failing to show that the treatment works when it really does.

Alternative hypothesis

In statistical hypothesis testing, the alternative hypothesis (or maintained hypothesis or research hypothesis) and the null hypothesis are the two rival hypotheses compared by a statistical hypothesis test. An example might be where water quality in a stream has been observed over many years, and a test is made of the null hypothesis that there is no change in quality between the first and second halves of the data, against the alternative hypothesis that the quality is poorer in the second half of the record.

Central tendency

In statistics, a central tendency (or, more commonly, a measure of central tendency) is a central value or a typical value for a probability distribution.[1] It is occasionally called an average or just the center of the distribution. The most common measures of central tendency are the arithmetic mean, the median and the mode. A central tendency can be calculated for either a finite set of values or for a theoretical distribution, such as the normal distribution.

Occasionally authors use central tendency (or centrality) to mean "the tendency of quantitative data to cluster around some central value".[2][3] This meaning might be expected from the usual dictionary definitions of the words tendency and centrality. Those authors may judge whether data has a strong or a weak central tendency based on the statistical dispersion, as measured by the standard deviation or something similar.
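The three common measures of central tendency are easy to compute. A minimal sketch, with a small made-up sample:

```python
# Computing the three common measures of central tendency
# for a small invented sample.
from statistics import mean, median, mode

data = [2, 3, 3, 4, 5, 7, 11]

print(f"mean   = {mean(data):.2f}")   # arithmetic mean: 5.00
print(f"median = {median(data)}")     # middle value: 4
print(f"mode   = {mode(data)}")       # most frequent value: 3
```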
Bayes' theorem

In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) is a result that is of importance in the mathematical manipulation of conditional probabilities. It derives from the more basic axioms of probability. When applied, the probabilities involved in Bayes' theorem may have any of a number of probability interpretations. In one of these interpretations, the theorem is used directly as part of a particular approach to statistical inference. In particular, with the Bayesian interpretation of probability, the theorem expresses how a subjective degree of belief should rationally change to account for evidence: this is Bayesian inference, which is fundamental to Bayesian statistics. However, Bayes' theorem has applications in a wide range of calculations involving probabilities, not just in Bayesian inference.

An Introduction to Bayes' Theorem

Bayes' theorem is a theorem of probability theory originally stated by the Reverend Thomas Bayes. It can be seen as a way of understanding how the probability that a theory is true is affected by a new piece of evidence. It has been used in a wide variety of contexts, ranging from marine biology to the development of "Bayesian" spam blockers for email systems. In the philosophy of science, it has been used to try to clarify the relationship between theory and evidence. Many insights in the philosophy of science involving confirmation, falsification, the relation between science and pseudoscience, and other topics can be made more precise, and sometimes extended or corrected, by using Bayes' theorem. A small worked example follows.
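The theorem states P(A|B) = P(B|A)·P(A) / P(B). The sketch below applies it to a classic diagnostic-test scenario; all the probabilities are invented illustration values, not data from the text.

```python
# A worked example of Bayes' theorem, P(A|B) = P(B|A) * P(A) / P(B),
# using a classic (and entirely made-up) diagnostic-test scenario.

p_disease = 0.01             # prior: 1% of the population has the disease
p_pos_given_disease = 0.95   # test sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Total probability of a positive result (law of total probability).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")  # ~ 0.161
```

Note how the low prior keeps the posterior modest even with a fairly accurate test; this is the kind of belief revision the theorem formalizes.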
Chebyshev's Theorem

For any set of data (either population or sample) and for any constant k greater than 1, the proportion of the data that must lie within k standard deviations on either side of the mean is at least

1 - \frac{1}{k^2}.

In ordinary words, Chebyshev's Theorem says the following about sample or population data:
1) Start at the mean.
2) Back off k standard deviations below the mean and then advance k standard deviations above the mean.
3) The fraction of the data in the interval so described will be at least 1 − 1/k² (we assume k > 1).

A quick numerical check of this bound is sketched after this section.

Simple random sampling

In a simple random sample (SRS) of a given size, all subsets of the sampling frame are given an equal probability. Furthermore, any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimises bias and simplifies analysis of results. In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results.

However, SRS can be vulnerable to sampling error because the randomness of the selection may result in a sample that doesn't reflect the makeup of the population. For instance, a simple random sample of ten people from a given country will on average produce five men and five women, but any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques attempt to overcome this problem by using information about the population to choose a more "representative" sample.
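Here is the numerical check of Chebyshev's theorem promised above: a minimal sketch, with invented data, comparing the observed fraction of values within k standard deviations of the mean against the guaranteed lower bound 1 − 1/k².

```python
# Numerical check of Chebyshev's theorem: for any data set, at least
# 1 - 1/k^2 of the values lie within k standard deviations of the mean.
# The data are arbitrary illustration values.
from statistics import mean, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9, 12, 15, 18, 25]
m, s = mean(data), pstdev(data)

for k in (1.5, 2, 3):
    lo, hi = m - k * s, m + k * s
    frac = sum(lo <= x <= hi for x in data) / len(data)
    bound = 1 - 1 / k**2
    print(f"k={k}: {frac:.2f} of data in [{lo:.1f}, {hi:.1f}] "
          f"(Chebyshev guarantees at least {bound:.2f})")
```

The observed fractions always meet or exceed the bound; the theorem gives a guarantee that holds for any data set, however skewed.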
One of the best ways to achieve unbiased results in a study is through random sampling. Random sampling means choosing subjects from a population through unpredictable means; in its simplest form, all subjects have an equal chance of being selected out of the population being researched. More formally, it is a method of selecting a sample (a random sample) from a statistical population in such a way that every possible sample that could be selected has a predetermined probability of being selected.

What is the difference between the coefficient of determination and the coefficient of correlation?

The coefficient of correlation is the "R" value given in the summary table of a regression output. R square, also called the coefficient of determination, is obtained by multiplying R by itself; in other words, the coefficient of determination is the square of the coefficient of correlation.

R square, or the coefficient of determination, shows the percentage of variation in y which is explained by all the x variables together. The higher, the better. It is always between 0 and 1; it can never be negative, since it is a squared value. It is easy to explain R square in terms of regression; it is not so easy to explain R in terms of regression.

For example, if the coefficient of correlation is the R value .850 (or 85%), then the coefficient of determination is the R square value .723 (or 72.3%). R square is simply the square of R, i.e. R times R.
Coefficient of Correlation: the degree of relationship between two variables, say x and y. It can range between −1 and 1. A value of 1 indicates that the two variables move in unison: they rise and fall together and have perfect correlation. A value of −1 means that the two variables are perfect opposites: one goes up as the other goes down, in a perfectly negative way. Any two variables can be argued to have a correlation value; if they are not correlated, the correlation value can still be computed, and it would be 0. The correlation value always lies between −1 and 1, passing through 0, which means no correlation at all (perfectly unrelated).

Correlation can be readily explained for simple linear regression, because there is only one x and one y variable. For multiple linear regression R is still computed, but it is then difficult to interpret because multiple variables are involved. That is why R square is the better term: R square can be explained both for simple linear regressions and for multiple linear regressions.

Descriptive statistics

Descriptive statistics is the discipline of quantitatively describing the main features of a collection of information,[1] or the quantitative description itself. Descriptive statistics are distinguished from inferential statistics (or inductive statistics) in that descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory.[2] Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example, in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities.

Some measures commonly used to describe a data set are measures of central tendency and measures of variability or dispersion. Measures of central tendency include the mean, median and mode, while measures of variability include the standard deviation (or variance), the minimum and maximum values of the variables, kurtosis and skewness.[3]
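A minimal sketch of the descriptive measures just listed, computed for a small invented sample with the standard library:

```python
# Common descriptive statistics for a small invented sample.
from statistics import mean, median, mode, stdev

sample = [23, 25, 25, 27, 30, 31, 34, 41]

print(f"n      = {len(sample)}")
print(f"mean   = {mean(sample):.2f}")
print(f"median = {median(sample)}")
print(f"mode   = {mode(sample)}")
print(f"stdev  = {stdev(sample):.2f}")   # sample standard deviation
print(f"min    = {min(sample)}, max = {max(sample)}")
```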
Statistical inference

In statistics, statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation.[1] More substantially, the terms statistical inference, statistical induction and inferential statistics are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation,[2] such as observational errors, random sampling, or random experimentation.[1] Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations.

Inferential statistics are used to test hypotheses and make estimations using sample data. The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy.

Characteristics of a good estimator

There are four main properties associated with a "good" estimator:
1) Unbiasedness: the expected value of the estimator (the mean of the estimator) equals the quantity being estimated. In statistical terms, E(estimate of Y) = Y.
2) Consistency: the estimator converges in probability to the quantity being estimated. In other words, as the sample size grows, the estimator gets closer and closer to the true value.
3) Efficiency: the estimator has a low variance, usually relative to other estimators (this is called relative efficiency); otherwise, the variance of the estimator is minimized.
4) Robustness: the mean-squared error of the estimator is minimized relative to other estimators. The estimator should also remain unbiased and consistent.

Stats: Test for Independence

In the test for independence, the claim is that the row and column variables are independent of each other; this is the null hypothesis. The multiplication rule says that if two events are independent, then the probability of both occurring is the product of the probabilities of each occurring. This is key to working the test for independence. If you end up
rejecting the null hypothesis, then the assumption must have been wrong and the row and column variables are dependent. Remember, all hypothesis testing is done under the assumption that the null hypothesis is true.

The test statistic used is the same as in the chi-square goodness-of-fit test, and the principle behind the test for independence is the same as the principle behind the goodness-of-fit test. The test for independence is always a right-tail test. In fact, you can think of the test for independence as a goodness-of-fit test where the data are arranged into table form; this table is called a contingency table.

The test statistic has a chi-square distribution when the following assumptions are met:
- The data are obtained from a random sample.
- The expected frequency of each category is at least 5.

The following are properties of the test for independence (a worked sketch follows this list):
- The data are the observed frequencies.
- The data are arranged into a contingency table.
- The degrees of freedom are the degrees of freedom for the row variable times the degrees of freedom for the column variable. It is not one less than the sample size; it is the product of the two degrees of freedom.
- It is always a right-tail test.
- It has a chi-square distribution.
- The expected value is computed by taking the row total times the column total and dividing by the grand total.
- The value of the test statistic doesn't change if the order of the rows or columns is switched.
- The value of the test statistic doesn't change if the rows and columns are interchanged (transpose of the matrix).
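Here is the worked sketch referred to above: the chi-square test statistic for independence on a 2×2 contingency table. The observed frequencies are invented illustration values.

```python
# Chi-square test statistic for independence on a 2x2 contingency
# table of made-up observed frequencies.

observed = [[30, 20],    # e.g. rows = group A/B, columns = pass/fail
            [10, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count = (row total * column total) / grand total.
        exp = row_totals[i] * col_totals[j] / grand
        chi_sq += (obs - exp) ** 2 / exp

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_sq:.2f} with {df} degree(s) of freedom")
# Compare against a chi-square critical value (3.84 at the 0.05 level
# for 1 df); larger values lead to rejecting independence.
```

Note that the degrees of freedom come out as (rows − 1) × (columns − 1), exactly as the property list states.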
Utility of regression studies

Regression models can be used to help understand and explain relationships among variables; they can also be used to predict actual outcomes. In a course on regression, one learns how multiple linear regression models are derived, how to use software to implement them, what assumptions underlie the models, how to test whether the data meet those assumptions and what can be done when they are not met, and how to develop strategies for building and understanding useful models.

Advantages of Sample Surveys

Cost Reduction: In most cases, conducting a sample survey costs less than a census survey. If fewer people are surveyed, fewer surveys need to be produced, printed, shipped, administered, and analyzed. Further, fewer data reports are often required, so the amount of time and expense needed to analyze and distribute the results is reduced.

Generalizability of Results: If conducted properly, the results of a sample survey can still be generalized to the entire population, meaning that the sample results can be considered representative of the views of the entire target population. Sampling strategies should be firmly aligned with the overarching survey goals to ensure the use of a proper sample frame and sample size.

Timeliness: Sample surveys can typically be printed, distributed, administered, and analyzed more quickly than census surveys. As a result, a shorter turnaround time for results is often achieved.

Identification of Strengths & Opportunities: As with census surveys, results from a properly conducted sample survey can be used to identify strengths and opportunities and to develop plans for meaningful change.

Cost: By comparison with a complete enumeration of the same population, a sample may be based on data for only a small number of the units comprising that population. A sample survey may thus be very much less expensive to conduct than a comparable complete enumeration.
Time: Being small in scale, a sample survey is not only less expensive than a census; the desired information is also obtained in much less time.

Scope: The smaller scale is likely to permit the collection of a wider range of survey data and to allow a wider choice of methods of observation, measurement or questioning than is usually feasible with a complete enumeration.

Respondents' Convenience: A sample survey considerably reduces the overall burden on respondents, in that only a few, not all, of the individuals in the population are put to the trouble of having to answer questions or provide information.

Labor: Sampling saves labor. A smaller staff is required, both for fieldwork and for tabulating and processing the data.

Flexibility: In certain types of investigation, highly skilled and trained personnel or even specialized equipment are needed to collect data. A complete enumeration in such cases is impracticable, and hence sample surveys, being more flexible and of greater scope, are more appropriate for this type of inquiry.

Data Processing: The data-processing requirement for a sample survey is likely to be much less than for a complete count. Whereas a complete count may well require a computer to process the data, a sample survey can often be processed manually, with fewer people and less logistic support.

Accuracy: A sample survey can employ personnel of higher quality, with intensive training, and more careful supervision of fieldwork is possible. As a result, observations, measurements, and questioning for a sample survey can often be carried out more carefully, yielding results subject to smaller non-sampling error than is generally practicable in a more extensive complete enumeration, usually at a much lower cost.

Feasibility: There are situations where complete enumeration is not feasible and a sample survey is therefore necessary. There are also instances where it is not practicable to enumerate all the units because of their perishable or fragile nature; the alternative in such situations is to take only a few of the units. For example, consider the problem of checking the quality of mango juice produced by a local company: one way to test the quality would be to drink the entire lot, which is impracticable. Testing of electric bulbs, screws, glass, or medicine are all examples of this type, where sampling is necessary.
The hypergeometric distribution

The hypergeometric distribution applies to sampling without replacement from a finite population whose elements can be classified into two mutually exclusive categories, like Pass/Fail, Male/Female or Employed/Unemployed. As random selections are made from the population, each subsequent draw decreases the population, causing the probability of success to change with each draw. The following conditions characterise the hypergeometric distribution:
- The result of each draw can be classified into one of two categories.
- The probability of a success changes on each draw.

A random variable X follows the hypergeometric distribution if its probability mass function (pmf) is given by:[1]

P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}

where:
- N is the population size
- K is the number of success states in the population
- n is the number of draws
- k is the number of successes
- \binom{a}{b} is a binomial coefficient

The pmf is positive when \max(0,\, n + K - N) \le k \le \min(K, n).
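The pmf above translates directly into code. A minimal sketch, using an invented example of drawing 10 items without replacement from a lot of 50 that contains 5 defectives:

```python
# The hypergeometric pmf given above, built from the binomial
# coefficients in the standard library. The lot/defect numbers are
# an invented illustration.
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k successes in n draws from N items containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

N, K, n = 50, 5, 10
# k runs from max(0, n + K - N) to min(K, n), here 0 to 5.
for k in range(0, 6):
    print(f"P(X = {k}) = {hypergeom_pmf(k, N, K, n):.4f}")
```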
leptokurtic distribution

In probability theory and statistics, kurtosis (from the Greek word κυρτός, kyrtos or kurtos, meaning curved, arching) is any measure of the "peakedness" of the probability distribution of a real-valued random variable.[1] In a similar way to the concept of skewness, kurtosis is a descriptor of the shape of a probability distribution and, just as for skewness, there are different ways of quantifying it for a theoretical distribution and corresponding ways of estimating it from a sample from a population. There are various interpretations of kurtosis, and of how particular measures should be interpreted; these are primarily peakedness (width of peak), tail weight, and lack of shoulders (distribution primarily peak and tails, not in between).

One common measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment of the data or population, but it has been argued that this really measures heavy tails, and not peakedness.[2] For this measure, higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations. It is common practice to use an adjusted version of Pearson's kurtosis, the excess kurtosis, to compare the shape of a given distribution to that of the normal distribution. Distributions with negative or positive excess kurtosis are called platykurtic or leptokurtic distributions, respectively.

the interquartile range

The interquartile range (IQR), also called the midspread or middle fifty, is a measure of statistical dispersion, being equal to the difference between the upper and lower quartiles:[1][2] IQR = Q3 − Q1. In other words, the IQR is the first quartile subtracted from the third quartile; these quartiles can be clearly seen on a box plot of the data. It is a trimmed estimator, defined as the 25% trimmed mid-range, and is the most significant basic robust measure of scale.
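Both quantities above can be computed in a few lines. A minimal sketch, with invented data, of the IQR and a simple moment-based excess kurtosis (one of several estimators mentioned in the text; bias-corrected sample versions exist but are omitted here):

```python
# IQR and a simple (population-moment) excess kurtosis for an
# invented data set, using only the standard library (Python 3.8+).
from statistics import mean, quantiles

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

# Interquartile range: Q3 - Q1. quantiles(..., n=4) returns the
# three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")

# Excess kurtosis via moments: m4 / m2^2 - 3, which is 0 for a normal
# distribution; positive = leptokurtic, negative = platykurtic.
m = mean(data)
m2 = sum((x - m) ** 2 for x in data) / len(data)
m4 = sum((x - m) ** 4 for x in data) / len(data)
print(f"excess kurtosis = {m4 / m2**2 - 3:.2f}")
```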