2. Objective
• Introduce the concept of distorting “anti-statistics”,
illustrate how “anti-statistics” can be identified and define
how statistics should be constructed to yield insight and
meaning
May 18, 2010 2
3. Statistics
• A statistic has two roles - primary and secondary
− Primary - to summarise and describe the data while preserving
information and reducing the volume of raw data
− Secondary - to provide and enable insight
• Where an alleged statistic does not perform these
functions it is an “anti-statistic”
− Distorting the underlying information (raw data), either
deliberately or accidentally
− Not providing insight or providing an inaccurate view of the
underlying information
• Most people are scared of large sets of numbers
− Anti-statistics exploit this fear
5. Statistics - Primary Function
• To describe the data while preserving information and
reducing the volume of raw data
• This means taking a large amount of raw data, producing
descriptive summaries while not losing or distorting the
underlying raw data
• This is the more important function of a statistic
6. Statistics - Secondary Function
• To provide and enable insight
• By reducing the volume of raw data, you can gain insight
into what the data means
− Enabling you to see the wood for the trees, know the amount
and type of wood and make decisions about the use of the wood
• The secondary function can only be served once the primary function is satisfied
7. Data, Information, Knowledge and Action Cycle
• Good statistics provide information that creates knowledge and enables correct actions
[Diagram: cycle linking Data, Information, Knowledge and Action]
9. Sample Information
• 4,000 numbers representing the annual salaries of
individuals
− Sample data only
• 100% of the information is available here
• Very hard to see patterns, understand the situation, gain
insight, make effective decisions and understand their
consequences
• The numbers do not lie but they are innocent creatures
and can be made to lie
• Need techniques that extract meaning and provide insight
without losing the information the data represents
10. Statistics
• I can take all this …
• … And give you one derived number (average)
− 107941.931
11. Statistic
• 4,000 numbers reduced to 1
• Reduced the amount of data by 99.975% (another
“statistic”)
• But I have lost information
• Average value of 107941.931 is at best a simplistic view of
the data and at worst a distortion that misrepresents the
source data
• If I use the average without looking to understand the raw
data in more detail I am potentially creating a distortion
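The reduction figure is simple arithmetic, and the same few lines show how a single average is produced. This is a minimal sketch: the `salaries` list is a small made-up stand-in, not the 4,000 values from the sample.

```python
from statistics import mean

# Reducing 4,000 numbers to 1 keeps one value in every 4,000.
original_count = 4000
reduced_count = 1
reduction_pct = (1 - reduced_count / original_count) * 100
print(f"Data volume reduced by {reduction_pct:.3f}%")

# Hypothetical stand-in for the salary data (NOT the original sample).
salaries = [23958, 41200, 67500, 97909, 120400, 310000]
print(f"Average of stand-in data: {mean(salaries):.2f}")
```

The single average says nothing about how the six stand-in values are spread, which is the information loss the slide describes.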
12. More Statistics
Statistic | Description | Value
Average | Sum of all the values divided by the number of values | 107941.93
Standard Deviation | A measure of how widely values are dispersed from the average value | 59904.19
Kurtosis | Value that describes the relative peakedness or flatness of a distribution, where a positive value indicates a relatively peaked distribution and a negative value indicates a relatively flat distribution | 0.112
Skewness | A measure of the asymmetry of a distribution around the average, where a positive value indicates a distribution with an asymmetric tail extending toward more positive values and a negative value indicates a distribution with an asymmetric tail extending toward more negative values | 0.731
Mode | The most frequently occurring value | 23958
Median | The number in the middle, where half the numbers have values that are greater than the median and half have values that are less; also called the 50th percentile | 97909.5
• Be careful what statistics are used
• Do not generate statistics just because you can
• The use of statistics can give a false impression of certainty or meaning where there is none
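As a rough illustration of how the statistics in the table are derived, the sketch below computes them for a small hypothetical sample. The `skewness` and `excess_kurtosis` helpers are assumptions written for this example: they use plain moment-based (population) formulas, so their results may differ slightly from the sample-adjusted versions that some tools report.

```python
from statistics import mean, median, mode, pstdev

def skewness(values):
    """Moment-based (population) skewness: the third standardised moment."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Moment-based (population) excess kurtosis: fourth standardised moment minus 3."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3

# Hypothetical salaries for illustration only (NOT the original sample).
salaries = [20000, 30000, 30000, 40000, 50000, 70000, 250000]

print("Average:           ", mean(salaries))
print("Standard deviation:", round(pstdev(salaries), 2))
print("Median:            ", median(salaries))
print("Mode:              ", mode(salaries))
print("Skewness:          ", round(skewness(salaries), 3))
print("Excess kurtosis:   ", round(excess_kurtosis(salaries), 3))
```

Being able to print six numbers does not make all six meaningful for this data, which is the caution in the bullets above.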
13. Interpreting the Statistics
Statistic | Value | Interpretation
Average | 107941.93 | The average is higher than the median, indicating that the data is dispersed unequally towards higher values
Standard Deviation | 59904.19 | The high standard deviation indicates the underlying data is spread across a wide range of values
Kurtosis | 0.112 | The positive value indicates that there is a peak in the data
Skewness | 0.731 | The positive value indicates a distribution with an unequal and heavy tail extending toward higher values
Mode | 23958 | In a large set of data where only a small number of data values are the same, this is meaningless
Median | 97909.5 | When the median is less than the average, it means the data is unequally distributed with a heavy tail extending toward higher values
• I now know that the data is clustered at lower values and has a
heavy tail, indicating a small number of people earning large salaries
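The relationship between the average and the median can be seen with a handful of numbers. In this sketch the salary figures are hypothetical; it shows how one very large value pulls the average up while the median barely moves.

```python
from statistics import mean, median

# Hypothetical salaries for illustration only.
typical = [30000, 35000, 40000, 45000, 50000]
with_high_earner = typical + [400000]

for label, data in (("typical", typical),
                    ("with one high earner", with_high_earner)):
    print(f"{label}: average={mean(data):.0f}, median={median(data):.0f}")
```

An average well above the median is therefore a hint of a heavy tail of high values, not a description of a typical salary.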
14. Let’s Take a Look at the Data
[Histogram: Number of People (0 to 60) against Annual Salary (0 to 300,000)]
15. Let’s Take a Look at the Data
• Characteristics
− Increases quickly from zero
− Distribution concentrated to the left (positively skewed)
− Clustered around lower values
− Gradual drop from peak
− Heavy tail
• This type of data distribution is very common
[Histogram: Number of People (0 to 60) against Annual Salary (0 to 300,000), annotated with the characteristics listed above]
16. Statistics
• The usefulness of a statistic depends on the underlying data
• Average really only makes sense when the data is symmetrically/equally distributed
− Otherwise, the average is distorted because of unequal distribution of data
• Deviation also really only makes sense when the data is symmetrically distributed
[Chart: symmetric bell-shaped distribution, horizontal axis from -5 to about 4.7, vertical axis from 0 to 0.4]
17. Statistics
• Be careful of obscure statistics such as Kurtosis and
Skewness
• They have a use but the meaning is quite specific and may
not be appropriate
18. Descriptive Statistics
• Look for statistics that contain
− Measures of data location and clustering
− Measures of dispersion and variability
− Measures of association
• Look at the underlying data, how it was collected, what it
measures
− If the data is of poor quality or measures the wrong values, any
derived information will have very limited worth
• There are lots of statistics that can be produced from the
raw data
− Produce only meaningful statistics
− Do not throw statistics at the data
19. Some Common Descriptive and Summarising Statistics
Statistic Type | Statistic | Description
Data Location and Clustering | Average | Simple average
Data Location and Clustering | Weighted Average | Average of values weighted according to a value such as their importance
Data Location and Clustering | Truncated/Interpercentile Average | Average of centralised subset of data
Data Location and Clustering | Median | The 50th percentile
Data Location and Clustering | Mode | The most commonly occurring value
Dispersion, Variability and Shape | Variance | Measure of the amount of variation within the data
Dispersion, Variability and Shape | Standard Deviation | Square root of the Variance
Dispersion, Variability and Shape | Range | The spread of the data values
Dispersion, Variability and Shape | Skewness | Measure of the asymmetry of the distribution of the data
Dispersion, Variability and Shape | Kurtosis | Measure of the “peakedness” and the length of the tail of the distribution of the data
Dispersion, Variability and Shape | Percentiles | Value below which a certain percent of the data fall
Association | Correlation | Correlation has a specific meaning that may not be relevant to the data
20. Another Look at the Sample Data
[Chart: Annual Salary (0 to 320,000) against Percentage Earning Up to Salary Amount (0% to 100%, in steps of 5%)]
• This shows the cumulative percentage of the people surveyed
who earn up to each salary amount
22. Percentiles
• A percentile of a set of data is the value below
which that percentage of the data lies
• Median = 50th percentile
− Value below which 50% of data lies
• Quartiles are the 25th, 50th and 75th percentiles
• Percentiles are useful in summarising data
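A short sketch of these definitions, using Python's standard `statistics` module on a small hypothetical list. Note that tools interpolate between data points in slightly different ways, so the first and third quartiles can vary a little from one tool to another; the middle quartile is the median.

```python
from statistics import median, quantiles

# Hypothetical values for illustration only.
values = [12, 15, 18, 21, 24, 27, 30, 33, 36]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(values, n=4)

print(f"25th percentile (first quartile): {q1}")
print(f"50th percentile (median):         {q2}")
print(f"75th percentile (third quartile): {q3}")
print(f"Median computed directly:         {median(values)}")
```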
23. Percentiles for Sample Data
• This …
• … becomes this …
• 4,000 numbers reduced to 10 numbers
− 10% of people earn 38,332 or less
− 20% of people earn 54,834 or less
− 10% of people earn between 192,871 and 299,433
• Successfully reduced the volume of data while preserving more information
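The reduction from 4,000 numbers to 10 can be sketched as follows. The data here is synthetic: a seeded, right-skewed random sample generated purely for illustration, so the printed values will not match the figures on the slide.

```python
import random
from statistics import quantiles

random.seed(18)  # fixed seed so the sketch is repeatable

# Synthetic right-skewed "salaries"; NOT the original data set.
salaries = [round(random.lognormvariate(11.4, 0.55)) for _ in range(4000)]

# n=10 gives 9 cut points: the 10th, 20th, ..., 90th percentiles.
deciles = quantiles(salaries, n=10)

# Adding the maximum gives 10 numbers covering 10% to 100% of people.
summary = deciles + [max(salaries)]

for pct, value in zip(range(10, 101, 10), summary):
    print(f"{pct:>3}% of people earn {value:,.0f} or less")
```

Unlike a single average, the ten numbers still show where values cluster and how long the upper tail is.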
24. Anti-Statistics
• Unfortunately, they are everywhere
• Take a number of general forms or types such as
− Statement based on measurement of incorrect value
− Statement without scale or reference
− Statement based on grouping of categories (with possible
distortion of categories)
− Statements based on inaccurate or unspecified association or
correlation
25. Sample Type 1 Anti-Statistic
• Chimpanzee DNA is 99.7% the same as Human DNA
• What does this statement mean?
− Do chimpanzees make cars/houses/PCs/etc. that are 99.7% as
good as those made by humans?
• If the statement is true then what is being measured may
be invalid, such as
• 000000000000000000000000 and 000000000000000000000001
• These numbers are 99% the same based on the length of the lines in their
characters
− Or
• A lot of DNA is not involved in the development process and this is being
included in measurements
− Or
• A small change in DNA has a substantial impact on what is produced
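The point that a similarity percentage depends entirely on what is being measured can be sketched with the two strings above. Character-by-character matching is only one possible measure, chosen here as an assumption for illustration; it is not necessarily the measure behind the figure quoted on the slide.

```python
a = "000000000000000000000000"
b = "000000000000000000000001"

# Measure 1: proportion of positions holding the same character.
matching = sum(1 for x, y in zip(a, b) if x == y)
char_similarity = matching / len(a) * 100
print(f"Character-level similarity: {char_similarity:.1f}%")

# Measure 2: do the two strings represent the same number?
print(f"Same numeric value: {int(a) == int(b)}")
```

The strings differ in a single character, yet they represent different values, so a high similarity score says little about whether the difference matters.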
26. Sample Type 2 Anti-Statistic
• Statements of the form
− X is the greatest cause of Y, such as
• Car crashes are the greatest cause of deaths among males in their 20s and
30s
• Meaningless because there is no scale or reference point
• Statement creates an impression of scale and severity that
is at best not justified or at worst incorrect
• Take a look at the underlying life expectancy data
27. Type 2 Anti-Statistic
• Probability of a person dying within a year at each year of life
[Chart: Probability of Dying Within One Year (0 to 0.6) against age in years]
• Probability of a person dying within a year for first 35 years
[Chart: Probability of Dying Within One Year (0 to 0.0045) against age in years (0 to 35)]
28. Type 2 Anti-Statistic
• The underlying life expectancy data shows that young
people have very little chance of dying
• Death rates are uniformly very low after the first year of
life until about age 50
• So a statement such as
− Car crashes are the greatest cause of deaths among males in their
20s and 30s
• Will inevitably be true because nothing else really kills
young males
− Death due to illness is uncommon among this group so any other
cause will dominate
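The gap between "greatest cause" and actual risk can be put into numbers. Both figures in this sketch are hypothetical assumptions chosen for illustration; they are not taken from any life table.

```python
# Hypothetical figures for illustration only.
annual_death_probability = 0.001  # assumed chance of a young male dying in a year
share_due_to_car_crashes = 0.40   # assumed share of those deaths due to car crashes

# A large share of a very small probability is still a very small probability.
absolute_risk = annual_death_probability * share_due_to_car_crashes

print(f"Share of deaths attributed to car crashes: {share_due_to_car_crashes:.0%}")
print(f"Absolute annual risk: {absolute_risk:.4%} "
      f"(about 1 in {round(1 / absolute_risk):,})")
```

Stating the share without the absolute risk is what removes the scale and reference point.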
29. Sample Type 3 Anti-Statistic
• Statements of the form
− N% of people do/have done X at least N times/with defined frequency
− Typically arise as the results of tendentious surveys designed to create a false
impression of severity
• Such as
− 75% of people admit to X up to N times a year
• No indication of how the 75% is spread across the range of 1 to N times
− 65% of people admit to having a negative experience up to N times due to X
• No indication of the spread of negative experiences across the range of 1 to N
• Generally a result of combining the responses to two or more
questions or categories
− How often have you done/experienced X?
• Once
• Twice
• Three times
• …
30. Type 3 Anti-Statistic
• How often have you done/experienced X?
− Once: 45%
− Twice: 10%
− Three times: 8%
− 4-8 times: 5%
− 8-12 times: 2%
• Total of these is 70%
• A statement that 70% of people
have done/experienced X up to
12 times a year distorts the
distribution of the underlying
data, which is skewed towards
lower rates of occurrence
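Using the category percentages listed on this slide, a short sketch shows how much of the combined figure comes from the lowest category. The dictionary simply restates the survey options; nothing else is assumed.

```python
# Category percentages as listed on the slide.
responses = {
    "Once": 45,
    "Twice": 10,
    "Three times": 8,
    "4-8 times": 5,
    "8-12 times": 2,
}

combined = sum(responses.values())

for category, pct in responses.items():
    share = pct / combined
    print(f"{category:<12} {pct:>3}%  ({share:.0%} of the combined figure)")
print(f"Combined 'up to 12 times' figure: {combined}%")
```

Most of the combined figure is people who answered "Once", which the headline statement hides.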
31. Sample Type 4 Anti-Statistic
• Statements of the form
− Taking/doing A makes you N% more likely to be/experience B
• Two issues
− Causation – is there a real causal relationship
− Degree of causation – how strong is the causal relationship
• Association does not imply causation
− A might cause B
− B might cause A
− A might cause B and B might cause A
− A might cause C that might cause B
− A might cause C that might cause D … that might cause B
− A might cause C that might cause B and A might cause D that might negatively influence B, but
A-C-B causation is greater than A-D-B negative causation
− Measurement error
− Random data that was skewed
− Deliberate or malicious misrepresentation
• Cause might be partial or contributory
• Be careful of any statement of a relationship that does not demonstrate how
causation happens
32. Association and Causation Scenarios
[Diagrams: causal scenarios linking A and B, either directly or through intermediate factors C and D, with arrows labelled “Causes or Influences” and “Causes or Influences Negatively”]
33. Association and Causation
• Very common scenario where an association or causation
is asserted
[Diagram: A takes or does D; taking or doing D affects or causes B]
34. Association and Causation
• The real association or causation is actually along the lines
of:
− A is a member of a group C
− Members of group C have a greater tendency to take or do D
− Members of group C also take or do E
− Taking or doing E affects or causes B
− A takes or does D, but taking or doing D has little or no effect or influence on B, or even negatively impacts B
35. Type 4 Anti-Statistic
• Occurs very frequently
• A percentage association can give a false sense of certainty
− It only measures how loose or tight the association is
• Often misrepresents the degree of causation
• Unless the precise nature of the causative relationship can
be defined, take with a large dose of salt
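A scenario in which group membership drives both D and B can be sketched with explicit counts. All of the numbers are hypothetical: B occurs in 30% of group C (through E) and in 10% of everyone else, and D is constructed to have no effect at all.

```python
# Hypothetical counts: (in_group_c, takes_d) -> (people, people_with_b).
# D is constructed to have NO effect on B; only group membership matters.
population = {
    (True, True): (800, 240),
    (True, False): (200, 60),
    (False, True): (200, 20),
    (False, False): (800, 80),
}

def rate_of_b(takes_d, in_group_c=None):
    """Proportion with B among people matching takes_d, optionally within one group."""
    people = with_b = 0
    for (c, d), (n, b) in population.items():
        if d == takes_d and (in_group_c is None or c == in_group_c):
            people += n
            with_b += b
    return with_b / people

overall_ratio = rate_of_b(True) / rate_of_b(False)
print(f"Overall: taking D appears to make B {overall_ratio - 1:.0%} more likely")

for group in (True, False):
    print(f"In group C = {group}: B rate with D {rate_of_b(True, group):.0%}, "
          f"without D {rate_of_b(False, group):.0%}")
```

The overall comparison produces an impressive "N% more likely" figure, while within each group D makes no difference, which is why a percentage association needs an explanation of how the causation happens.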
36. Summary
• Statistics are designed to provide insight without distorting
the meaning of the underlying data or losing information
• Anti-statistics are used to distort the underlying data to
create false impressions
• So there are Lies, Damn Lies and Anti-Statistics
37. More Information
Alan McSweeney
alan@alanmcsweeney.com