Lies, Damn Lies and Anti-Statistics


Alan McSweeney
Objective

•   Introduce the concept of distorting “anti-statistics”,
    illustrate how “anti-statistics” can be identified and define
    how statistics should be constructed to yield insight and
    meaning




    May 18, 2010                                                    2
Statistics

•   A statistic has two roles - primary and secondary
      − Primary - to summarise and describe the data while preserving
        information and reducing the volume of raw data
      − Secondary - to provide and enable insight
•   Where an alleged statistic does not perform these
    functions it is an “anti-statistic”
      − Distorting the underlying information (raw data), either
        deliberately or accidentally
      − Not providing insight or providing an inaccurate view of the
        underlying information
•   Most people are scared of large sets of numbers
      − Anti-statistics exploit this fear
Statistics and Anti-Statistics

• Statistics                     • Anti-Statistics


•   Descriptive                  •   Distorting
•   Insightful                   •   Promoting Misinterpretation
•   Informative                  •   Misinformative
•   Enlightening                 •   Concealing




Statistics - Primary Function

•   To describe the data while preserving information and
    reducing the volume of raw data


•   This means taking a large amount of raw data, producing
    descriptive summaries while not losing or distorting the
    underlying raw data


•   This is the more important of the two functions of a statistic



Statistics - Secondary Function

•   To provide and enable insight
•   By reducing the volume of raw data, you can gain insight
    into what the data means
      − Enabling you to see the wood for the trees, know the amount
        and type of wood and make decisions about the use of the wood
•   The secondary function can only be achieved if the primary
    function is satisfied




Data, Information, Knowledge and Action Cycle

•   Good statistics provide information that creates knowledge and
    enables correct actions

[Diagram: cycle linking Data, Information, Knowledge and Action]
Information – Lots of It




Sample Information

•   4,000 numbers representing the annual salaries of
    individuals
      − Sample data only
•   100% of the information is available here
•   Very hard to see patterns, understand the situation, gain
    insight, make effective decisions and understand their
    consequences
•   The numbers do not lie but they are innocent creatures
    and can be made to lie
•   Need techniques that extract meaning and provide insight
    without losing the information the data represents
Statistics

•   I can take all this …




•   … And give you one derived number (average)
      − 107941.931

Statistic

•   4,000 numbers reduced to 1
•   Reduced the amount of data by 99.975% (another
    “statistic”)
•   But I have lost information
•   Average value of 107941.931 is at best a simplistic view of
    the data and at worst a distortion that misrepresents the
    source data
•   If I use the average without looking to understand the raw
    data in more detail I am potentially creating a distortion
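
The information loss described on this slide can be illustrated with a minimal sketch. The two salary lists are made-up values, not the sample data behind these slides; they share the same average while describing very different situations. The last lines reproduce the 99.975% reduction figure.

```python
from statistics import mean

# Hypothetical salary lists (illustrative values, not the slide data).
evenly_paid = [50_000, 50_000, 50_000, 50_000]
one_high_earner = [10_000, 10_000, 10_000, 170_000]

# Both lists collapse to the same single number.
print(mean(evenly_paid))
print(mean(one_high_earner))

# Reducing 4,000 values to 1 discards 3,999 of them.
reduction = (1 - 1 / 4000) * 100
print(f"{reduction:.3f}%")
```

The average alone cannot tell these two lists apart, which is the distortion risk the slide warns about.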


More Statistics
    Statistic      Description                                                                            Value
    Average        Sum of all the values divided by the number of values                              107941.93
    Standard       A measure of how widely values are dispersed from the average value                 59904.19
    Deviation
    Kurtosis       Value that describes the relative peakedness or flatness of a distribution             0.112
                   where a positive value indicates a relatively peaked distribution and a negative
                   value indicates a relatively flat distribution
    Skewness       A measure of the asymmetry of a distribution around the average where a                0.731
                   positive value indicates a distribution with an asymmetric tail extending
                   toward more positive values and a negative value indicates a distribution with
                   an asymmetric tail extending toward more negative values
    Mode           The most frequently occurring value                                                    23958
    Median         The number in the middle, where half the numbers have values that are                97909.5
                   greater than the median and half have values that are less; also called the
                   50th percentile

•   Be careful what statistics are used
•   Do not generate statistics just because you can
•   The use of statistics can give a false impression of certainty or meaning where there is none
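
As a rough sketch of how the six statistics in the table can be derived, the code below uses only the Python standard library. The salary list is hypothetical, and the skewness and kurtosis functions use the plain moment-based definitions with the population standard deviation. That is an assumption: the slides do not say which formulas produced their values, and tools that apply small-sample corrections will give slightly different numbers.

```python
from statistics import mean, median, mode, pstdev


def skewness(values):
    """Moment-based skewness: the third standardised moment."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def excess_kurtosis(values):
    """Moment-based excess kurtosis: fourth standardised moment minus 3."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3


# Hypothetical salaries with a tail toward higher values.
salaries = [20_000, 30_000, 30_000, 40_000, 50_000, 60_000, 150_000]

summary = {
    "average": mean(salaries),
    "standard deviation": pstdev(salaries),
    "kurtosis": excess_kurtosis(salaries),
    "skewness": skewness(salaries),
    "mode": mode(salaries),
    "median": median(salaries),
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Being able to print all six numbers in a few lines is exactly why the slide cautions against generating statistics just because you can.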

Interpreting the Statistics
    Statistic                Value Interpretation

    Average              107941.93 The average is higher than the median, indicating that the data is
                                   unevenly dispersed, with a tail towards higher values
    Standard Deviation    59904.19 The high standard deviation indicates the underlying data is spread
                                   across a wide range of values
    Kurtosis                 0.112 The positive value indicates a relatively peaked distribution

    Skewness                 0.731 The positive value indicates a distribution with an unequal and
                                   heavy tail extending toward higher values
    Mode                     23958 In a large set of data where only a small number of data values are
                                   the same, this is meaningless
    Median                 97909.5 When the median is less than the average, it means the data is
                                   unequally distributed with a heavy tail extending toward higher
                                   values


•   I now know that the data is clustered at lower values and has a
    heavy tail, indicating a small number of people earning large salaries
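
The gap between average and median can be seen in a small sketch. The values are hypothetical and chosen only to show the effect of a single large salary.

```python
from statistics import mean, median

# Hypothetical salaries (illustrative only).
typical = [30_000, 40_000, 50_000, 60_000, 70_000]
with_tail = typical + [400_000]  # add one very high earner

# Symmetric data: average and median agree.
print(mean(typical), median(typical))

# One large value pulls the average up far more than the median.
print(round(mean(with_tail)), median(with_tail))
```

An average well above the median, as in the table above, is therefore a hint of a tail of high values rather than of a generally well-paid group.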

Let’s Take a Look at the Data

[Chart: histogram of Number of People (0 to 60) against Annual Salary (0 to 300,000)]
Let’s Take a Look at the Data

•   Characteristics
      − Increases quickly from zero
      − Positively skewed: the peak sits to the left and the tail
        extends to the right
      − Clustered around lower values
      − Gradual drop from peak
      − Heavy tail
•   This type of data distribution is very common

[Chart: the same histogram of Number of People against Annual Salary, annotated with the characteristics listed above]
Statistics

•   The usefulness of a statistic depends on the underlying data
•   Average really only makes sense when the data is
    symmetrically/equally distributed
      − Otherwise, the average is distorted because of unequal
        distribution of data
•   Deviation also really only makes sense when the data is
    symmetrically distributed

[Chart: symmetric, bell-shaped distribution plotted from -5 to 4.66 on the horizontal axis and from 0 to 0.4 on the vertical axis]
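
One way to see why deviation misleads on asymmetric data is to apply a rule of thumb for bell-shaped data to the sample salaries. The average and standard deviation are taken from the "More Statistics" table. The rule that about 95% of values lie within two standard deviations of the average is an assumption brought in for illustration, and it only holds for symmetric, bell-shaped data.

```python
# Values taken from the "More Statistics" table.
average = 107941.93
standard_deviation = 59904.19

# Rule of thumb for bell-shaped data (an assumption here):
# about 95% of values lie within two standard deviations of the average.
lower = average - 2 * standard_deviation
upper = average + 2 * standard_deviation

print(f"Implied range: {lower:,.2f} to {upper:,.2f}")
print("Lower bound is negative:", lower < 0)
```

The implied lower bound is below zero, which is impossible for a salary, so the symmetric reading of the standard deviation does not fit this data.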
Statistics

•   Be careful of obscure statistics such as Kurtosis and
    Skewness
•   They have a use but the meaning is quite specific and may
    not be appropriate




Descriptive Statistics

•   Look for statistics that contain
      − Measures of data location and clustering
      − Measures of dispersion and variability
      − Measures of association
•   Look at the underlying data, how it was collected, what it
    measures
      − If the data is of poor quality or measures the wrong values, any
        derived information will have very limited worth
•   There are lots of statistics that can be produced from the
    raw data
      − Produce only meaningful statistics
      − Do not throw statistics at the data

Some Common Descriptive and Summarising Statistics
Statistic Type                      Statistic                           Description
Data location and Clustering        Average                             Simple average
                                    Weighted Average                    Average of values weighted according
                                                                        to a value such as their importance
                                    Truncated/Interpercentile Average   Average of a central subset of the data
                                    Median                              The 50th percentile
                                    Mode                                The most commonly occurring value
Dispersion, Variability and Shape   Variance                            Measure of the amount of variation
                                                                        within the data
                                    Standard Deviation                  Square root of the Variance
                                    Range                               The spread of the data values
                                    Skewness                            Measure of the asymmetry of the
                                                                        distribution of the data
                                    Kurtosis                            Measure of the “peakedness” and the
                                                                        length of the tail of the distribution of
                                                                        the data
                                    Percentiles                         Value below which a given percentage of
                                                                        the data falls
Association                         Correlation                         Correlation has a specific meaning that
                                                                        may not be relevant to the data
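
Three of the less familiar entries in the table can be sketched in a few lines. The function names and sample values are hypothetical and chosen for illustration; the correlation shown is the Pearson coefficient, which is one common definition and is assumed here because the slide does not name a specific one.

```python
from statistics import mean


def weighted_average(values, weights):
    """Average in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def truncated_average(values, fraction):
    """Average after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return mean(ordered[k:len(ordered) - k])


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


print(weighted_average([10, 20], [3, 1]))             # 12.5
print(truncated_average([10, 20, 30, 40, 1000], 0.2))  # 30
print(correlation([1, 2, 3], [2, 4, 6]))              # 1.0
```

The truncated average ignores the extreme value of 1000, which is why it is listed as a measure of location that is less sensitive to a heavy tail.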


Another Look at the Sample Data

[Chart: Annual Salary (0 to 320,000) against Percentage Earning Up to Salary Amount (0% to 100%)]

•   This shows the salary at or below which each cumulative
    percentage of the people surveyed falls
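
A cumulative view of this kind can be built by sorting the values and pairing each with the percentage of values at or below it. The sketch below uses a hypothetical helper name and made-up salaries, and assumes all values are distinct.

```python
def cumulative_percentages(values):
    """Pair each sorted value with the percentage of values at or below it.

    Assumes the values are distinct.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [(100 * (i + 1) / n, v) for i, v in enumerate(ordered)]


# Hypothetical salaries (illustrative only).
salaries = [60_000, 20_000, 100_000, 40_000, 80_000]

for pct, salary in cumulative_percentages(salaries):
    print(f"{pct:5.1f}% earn up to {salary:,}")
```

No information is lost in this view: every original value is still present, only reordered.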

Another Look at the Sample Data

               Salary Range        Number of People   Percentage of People
               290000 - 300000                   15                   0.4%
               280000 - 290000                   17                   0.4%
               270000 - 280000                   20                   0.5%
               260000 - 270000                   22                   0.6%
               250000 - 260000                   27                   0.7%
               240000 - 250000                   32                   0.8%
               230000 - 240000                   38                   1.0%
               220000 - 230000                   47                   1.2%
               210000 - 220000                   55                   1.4%
               200000 - 210000                   67                   1.7%
               190000 - 200000                   84                   2.1%
               180000 - 190000                   96                   2.4%
               170000 - 180000                  112                   2.8%
               160000 - 170000                  128                   3.2%
               150000 - 160000                  146                   3.7%
               140000 - 150000                  166                   4.2%
               130000 - 140000                  187                   4.7%
               120000 - 130000                  209                   5.2%
               110000 - 120000                  230                   5.8%
               100000 - 110000                  249                   6.2%
                90000 - 100000                  267                   6.7%
                 80000 - 90000                  280                   7.0%
                 70000 - 80000                  285                   7.1%
                 60000 - 70000                  283                   7.1%
                 50000 - 60000                  268                   6.7%
                 40000 - 50000                  237                   5.9%
                 30000 - 40000                  193                   4.8%
                 20000 - 30000                  133                   3.3%
                 10000 - 20000                   83                   2.1%
                     0 - 10000                   24                   0.6%
Percentiles

•   The pth percentile of a set of data is the value below
    which p% of the data lies
•   Median = 50th percentile
      − Value below which 50% of data lies
•   Quartiles are the 25th, 50th and 75th percentiles
•   Percentiles are useful in summarising data
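
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slides: it uses the nearest-rank convention (value at or below which at least p% of the data lies), one of several common conventions, and the `incomes` list is invented for the example.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value that has at
    least p% of the data at or below it (p is a whole number, 1-100)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in sorted data
    return ordered[rank - 1]

# Hypothetical incomes, invented for illustration only.
incomes = [12000, 15000, 18000, 21000, 24000, 30000, 35000, 41000, 58000, 95000]

# Quartiles are the 25th, 50th (median) and 75th percentiles.
quartiles = [percentile(incomes, p) for p in (25, 50, 75)]
print(quartiles)
```

Other conventions interpolate between neighbouring values, so results can differ slightly on small data sets.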




Percentiles for Sample Data

•   This …                                                        •   … becomes this …




•   4,000 numbers reduced to 10 numbers
      − 10% of people earn 38,332 or less
      − 20% of people earn 54,834 or less
      − 10% of people earn between 192,871 and 299,433
•   Successfully reduced the volume of data while preserving more of the underlying information
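
A rough sketch of this reduction, assuming a synthetic data set: the 4,000 values below are drawn from a log-normal distribution with made-up parameters and are not the sample data behind the slides. Python's standard `statistics.quantiles` supplies the nine cut points between the ten groups, and the maximum is appended as the upper bound of the last group.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# 4,000 hypothetical incomes from a skewed (log-normal) distribution.
# The distribution and its parameters are assumptions for illustration.
incomes = [random.lognormvariate(11.5, 0.6) for _ in range(4000)]

# quantiles(n=10) returns the 9 cut points between ten equal-sized groups;
# appending the maximum gives an upper bound for every group.
deciles = statistics.quantiles(incomes, n=10, method="inclusive") + [max(incomes)]

for i, upper in enumerate(deciles, start=1):
    print(f"{i * 10:3d}% of people earn {upper:,.0f} or less")
```

The ten printed values play the same role as the ten numbers on the slide: each says what share of people fall at or below a given income.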
Anti-Statistics

•   Unfortunately, they are everywhere
•   They take a number of general forms, such as
      − Statements based on measurement of an incorrect value
      − Statements without scale or reference
      − Statements based on grouping of categories (with possible
        distortion of categories)
      − Statements based on inaccurate or unspecified association or
        correlation




Sample Type 1 Anti-Statistic

• Chimpanzee DNA is 99.7% the same as Human DNA
• What does this statement mean?
      − Do chimpanzees make cars/houses/PCs/etc. that are 99.7% as
        good as those made by humans?
•   Even if the statement is true, what is being measured may
    be the wrong thing, such as
             • 000000000000000000000000 and 000000000000000000000001
             • These numbers are 99% the same if similarity is measured by the length of
               the lines that make up their characters
      − Or
             • A lot of DNA is not involved in the development process and this is being
               included in measurements
      − Or
             • A small change in DNA has a substantial impact on what is produced
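
The point about measuring the wrong value can be illustrated with a small sketch. It compares two digit strings like those above position by position, which is a simpler measure than the line-length comparison mentioned on the slide and so gives a different percentage. The function name `positional_similarity` is made up for this example.

```python
def positional_similarity(a, b):
    """Fraction of character positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Two 24-digit strings that differ only in the final digit.
a = "0" * 24
b = "0" * 23 + "1"

similarity = positional_similarity(a, b)
print(f"Character similarity: {similarity:.1%}")

# The similarity score says nothing about what the strings represent.
print(f"Values represented: {int(a)} and {int(b)}")
```

The strings agree in 23 of 24 positions, yet one represents zero and the other does not. A high similarity score on the wrong measure carries little meaning.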

Sample Type 2 Anti-Statistic

•   Statements of the form
      − X is the greatest cause of Y, such as
             • Car crashes are the greatest cause of deaths among males in their 20s and
               30s

•   Meaningless because there is no scale or reference point
•   Statement creates an impression of scale and severity that
    is at best unjustified and at worst incorrect
•   Take a look at the underlying life expectancy data




Type 2 Anti-Statistic

•   Probability of a person dying within a year at each year of life
      − Chart of Probability of Dying Within One Year (scale 0 to 0.6) against age in years
•   Probability of a person dying within a year for first 35 years
      − Chart of Probability of Dying Within One Year (scale 0 to 0.0045) against age from 0 to 35 years
Type 2 Anti-Statistic

•   The underlying life expectancy data shows that young
    people have a very low probability of dying in any given year
•   Death rates are uniformly very low after the first year of
    life until about age 50
•   So a statement such as
      − Car crashes are the greatest cause of deaths among males in their
        20s and 30s
•   Is almost bound to be true because very little else kills
    young males
      − Death due to illness is uncommon among this group so any other
        cause will dominate
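
A short sketch of why the missing scale matters. All figures below are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the life expectancy data. The absolute risk from a cause is the overall probability of dying multiplied by the share of deaths due to that cause.

```python
def absolute_risk(p_death, cause_share):
    """Annual probability of dying from one cause:
    P(death in the year) * share of deaths due to that cause."""
    return p_death * cause_share

# Hypothetical figures, invented for illustration only.
# Leading cause (40% of deaths) in a group where death is rare.
young = absolute_risk(p_death=0.001, cause_share=0.40)
# Minor cause (2% of deaths) in a group where death is far more common.
older = absolute_risk(p_death=0.05, cause_share=0.02)

print(f"Leading cause, young group: {young:.4%} per year")
print(f"Minor cause, older group:   {older:.4%} per year")
```

Under these assumed numbers the "greatest cause" in the young group carries a lower absolute risk than a minor cause in the older group, which is the information a statement without scale leaves out.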

Sample Type 3 Anti-Statistic

•   Statements of the form
      − P% of people do/have done X at least N times/with a defined frequency
      − Typically arise as the results of tendentious surveys designed to create a false
        impression of severity
•   Such as
      − 75% of people admit to X up to N times a year
             • No indication of how the 75% is spread across the range of 1 to N times
      − 65% of people admit to having a negative experience up to N times due to X
             • No indication of the spread of negative experiences across the range of 1 to N
•   Generally a result of combining the responses to two or more
    questions or categories
      − How often have you done/experienced X?
             •     Once
             •     Twice
             •     Three times
             •     …
Type 3 Anti-Statistic

•   How often have you done/experienced X?
      −   Once: 45%
      −   Twice: 10%
      −   Three times: 8%
      −   4-8 times: 5%
      −   8-12 times: 2%
•   Total of these is 70%
•   Statement that 70% of people have done/experienced X up to
    12 times a year distorts the distribution of the underlying
    data that is skewed towards lower rates of occurrence
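
The effect of collapsing the answers into one headline figure can be sketched as follows, using the response percentages listed above. The variable names are made up for the example.

```python
# Response percentages as listed above (share of all people surveyed).
responses = {
    "Once": 45,
    "Twice": 10,
    "Three times": 8,
    "4-8 times": 5,
    "8-12 times": 2,
}

# The single headline figure is just the sum of the categories.
total = sum(responses.values())
print(f"Headline: {total}% have done/experienced X up to 12 times")

# Share of each answer within the headline group.
shares = {answer: pct / total for answer, pct in responses.items()}
for answer, share in shares.items():
    print(f"  {answer:<12} {share:.0%} of those counted")
```

Most of the people counted in the headline figure answered "Once", which the combined statement does not reveal.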
Sample Type 4 Anti-Statistic

•   Statements of the form
      − Taking/doing A makes you N% more likely to be/experience B
•   Two issues
      − Causation: is there a real causal relationship?
      − Degree of causation: how strong is the causal relationship?
•   An association does not imply causation
      − A might cause B
      − B might cause A
      − A might cause B and B might cause A
      − A might cause C that might cause B
      − A might cause C that might cause D … that might cause B
      − A might cause C that might cause B and A might cause D that might negatively influence
        B, but the A-C-B causation is greater than the A-D-B negative causation
      − Measurement error
      − Random data that happened to be skewed
      − Deliberate or malicious misrepresentation
•   Cause might be partial or contributory
•   Be careful of any statement of a relationship that does not demonstrate how
    causation happens
Association and Causation Scenarios
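•   A causes or influences B
•   B causes or influences A
•   A and B cause or influence each other
•   A causes or influences C, which causes or influences B
•   A causes or influences C, which causes or influences D, which
    causes or influences B
•   A causes or influences C, which causes or influences B, while A
    also causes or influences D, which negatively causes or influences B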
Association and Causation

•   Very common scenario where an association or causation
    is asserted
      − A takes or does D
      − Taking or doing D affects or causes B
Association and Causation

•   The real association or causation is actually along the lines
    of:
      − A takes or does D
      − A is a member of a group C
      − Members of group C have a greater tendency to take or do D
      − Members of group C also take or do E
      − Taking or doing E affects or causes B
      − Taking or doing D has little or no effect or influence on B or even
        negatively impacts B
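
The scenario above can be sketched with a small probability calculation. Every number below is hypothetical and chosen only for illustration. In this model B depends solely on E, D has no effect at all, and group C membership makes both D and E more likely; D and E are assumed independent within each group.

```python
from itertools import product

# All probabilities are hypothetical, chosen only for illustration.
P_GROUP_C = 0.5                  # share of people who belong to group C
P_D = {True: 0.8, False: 0.2}    # P(takes D), keyed by membership of C
P_E = {True: 0.9, False: 0.1}    # P(does E), keyed by membership of C
P_B = {True: 0.5, False: 0.1}    # P(B), keyed by whether E is done; D plays no part

def rate_of_b(takes_d):
    """P(B | D = takes_d), summing over group membership and E."""
    joint_b = 0.0  # P(B and D = takes_d)
    joint = 0.0    # P(D = takes_d)
    for in_c, does_e in product((True, False), repeat=2):
        p_c = P_GROUP_C if in_c else 1 - P_GROUP_C
        p_d = P_D[in_c] if takes_d else 1 - P_D[in_c]
        p_e = P_E[in_c] if does_e else 1 - P_E[in_c]
        weight = p_c * p_d * p_e
        joint += weight
        joint_b += weight * P_B[does_e]
    return joint_b / joint

with_d = rate_of_b(True)
without_d = rate_of_b(False)
relative_increase = with_d / without_d - 1

print(f"P(B | takes D)        = {with_d:.3f}")
print(f"P(B | does not take D) = {without_d:.3f}")
print(f"D-takers look {relative_increase:.0%} more likely to experience B")
```

Even though D has no effect on B in this model, people who take D show a much higher rate of B, purely because they are more likely to belong to group C and therefore to do E.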
Type 4 Anti-Statistic

•   Occurs very frequently
•   A percentage association can give a false sense of certainty
      − It only measures how loose or tight the association is
•   Often misrepresents the degree of causation
•   Unless the precise nature of the causative relationship can
    be defined, take the statement with a large pinch of salt




Summary

•   Statistics are designed to provide insight without distorting
    the meaning of the underlying data or losing information
•   Anti-statistics are used to distort the underlying data to
    create false impressions
•   So there are Lies, Damn Lies and Anti-Statistics




More Information

          Alan McSweeney
          alan@alanmcsweeney.com




9599632723 Top Call Girls in Delhi at your Door Step Available 24x7 Delhi
 
Unlocking the Secrets of Affiliate Marketing.pdf
Unlocking the Secrets of Affiliate Marketing.pdfUnlocking the Secrets of Affiliate Marketing.pdf
Unlocking the Secrets of Affiliate Marketing.pdf
 
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
VIP Call Girls Gandi Maisamma ( Hyderabad ) Phone 8250192130 | ₹5k To 25k Wit...
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 

Lies, Damn Lies And Anti Statistics

  • 1. Lies, Damn Lies and Anti-Statistics
      Alan McSweeney
  • 2. Objective • Introduce the concept of distorting “anti-statistics”, illustrate how “anti-statistics” can be identified and define how statistics should be constructed to yield insight and meaning May 18, 2010 2
  • 3. Statistics
      • A statistic has two roles - primary and secondary
          − Primary - to summarise and describe the data while preserving information and reducing the volume of raw data
          − Secondary - to provide and enable insight
      • Where an alleged statistic does not perform these functions it is an “anti-statistic”
          − Distorting the underlying information (raw data), either deliberately or accidentally
          − Not providing insight or providing an inaccurate view of the underlying information
      • Most people are scared of large sets of numbers
          − The use of anti-statistics uses this fear
  • 4. Statistics and Anti-Statistics
      Statistics        Anti-Statistics
      Descriptive       Distorting
      Insightful        Promoting Misinterpretation
      Informative       Misinformative
      Enlightening      Concealing
  • 5. Statistics - Primary Function
      • To describe the data while preserving information and reducing the volume of raw data
      • This means taking a large amount of raw data, producing descriptive summaries while not losing or distorting the underlying raw data
      • More important function of a statistic
  • 6. Statistics - Secondary Function
      • To provide and enable insight
      • By reducing the volume of raw data, you can gain insight into what the data means
          − Enabling you to see the wood for the trees, know the amount and type of wood and make decisions about the use of the wood
      • Secondary function, relevant only if the primary function is satisfied
  • 7. Data, Information, Knowledge and Action Cycle
      • Good statistics provide information that creates knowledge and enables correct actions
      [Diagram: cycle of Data, Information, Knowledge and Action]
  • 8. Information – Lots of It
  • 9. Sample Information
      • 4,000 numbers representing the annual salaries of individuals
          − Sample data only
      • 100% of the information is available here
      • Very hard to see patterns, understand the situation, gain insight and make effective decisions and understand their consequences
      • The numbers do not lie but they are innocent creatures and can be made to lie
      • Need techniques that extract meaning and provide insight without losing the information the data represents
  • 10. Statistics
      • I can take all this …
      • … And give you one derived number (average)
          − 107941.931
  • 11. Statistic
      • 4,000 numbers reduced to 1
      • Reduced the amount of data by 99.975% (another “statistic”)
      • But I have lost information
      • Average value of 107941.931 is at best a simplistic view of the data and at worst a distortion that misrepresents the source data
      • If I use the average without looking to understand the raw data in more detail I am potentially creating a distortion
  • 12. More Statistics
      − Average (107941.93): Sum of all the values divided by the number of values
      − Standard Deviation (59904.19): A measure of how widely values are dispersed from the average value
      − Kurtosis (0.112): Value that describes the relative peakedness or flatness of a distribution, where a positive value indicates a relatively peaked distribution and a negative value indicates a relatively flat distribution
      − Skewness (0.731): A measure of the asymmetry of a distribution around the average, where a positive value indicates a distribution with an asymmetric tail extending toward more positive values and a negative value indicates a distribution with an asymmetric tail extending toward more negative values
      − Mode (23958): The most frequently occurring value
      − Median (97909.5): The number in the middle, where half the numbers have values that are greater than the median and half have values that are less (also called the 50th percentile)
      • Be careful what statistics are used
      • Do not generate statistics just because you can
      • The use of statistics can give a false impression of certainty or meaning where there is none
  • 13. Interpreting the Statistics
      − Average (107941.93): The average is higher than the median, indicating that the data is dispersed unequally towards higher values
      − Standard Deviation (59904.19): The high standard deviation indicates the underlying data is spread across a wide range of values
      − Kurtosis (0.112): The positive value indicates that there is a peak in the data
      − Skewness (0.731): The positive value indicates a distribution with an unequal and heavy tail extending toward higher values
      − Mode (23958): In a large set of data where only a small number of data values are the same, this is meaningless
      − Median (97909.5): When the median is less than the average, it means the data is unequally distributed with a heavy tail extending toward higher values
      • I now know that the data is clustered towards lower values and has a heavy tail, indicating a small number of people earning large salaries
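The comparison of average and median above can be sketched in a few lines of Python. This is only an illustration: the ten salaries are hypothetical (they are not the 4,000-value sample), the helper names `skewness` and `summarise` are made up for this example, and the skewness formula is the simple population version, which may differ slightly from the one a spreadsheet uses.

```python
import statistics

# Hypothetical salaries (NOT the 4,000-value sample from the slides):
# most values are low, with a few large ones forming a heavy tail.
salaries = [20000, 30000, 40000, 50000, 60000,
            70000, 90000, 120000, 180000, 300000]

def skewness(values):
    """Population skewness: the mean cubed z-score.
    Positive means the tail extends toward higher values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def summarise(values):
    """Return a small set of descriptive statistics for a list of numbers."""
    return {
        "average": statistics.fmean(values),
        "median": statistics.median(values),
        "std_dev": statistics.pstdev(values),
        "skewness": skewness(values),
    }

summary = summarise(salaries)
for name, value in summary.items():
    print(f"{name:>8}: {value:,.2f}")
```

With data shaped like this, the average sits well above the median and the skewness is positive, which is the same pattern the slide reads off the real sample.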
  • 14. Let’s Take a Look at the Data
      [Chart: histogram of Number of People (0 to 60) against Annual Salary (0 to 300000)]
  • 15. Let’s Take a Look at the Data
      • Characteristics
          − Increases quickly from zero
          − Distribution bunched to the left (positively skewed)
          − Clustered around lower values
          − Gradual drop from peak
          − Heavy tail
      • This type of data distribution is very common
      [Chart: histogram of Number of People (0 to 60) against Annual Salary (0 to 300000), annotated with the characteristics listed above]
  • 16. Statistics
      • The usefulness of a statistic depends on the underlying data
      • Average really only makes sense when the data is symmetrically/equally distributed
          − Otherwise, the average is distorted because of unequal distribution of data
      • Deviation also really only makes sense when the data is symmetrically distributed
      [Chart: symmetric bell-shaped distribution, horizontal axis from about -5 to 5, vertical axis from 0 to 0.4]
  • 17. Statistics
      • Be careful of obscure statistics such as Kurtosis and Skewness
      • They have a use but the meaning is quite specific and may not be appropriate
  • 18. Descriptive Statistics
      • Look for statistics that contain
          − Measures of data location and clustering
          − Measures of dispersion and variability
          − Measures of association
      • Look at the underlying data, how it was collected, what it measures
          − If the data is of poor quality or measures the wrong values, any derived information will have very limited worth
      • There are lots of statistics that can be produced from the raw data
          − Produce only meaningful statistics
          − Do not throw statistics at the data
  • 19. Some Common Descriptive and Summarising Statistics
      • Data Location and Clustering
          − Average: Simple average
          − Weighted Average: Average of values weighted according to a value such as their importance
          − Truncated/Interpercentile Average: Average of centralised subset of data
          − Median: The 50th percentile
          − Mode: The most commonly occurring value
      • Dispersion, Variability and Shape
          − Variance: Measure of the amount of variation within the data
          − Standard Deviation: Square root of the Variance
          − Range: The spread of the data values
          − Skewness: Measure of the asymmetry of the distribution of the data
          − Kurtosis: Measure of the “peakedness” and the length of the tail of the distribution of the data
          − Percentiles: Value below which a certain percent of the data fall
      • Association
          − Correlation: Correlation has a specific meaning that may not be relevant to the data
  • 20. Another Look at the Sample Data
      [Chart: Annual Salary (0 to 320000) against Percentage Earning Up to Salary Amount (0% to 100%)]
      • This shows the salaries of cumulative percentages of the people surveyed
  • 21. Another Look at the Sample Data
      Salary Range        Number of People    Percentage of People
      290000 - 300000                   15                    0.4%
      280000 - 290000                   17                    0.4%
      270000 - 280000                   20                    0.5%
      260000 - 270000                   22                    0.6%
      250000 - 260000                   27                    0.7%
      240000 - 250000                   32                    0.8%
      230000 - 240000                   38                    1.0%
      220000 - 230000                   47                    1.2%
      210000 - 220000                   55                    1.4%
      200000 - 210000                   67                    1.7%
      190000 - 200000                   84                    2.1%
      180000 - 190000                   96                    2.4%
      170000 - 180000                  112                    2.8%
      160000 - 170000                  128                    3.2%
      150000 - 160000                  146                    3.7%
      140000 - 150000                  166                    4.2%
      130000 - 140000                  187                    4.7%
      120000 - 130000                  209                    5.2%
      110000 - 120000                  230                    5.8%
      100000 - 110000                  249                    6.2%
       90000 - 100000                  267                    6.7%
       80000 - 90000                   280                    7.0%
       70000 - 80000                   285                    7.1%
       60000 - 70000                   283                    7.1%
       50000 - 60000                   268                    6.7%
       40000 - 50000                   237                    5.9%
       30000 - 40000                   193                    4.8%
       20000 - 30000                   133                    3.3%
       10000 - 20000                    83                    2.1%
           0 - 10000                    24                    0.6%
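A table of counts and percentages per salary range like the one above can be produced with a simple binning step. The sketch below is illustrative only: the `histogram` function, the ten sample values and the choice of half-open ranges (a value of exactly 10000 falls in the 10000 - 20000 range) are assumptions, not details taken from the slides.

```python
from collections import Counter

def histogram(values, bin_width=10000):
    """Count values per [start, start + bin_width) range.
    Returns (range_start, count, percent_of_total) rows in ascending order."""
    counts = Counter((int(v) // bin_width) * bin_width for v in values)
    total = len(values)
    return [(start, counts[start], 100.0 * counts[start] / total)
            for start in sorted(counts)]

# Hypothetical salaries, used only to exercise the function.
sample = [4500, 12000, 18000, 23958, 27000,
          31000, 36000, 52000, 58000, 97909]

rows = histogram(sample)
for start, count, percent in rows:
    print(f"{start:>6} - {start + 10000:<6} {count:>3} {percent:5.1f}%")
```

Ranges with no values are simply omitted here; a fuller version might emit them with a count of zero so that gaps in the data stay visible.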
  • 22. Percentiles
      • Percentile of a set of data is the number or value below which that percent of data lies
      • Median = 50th percentile
          − Value below which 50% of data lies
      • Quartiles are percentiles for 25%, 50% and 75%
      • Percentiles are useful in summarising data
  • 23. Percentiles for Sample Data
      • This …
      • … becomes this …
      • 4,000 numbers reduced to 10 numbers
          − 10% of people earn 38,332 or less
          − 20% of people earn 54,834 or less
          − 10% of people earn between 192,871 and 299,433
      • Successfully reduced the volume of data while preserving more information
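A ten-number percentile summary of this kind can be sketched with Python's standard `statistics.quantiles` function. The data here is hypothetical (101 evenly spaced values, not the salary sample), `decile_summary` is a made-up helper name, and treating the tenth number as the maximum is an assumption about how the slide arrived at ten values.

```python
import statistics

def decile_summary(values):
    """Return ten numbers: the 10th to 90th percentiles followed by the maximum."""
    cut_points = statistics.quantiles(values, n=10, method="inclusive")
    return cut_points + [max(values)]

# Hypothetical data: 101 evenly spaced salaries from 0 to 100,000.
salaries = [1000 * i for i in range(101)]

summary = decile_summary(salaries)
for percent, value in zip(range(10, 101, 10), summary):
    print(f"{percent:>3}% of people earn {value:,.0f} or less")
```

The fifth number in the summary is the median, so this summary keeps the one-number view of the data and adds the shape around it.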
  • 24. Anti-Statistics
      • Unfortunately everywhere
      • Take a number of general forms or types such as
          − Statement based on measurement of incorrect value
          − Statement without scale or reference
          − Statement based on grouping of categories (with possible distortion of categories)
          − Statements based on inaccurate or unspecified association or correlation
  • 25. Sample Type 1 Anti-Statistic
      • Chimpanzee DNA is 99.7% the same as Human DNA
      • What does this statement mean?
          − Do chimpanzees make cars/houses/PCs/etc. that are 99.7% as good as those made by humans?
      • If the statement is true then what is being measured may be invalid, such as
              • 000000000000000000000000 and 000000000000000000000001
              • These numbers are 99% the same based on the length of the lines in their characters
          − Or
              • A lot of DNA is not involved in the development process and this is being included in measurements
          − Or
              • A small change in DNA has a substantial impact on what is produced
  • 26. Sample Type 2 Anti-Statistic
      • Statements of the form
          − X is the greatest cause of Y, such as
              • Car crashes are the greatest cause of deaths among males in their 20s and 30s
      • Meaningless because there is no scale or reference point
      • Statement creates an impression of scale and severity that is at best not justified or at worst incorrect
      • Take a look at the underlying life expectancy data
  • 27. Type 2 Anti-Statistic
      • Probability of a person dying within a year at each year of life
        [Chart: Probability of Dying Within One Year (0 to 0.6) against age in years]
      • Probability of a person dying within a year for first 35 years
        [Chart: Probability of Dying Within One Year (0 to 0.0045) against age in years (0 to 35)]
  • 28. Type 2 Anti-Statistic
      • The underlying life expectancy data shows that young people have very little chance of dying
      • Death rates are uniformly very low after the first year of life until about age 50
      • So a statement such as
          − Car crashes are the greatest cause of deaths among males in their 20s and 30s
      • Will inevitably be true because nothing else really kills young males
          − Death due to illness is uncommon among this group so any other cause will dominate
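The gap between "greatest cause" and actual risk can be made concrete with a small calculation. Every number in the sketch below is hypothetical and chosen only to illustrate the point; none of them come from the life expectancy data in the slides, and `absolute_risk` is a made-up helper name.

```python
# All figures below are hypothetical and chosen only to illustrate the point.
groups = {
    "age 25": {"p_death": 0.001, "leading_cause_share": 0.30},
    "age 75": {"p_death": 0.040, "leading_cause_share": 0.25},
}

def absolute_risk(p_death, leading_cause_share):
    """Annual probability of dying from the leading cause:
    overall probability of dying times the leading cause's share of deaths."""
    return p_death * leading_cause_share

risks = {name: absolute_risk(**figures) for name, figures in groups.items()}

for name, risk in risks.items():
    share = groups[name]["leading_cause_share"]
    print(f"{name}: leading cause is {share:.0%} of deaths, "
          f"absolute annual risk {risk:.4%}")
```

In this made-up case the leading cause takes a larger share of deaths in the younger group, yet the absolute risk it represents is far smaller. The share alone says nothing about scale.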
  • 29. Sample Type 3 Anti-Statistic
      • Statements of the form
          − N% of people do/have done X at least N times/with defined frequency
          − Typically arise as the results of tendentious surveys designed to create a false impression of severity
      • Such as
          − 75% of people admit to X up to N times a year
              • No indication of how the 75% is spread across the range of 1 to N times
          − 65% of people admit to having a negative experience up to N times due to X
              • No indication of the spread of negative experiences across the range of 1 to N
      • Generally a result of combining the responses to two or more questions or categories
          − How often have you done/experienced X?
              • Once
              • Twice
              • Three times
              • …
  • 30. Type 3 Anti-Statistic
      • How often have you done/experienced X?
          − Once: 45%
          − Twice: 10%
          − Three times: 8%
          − 4-8 times: 5%
          − 8-12 times: 2%
      • Total of these is 75%
      • Statement that 75% of people have done/experienced X up to 12 times a year distorts the distribution of the underlying data that is skewed towards lower rates of occurrence
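How much a grouped headline hides can be checked by asking what share of the combined group falls into each answer. The sketch below uses the response shares listed in the slide as given and sums them directly; the variable names are made up for this example.

```python
# Response shares (percent of all respondents) as listed on the slide.
responses = {
    "Once": 45,
    "Twice": 10,
    "Three times": 8,
    "4-8 times": 5,
    "8-12 times": 2,
}

# Everyone who is folded into the "up to 12 times" headline figure.
combined = sum(responses.values())

# Share of the combined group that falls into each answer.
shares = {answer: percent / combined for answer, percent in responses.items()}

for answer, share in shares.items():
    print(f"{answer:<12} {share:6.1%} of those counted in the headline")
```

Well over half of the people counted in the headline reported a single occurrence, which the "up to 12 times" wording conceals.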
  • 31. Sample Type 4 Anti-Statistic
      • Statements of the form
          − Taking/doing A makes you N% more likely to be/experience B
      • Two issues
          − Causation – is there a real causal relationship
          − Degree of causation – how strong is the causal relationship
      • An association does not imply a causation
          − A might cause B
          − B might cause A
          − A might cause B and B might cause A
          − A might cause C that might cause B
          − A might cause C that might cause D … that might cause B
          − A might cause C that might cause B and A might cause D that might not cause B but A-C-B causation is greater than A-D-B negative causation
          − Measuring error
          − Random data that was skewed
          − Deliberate or malicious misrepresentation
      • Cause might be partial or contributory
      • Be careful of any statement of a relationship that does not demonstrate how causation happens
  • 32. Association and Causation Scenarios
      [Diagram: causal scenarios linking A, B, C and D, with arrows labelled “Causes or Influences” and “Negatively Causes or Influences”]
  • 33. Association and Causation
      • Very common scenario where an association or causation is asserted
      [Diagram: A takes or does D; taking or doing D affects or causes B]
  • 34. Association and Causation
      • The real association or causation is actually along the lines of:
          − A is a member of group C
          − Members of group C have a greater tendency to take or do D
          − Members of group C also take or do E
          − Taking or doing E affects or causes B
          − Taking or doing D has little or no effect or influence on B, or even negatively impacts B
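The scenario above, where membership of group C drives both D and B while D itself does nothing, can be worked through with exact probabilities. The sketch below is illustrative only: all probabilities are hypothetical, the effect of E is folded into the chance of B given membership of C, and the names are made up for this example.

```python
# Hypothetical probabilities, chosen only for illustration.
P_C = 0.5                             # share of people who are members of group C
P_D_GIVEN = {True: 0.8, False: 0.2}   # chance of doing D, by membership of C
P_B_GIVEN = {True: 0.3, False: 0.1}   # chance of B, by membership of C (via E);
                                      # D plays no part in B at all

def p_b_given_d(does_d):
    """P(B | D = does_d), averaged over membership of group C."""
    numerator = 0.0
    denominator = 0.0
    for in_c in (True, False):
        p_c = P_C if in_c else 1 - P_C
        p_d = P_D_GIVEN[in_c] if does_d else 1 - P_D_GIVEN[in_c]
        numerator += p_c * p_d * P_B_GIVEN[in_c]
        denominator += p_c * p_d
    return numerator / denominator

relative_risk = p_b_given_d(True) / p_b_given_d(False)

print(f"P(B | does D)         = {p_b_given_d(True):.2f}")
print(f"P(B | does not do D)  = {p_b_given_d(False):.2f}")
print(f"Doing D appears to make B {relative_risk - 1:.0%} more likely")
```

Within group C, and separately outside it, the chance of B is the same whether or not a person does D, so the apparent effect comes entirely from who chooses to do D.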
  • 35. Type 4 Anti-Statistic
      • Occurs very frequently
      • A percentage association can give a false sense of certainty
          − Just measures the looseness of association
      • Often misrepresents the degree of causation
      • Unless the precise nature of the causative relationship can be defined, take with a large dose of salt
  • 36. Summary
      • Statistics are designed to provide insight without distorting the meaning of the underlying data or losing information
      • Anti-statistics are used to distort the underlying data to create false impressions
      • So there are Lies, Damn Lies and Anti-Statistics
  • 37. More Information
      Alan McSweeney
      alan@alanmcsweeney.com