We’ve described histograms as being extremely flexible and
having the ability to condense large data sets into usable
terms. The key to the flexibility and versatility of histograms
is bin size.

A “bin” is one of the intervals that divide the x-axis of a
histogram. Bins must be equal in size. Constructing a
histogram with varied bin sizes can lead to a lot of
confusion.

[Figure: Salary Ranges (Sears, LLC)]

One challenge when creating a histogram is selecting the
number of bins. Choosing the interval can be arbitrary, but
there are a few methods for selecting the number of bins:

1. Count the number “n” of total data points
2. Take the square root of n, round up

Let’s try one. You have a data set where n = 55. To determine
the number of bins, take the square root of 55, which is about
7.416. Rounded up, that gives 8. With 8 as the suggested number
of bins, you can then look at the type of data and decide
which eight equal intervals you would like to display.

Source: http://www.qimacros.com/quality-tools/how-to-determine-histogram-bin-interval/
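
The square-root rule above is easy to check in code. Here is a minimal Python sketch; the function name `sqrt_bin_count` is made up for illustration and is not taken from the slides or the cited source.

```python
import math


def sqrt_bin_count(n: int) -> int:
    """Square-root rule: take the square root of n and round up."""
    if n <= 0:
        raise ValueError("n must be a positive number of data points")
    root = math.isqrt(n)  # integer part of the square root
    # Round up unless n is a perfect square.
    return root if root * root == n else root + 1


print(sqrt_bin_count(55))  # 8, matching the worked example
```

Integer arithmetic is used here so that perfect squares such as n = 49 give exactly 7 without any floating-point rounding concerns.
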
Here’s another way to determine the number of bins to use:

1. Determine the bin range (max p – min p)
2. Determine the width “h” you want for the bins
3. Divide the bin range by the desired width:

     b = (max p – min p) / h

Let’s try one of these. You have a data set where the largest
number is 100 and the smallest is 5. So, the bin range would
be 100 - 5 = 95. You decide you want the bin width to be 10.
Now, to calculate the number of bins, divide 95 by 10 to
get 9.5. Round up = 10! Easy, right?!

Source: http://www.qimacros.com/quality-tools/how-to-determine-histogram-bin-interval/
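
The range-divided-by-width method can be sketched the same way. This is an illustrative Python version; the function name `bin_count_from_width` and the sample values are assumptions, chosen only so that the largest value is 100 and the smallest is 5 as in the example.

```python
import math


def bin_count_from_width(values, width):
    """Number of bins b = (max - min) / h, rounded up."""
    if width <= 0:
        raise ValueError("bin width must be positive")
    data_range = max(values) - min(values)
    # At least one bin, even if every value is identical.
    return max(1, math.ceil(data_range / width))


values = [5, 20, 42, 67, 100]  # hypothetical data: min 5, max 100
print(bin_count_from_width(values, 10))  # 10, since 95 / 10 = 9.5
```

One thing to watch: when the range divides evenly by the width, the largest value sits exactly on the last bin edge, so the last bin has to include its upper boundary.
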
Remember, histograms are flexible because of their bins. You
don’t have to do fancy calculations; you can just arbitrarily
adopt a number of bins so long as they have equal intervals.
But another thing to remember is that after a certain
point, usually beyond 20 bins, even a histogram can get
difficult to follow. Here is a quick reference chart to help you
choose the right number of bins.

   Number of Data Points        Number of Bars
   20 - 50                      6
   51 - 100                     7
   101 - 200                    8
   201 - 500                    9
   501 - 1000                   10
   1000+                        11 - 20
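
The reference chart can also be written as a small lookup. This is a sketch only: the function name `suggested_bars` is hypothetical, and two choices below are assumptions the chart does not spell out: exactly 1000 points is treated as the "501 - 1000" row, and fewer than 20 points raises an error because the chart has no row for it.

```python
# (upper limit of data points, suggested number of bars), from the chart
CHART = [(50, 6), (100, 7), (200, 8), (500, 9), (1000, 10)]


def suggested_bars(n):
    """Return the suggested (low, high) range of bars for n data points."""
    if n < 20:
        raise ValueError("the chart starts at 20 data points")
    for upper, bars in CHART:
        if n <= upper:
            return (bars, bars)
    return (11, 20)  # the "1000+" row gives a range, not one number


print(suggested_bars(55))    # (7, 7)
print(suggested_bars(5000))  # (11, 20)
```

For n = 55 the chart suggests 7 bars, while the square-root rule gives 8. Both are rules of thumb, so small differences like this are expected.
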
In Slide 4, we just picked a number out of the blue for bin
width h, but we have some cautions about this one too. The
shape of a histogram is sensitive to the width of its bins.

If the bins are too wide, important information might be
hidden. If the bins are too narrow, what appears to be a
meaningful pattern might just be random variation.
Source: http://www.netmba.com/statistics/histogram/
Depending on the actual data distribution and the goals of
your analysis, you may choose different widths. In
fact, depending on the situation, you might create two
histograms from the same data set with different bin sizes.

How do you know which h to use? Sometimes it just comes
down to experimenting. Try different widths until the histogram
tells an honest story about what you’re trying to analyze. Now
you see why we caution against bin size tampering!
Source: http://www.netmba.com/statistics/histogram/
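
To see how much the width h changes the picture, here is a small Python sketch that counts the same values with three different widths. The data and the function name `bin_counts` are made up for illustration, and the sketch assumes every value is at or above `start`.

```python
from collections import Counter


def bin_counts(values, width, start=0):
    """Count values in equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = Counter(int((v - start) // width) for v in values)
    # Include empty bins so gaps in the data stay visible.
    return [counts.get(k, 0) for k in range(max(counts) + 1)]


values = [1, 2, 2, 3, 8, 9, 9, 10, 11, 18, 19]  # hypothetical data

print(bin_counts(values, 10))  # [7, 4]
print(bin_counts(values, 5))   # [4, 3, 2, 2]
print(bin_counts(values, 2))   # [1, 3, 0, 0, 3, 2, 0, 0, 0, 2]
```

With a width of 10 the data looks like one lump that tapers off. With a width of 2 you can see three separate clusters, but with only 11 points some of that detail may be random variation.
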
EXAMPLE SCENARIO: You are the Sales Manager for an
online retail company. You want to track revenue numbers for
the year, so you decide to generate a histogram.

Theoretically, you could create a histogram with a bin size of 1
day, and there would be 365 bins. Or you could create a bin
size of one full quarter, and there would be only four. Would
these two histograms tell a different story? Absolutely!


Showing 365 bins for revenue would make for a cumbersome
histogram that might show random events. Showing 4 bins
for revenue is much simpler, but might hide important
variations. A more accurate story might be told by a
histogram with 12 bins for revenue by month. It is the same
data telling a different story, and 12 bins might do a better
job of condensing a large amount of data while still capturing
important variations, like seasonal fluctuations.

Experimentation is the key to bin width h.
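
The revenue scenario can be tried out with a short Python sketch. Everything here is hypothetical: the year 2023, the revenue figures (100 per day, doubled in November and December to mimic a seasonal bump), and the helper names are assumptions made for illustration.

```python
from datetime import date, timedelta


def daily_revenue_2023():
    """Hypothetical data: 100 per day, 200 per day in November and December."""
    revenue = {}
    day = date(2023, 1, 1)
    while day.year == 2023:
        revenue[day] = 200 if day.month >= 11 else 100
        day += timedelta(days=1)
    return revenue


def totals_by(revenue, key):
    """Add up revenue per bin, where key(day) names the bin for that day."""
    totals = {}
    for day, amount in revenue.items():
        totals[key(day)] = totals.get(key(day), 0) + amount
    return totals


daily = daily_revenue_2023()
monthly = totals_by(daily, lambda d: d.month)
quarterly = totals_by(daily, lambda d: (d.month - 1) // 3 + 1)

print(len(daily), len(monthly), len(quarterly))  # 365 12 4
print(monthly[10], monthly[11], monthly[12])     # 3100 6000 6200
print(quarterly[3], quarterly[4])                # 9200 15300
```

The monthly bins show the jump starting in November. The quarterly bins only show that the fourth quarter was bigger, without saying when the increase began.
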
LET’S RECAP!

1. Find the number of bins by taking the square root of n
   and rounding up.
2. Experimentation works for both number of bins and bin
   width.
3. Too many or too few bins can show random events or hide
   important information.
4. The appropriate number of bins combined with the
   proper bin width can tell a powerful story of data!
CRITICAL THINKING: In the previous
example, what would be the arguments for using bin
sizes of one day, one month, and one quarter?
Think about what you’re trying to analyze.


Module 3.1
