Here are some potential arguments for different bin sizes in the revenue histogram example:
- One day bin size: Could show day-to-day fluctuations and identify specific high/low revenue days. However, 365 bins may be too granular and show random noise.
- One month bin size: Could identify seasonal trends across months while still showing monthly variations. 12 bins balances granularity and visibility.
- One quarter bin size: Would simplify the histogram but may hide important variations across months within each quarter. Only 4 bins may obscure seasonal patterns.
The appropriate bin size depends on whether the goal is to identify daily, monthly, or quarterly trends. Monthly binning (12 bins) seems best for capturing seasonal patterns in revenue.
2. We’ve described histograms as being extremely flexible and
having the ability to condense large data sets into usable
terms. The key to the flexibility and versatility of histograms
is bin size.
A “bin” is one of the intervals that divide the x-axis of a
histogram. Bins must be equal in size. Constructing a histogram
with varied bin sizes can lead to a lot of confusion.
[Figure: histogram titled “Salaries Ranges (Sears, LLC)”; vertical axis from 0 to 200]
3. One challenge when creating a histogram is selecting the
number of bins. Determining the interval can be arbitrary, but
there are a few methods for selecting the number of bins:
1. Count the number “n” of total data points
2. Take the square root of n, round up
Let’s try one. You have a data set where n = 55. To determine
the number of bins, take the square root of 55, which is about
7.416. Rounded up, that gives 8. With 8 as the suggested number
of bins, you can then look at the type of data and determine
what eight equal intervals you would like to display.
Source: http://www.qimacros.com/quality-tools/how-to-determine-histogram-bin-interval/
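The square-root rule from this slide can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation; the function name sqrt_rule_bins is made up for this example.

```python
import math


def sqrt_rule_bins(n: int) -> int:
    """Number of bins by the square-root rule: sqrt(n), rounded up."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # math.isqrt returns the integer floor of the square root, so add one
    # unless n is a perfect square.
    root = math.isqrt(n)
    return root if root * root == n else root + 1


print(sqrt_rule_bins(55))  # 8, matching the worked example on this slide
```

Integer arithmetic is used here instead of math.ceil(math.sqrt(n)) only to sidestep floating-point rounding for very large n; for slide-sized data sets either approach gives the same answer.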
4. Here’s another way to determine the number of bins to use:
1. Determine the bin range (max p – min p)
2. Determine the width “h” you want for the bins
3. Divide the bin range by the desired width:
b = (max p – min p) / h
Let’s try one of these. You have a data set where the largest
number is 100 and the smallest is 5. So, the bin range would
be 100-5 = 95. You decide you want the bin interval to be 10.
Now, to calculate the number of bins, you take 95 / 10 and
get 9.5. Rounded up, that is 10! Easy, right?!
Source: http://www.qimacros.com/quality-tools/how-to-determine-histogram-bin-interval/
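The range-over-width calculation from this slide might look like this in Python. It is a small sketch under the slide's own definitions; the function name bins_from_width is hypothetical.

```python
import math


def bins_from_width(values, width):
    """Number of bins b = (max - min) / h, rounded up."""
    if width <= 0:
        raise ValueError("width must be positive")
    data_range = max(values) - min(values)  # the "bin range" on the slide
    return math.ceil(data_range / width)


# Only the largest and smallest values matter for this calculation.
print(bins_from_width([5, 100], 10))  # 10, matching the worked example
```

One thing to watch: when the range divides evenly by the width, the largest value sits exactly on the last bin edge, so the last bin needs to include its upper edge for that value to be counted.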
5. Remember, histograms are flexible because of their bins. You
don’t have to do fancy calculations; you can just adopt a
number of bins arbitrarily, so long as they have equal intervals.
But another thing to remember is that after a certain
point, usually beyond 20 bins, even a histogram can get
difficult to follow. Here is a quick reference chart to help you
choose the right number of bins.
Number of Data Points    Number of Bars
20 - 50                  6
51 - 100                 7
101 - 200                8
201 - 500                9
501 - 1000               10
1000+                    11-20
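If you would rather look the number up in code, the reference chart can be written as a small lookup. This is a sketch only: the name suggested_bars is made up, exactly 1000 points is assumed to belong to the 501 - 1000 row, and fewer than 20 points raises an error because the chart does not cover that case.

```python
# (low, high, bars) rows copied from the reference chart.
REFERENCE_CHART = [
    (20, 50, 6),
    (51, 100, 7),
    (101, 200, 8),
    (201, 500, 9),
    (501, 1000, 10),
]


def suggested_bars(n):
    """Return the suggested (minimum, maximum) number of bars for n points."""
    if n < 20:
        raise ValueError("the chart does not cover fewer than 20 data points")
    for low, high, bars in REFERENCE_CHART:
        if low <= n <= high:
            return (bars, bars)
    # Anything above 1000 falls in the open-ended last row of the chart.
    return (11, 20)


print(suggested_bars(55))    # (7, 7)
print(suggested_bars(5000))  # (11, 20)
```

Note that for n = 55 the chart suggests 7 bars while the square-root rule gives 8. The two methods are rules of thumb and do not always agree.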
6. In Slide 4, we just picked a number out of the blue for bin
width h, but we have some cautions about this one too. The
shape of a histogram is sensitive to the width of the bins.
If the bins are too wide, important information might be
hidden. If the bins are too narrow, what appears to be a
meaningful variation might just be random noise.
[Figure: example histogram; x-axis bins 1 to 10 plus “More”, y-axis 0 to 25]
Source: http://www.netmba.com/statistics/histogram/
7. Depending on the actual data distribution and the goals of
your analysis, you may choose different widths. In
fact, depending on the situation, you might create two
histograms from the same data set with different bin sizes.
How do you know which h to use? Sometimes it just comes
down to experimenting. Try different widths until the histogram
tells an honest story about the data you’re trying to analyze. Now
you see why we caution against bin size tampering!
[Figure: example histogram; x-axis bins 1 to 10 plus “More”, y-axis 0 to 25]
Source: http://www.netmba.com/statistics/histogram/
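Experimenting with bin width h is easy to try in code. The sketch below counts the same values under several widths so the shapes can be compared. The sample values are made up purely for illustration, and histogram_counts is a hypothetical helper, not a library function.

```python
import math


def histogram_counts(values, width):
    """Count values in equal-width bins that start at min(values)."""
    if width <= 0:
        raise ValueError("width must be positive")
    low = min(values)
    n_bins = max(1, math.ceil((max(values) - low) / width))
    counts = [0] * n_bins
    for v in values:
        # The largest value is folded into the last bin so it is not dropped.
        index = min(int((v - low) // width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample data, invented for this example only.
sample = [5, 7, 8, 12, 13, 13, 14, 21, 22, 25, 31, 38, 40, 47, 52, 60, 75, 100]

for h in (5, 10, 25, 50):
    counts = histogram_counts(sample, h)
    print(f"h={h:>2}: " + " ".join(str(c) for c in counts))
```

Every row accounts for the same 18 values; only the grouping changes. Narrow widths produce many sparse bins, while wide ones collapse the data into a couple of tall bars.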
8. EXAMPLE SCENARIO: You are the Sales Manager for an
online retail company. You want to track revenue numbers for
the year so you decide to generate a histogram.
Theoretically, you could create a histogram with a bin size of 1
day, and there would be 365 bins. Or you could create a bin
size of one full quarter, and there would be only four. Would
these two histograms tell a different story? Absolutely!
9. Showing 365 bins for revenue would make for a cumbersome
histogram that might show random events. Showing 4 bins
for revenue is much simpler, but might hide important
variations. A more accurate story might be told by a
histogram with 12 bins for revenue by month. Same
data, different story, but 12 bins might do a better job of
condensing a large amount of data while still capturing
important variations, like seasonal fluctuations.
Experimentation is the key to bin width h.
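To see how the same year of revenue collapses into 365, 12, or 4 bins, here is a rough Python sketch. The revenue figures are invented (a flat daily amount with a made-up bump in November and December), and total_by is a hypothetical helper written for this example.

```python
from datetime import date, timedelta


def total_by(daily, key):
    """Sum daily revenue into bins chosen by key(day)."""
    totals = {}
    for day, revenue in daily.items():
        bin_label = key(day)
        totals[bin_label] = totals.get(bin_label, 0) + revenue
    return totals


# Hypothetical revenue: 100 per day, plus 50 extra per day in November
# and December. These numbers are illustrative, not real data.
start = date(2023, 1, 1)
daily = {}
for offset in range(365):
    day = start + timedelta(days=offset)
    daily[day] = 100 + (50 if day.month in (11, 12) else 0)

by_day = total_by(daily, lambda d: d)                           # 365 bins
by_month = total_by(daily, lambda d: d.month)                   # 12 bins
by_quarter = total_by(daily, lambda d: (d.month - 1) // 3 + 1)  # 4 bins

print(len(by_day), len(by_month), len(by_quarter))
print(by_month[10], by_month[11], by_month[12])
print(by_quarter[4])
```

In this made-up data the monthly view shows the bump starting in November, while the quarterly view blends October into the same bar, which is the kind of hidden variation the slide warns about.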
10. LET’S RECAP!
1. Find the number of bins by taking the square root of n
and rounding up.
2. Experimentation works for both number of bins and bin
width.
3. Too many or too few bins can show random events or hide
important information.
4. The appropriate number of bins combined with the
proper bin width can tell a powerful story of data!
11. CRITICAL THINKING: In the previous
example, what would be arguments for using bin
sizes of one day, one month, and one quarter?
Think about what you’re trying to analyze.