1. Effective management of high volume
numeric data with histograms
Fred Moyer @Circonus
DataEngConf SF ‘18
2. @phredmoyer
● Engineer to Engineer @circonus
● Recovering C and Perl programmer
● Geeking out on histograms since 2015
3. Pain driven development
● Observability tools caused a telemetry firehose
● Existing monitoring systems got washed away
● Average based metrics gave limited insight
4. “Effective Management”
● Performance AND scalability
● Avoid memory allocations, copies, locks, waits
● Persist data in size efficient structures
19. Storage efficiency - 1 month
30 days of one minute histograms
30 days * 24 hours/day * 60 histograms/hour * 300
bins/histogram * 10 bytes/bin * 1 kB/1,024 bytes * 1 MB/1,024 kB
= 123.6 MB
20. Storage efficiency - 1 year
365 days of five minute histograms
365 days * 24 hours/day * 12 histograms/hour * 300
bins/histogram * 10 bytes/bin * 1 kB/1,024 bytes * 1 MB/1,024 kB
= 300.8 MB
21. Quantile calculation
1. Given a quantile q(X) where 0 < X < 1
2. Sum up the counts of all the bins, C
3. Multiply X * C to get count Q
4. Walk bins, sum bin boundary counts until > Q
5. Interpolate quantile value q(X) from bin
23. Recap
● Several different types of histograms
● Highly space efficient
● O(1) and O(n) complexity calculating quantiles
● What other fun things can we do?
24. Inverse Quantiles
● What’s the 95th percentile latency?
○ q(0.95) = 10ms
● What percent of requests exceeded 10ms?
○ 5% for this data set; what about others?
25. Inverse Quantile calculation
1. Given a sample value X, locate its bin
2. Using the previous linear interpolation
equation, solve for Q given X
28. Inverse Quantile calculation
1. Given a sample value X, locate its bin
2. Using the previous linear interpolation
equation, solve for Q given X
3. Sum the bin counts up to Q as Qleft
4. Inverse quantile qinv(X) = (Qtotal-Qleft)/Qtotal
5. For Qleft=700, Qtotal = 1,000, qinv(X) = 0.3
6. 30% of sample values exceeded X
Hello! I’m Fred Moyer from Circonus, and today I’ll be talking about effective management of high volume numeric data with histograms.
There’s my twitter handle, it uses a ph instead of an f. I’m an Engineer at Circonus who does a lot of talking to other engineers. I’m a recovering C and Perl programmer (though I’ve been writing more C lately). I’ve been using histograms to monitor operational systems for about seven years, but I’ve been really diving into them over the past few years.
I want to start off by saying that everything I’m about to show you was motivated by pain. Circonus evolved from people working with large distributed systems who needed to diagnose why those systems weren’t behaving as they should. The observability tools that we employed to peer into those systems yielded a massive amount of operational telemetry.
The existing monitoring tools at hand couldn’t handle the scale of that telemetry. Also, nearly all of those tools only worked with average values. As some of you probably know, using average based metrics doesn’t get you very far.
Just what does it mean to effectively manage billions of data points? Your system has to be both performant and scalable. How do you accomplish that?
Not only do your algorithms have to be on point, but your implementation of them has to be efficient. You want to avoid allocating memory where possible, avoid copying data (pass pointers around instead), avoid locks, and avoid waits. Lots of little optimizations add up to being able to run your code as close to the metal as possible.
You also need to be able to scale your data structures. They need to be as size efficient as possible, which means using strongly typed languages with the optimum choice of data types. We’ll look at this in a bit, but first let’s dive into histograms.
How many people here know what a histogram is?
This histogram diagram shows a skewed histogram where the mode is near the minimum value q(0). The Y axis is the number of samples (or sample density), and the X axis shows the sample value.
How many have used a histogram in practice?
How many people know what a percentile or a quantile is?
On this histogram we can see that the median is slightly left of the midpoint between the highest and lowest values. The mode is at a low sample value, so the median is below the mean, or average. The 90th percentile, also called q(0.9), is the point below which 90 percent of the sample values fall.
How many people here know what a heatmap is?
This might look like the display in KITT from Knight Rider (2nd generation display), but this is a heatmap. A heatmap is essentially a series of histograms over time.
How many have used a heatmap in practice?
This heatmap represents web service request latency. The parts that are red are where the sample density is highest, so we can see that most of the latencies concentrate around 500 nanoseconds. We can overlay quantiles onto this visualization; we'll cover that in a bit.
Fixed bucket histograms require the user to specify the bucket or bin boundaries. The terms bucket and bin are generally used interchangeably to refer to the boundaries between histogram data elements. Wikipedia uses bin as the designation.
Traits:
Histogram bin sizes can be fine tuned for known data sets to achieve increased precision
Cannot be merged with other types of histograms because the bin boundaries are likely uncommon
Lesser experienced users will likely pick suboptimal bin sizes
If you change your bin sizes, you can’t do calculations across older configurations
Approximate histograms such as the t-digest (created by Ted Dunning) use approximations of values, as in this example, which displays centroids. The number of samples on each side of a centroid is the same; the centroids mark the weighted mean values of clusters of adjacent samples.
Traits:
Space efficient
High accuracy at extreme percentiles (95%, 99%+)
Worst case errors ~10% at the median with small sample sizes
Can be merged with other t-digest histograms
Linear histograms have evenly spaced bins, such as one bin per integer. Because the bins are all the same size, this type of histogram uses a large number of bins.
Traits:
Accuracy dependent on data distribution and bin size
Low accuracy at fractional sample values (though this indicates improperly sized bins)
Inefficient bin footprint at higher sample values
Log Linear histograms have bins at logarithmically increasing intervals
Traits:
High accuracy at all sample values
Fits all ranges of data well
Worst case bin error ~5%, but only with absurdly low sample density
Bins are often subdivided into even slices for increased accuracy
HDR histograms (high dynamic range) are a type of log linear histograms
Cumulative histograms are different from other types of histograms in that each successive bin contains the sum of the counts of previous bins.
Traits:
Final bin contains the total sample count, and as such is q(1)
Used at Google for their Monarch monitoring system
Easy to calculate bin quantile - just divide the count by the maximum
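A sketch of that last trait, in toy Python (my illustration, not Monarch's actual implementation):

```python
from itertools import accumulate

# Per-bin sample counts for an ordinary histogram.
counts = [5, 20, 40, 25, 10]

# Cumulative form: each bin holds the running total of all previous bins.
cumulative = list(accumulate(counts))   # [5, 25, 65, 90, 100]

# The final bin is the total sample count, i.e. q(1).
total = cumulative[-1]

# The quantile at each bin's upper boundary: just divide by the maximum.
bin_quantiles = [c / total for c in cumulative]
print(bin_quantiles)  # [0.05, 0.25, 0.65, 0.9, 1.0]
```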
There’s nothing stopping you from putting together your own histogram implementation based on a combination of the ones I’ve just shown you. When preparing this presentation I researched cumulative histograms, and mistakenly interpreted the cumulative aspect to be cumulative over time, not over bin counts. So in effect I had created a custom histogram specification, one that summed bin counts over time.
Traits:
O(1) calculations for time based ranges - you don’t need to sum up all the intermediate histograms, just diff the time slices.
For displaying N slices in a heatmap, you have 2N operations (diff each time interval) instead of N operations as you would with these other implementations.
Copes better with data loss in the middle of time ranges
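A toy sketch of that diffing idea, with hypothetical per-minute snapshots (the data and function name are mine, for illustration):

```python
# A histogram that is cumulative over time:
# snapshots[t][b] = total count in bin b from the start through minute t.
snapshots = [
    [0, 0, 0],   # start
    [3, 1, 0],   # through minute 1
    [5, 4, 1],   # through minute 2
    [9, 6, 2],   # through minute 3
]

def range_histogram(start, end):
    """Histogram for minutes (start, end]: diff two snapshots.

    Two reads regardless of how many intervals the range spans, which is
    where the O(1) time-range claim comes from.
    """
    return [b - a for a, b in zip(snapshots[start], snapshots[end])]

print(range_histogram(0, 3))  # [9, 6, 2] -- the whole range
print(range_histogram(1, 3))  # [6, 5, 2] -- minutes 2-3, still two reads
```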
We’re going to take a look at what a programmatic implementation of a log linear histogram looks like. Circonus has released our implementation of log linear histograms, in both C and Golang.
We’ll look at a few things:
Why it scales
Why it is fast
How to calculate quantiles
First let’s examine what the visual representation of this histogram is, to get an idea of the structure. At first glance this looks like a linear histogram, but take a look at the 1 million point on the X axis.
You’ll notice a change in bin size by a factor of 10. Where there was a bin from 990k to 1M, the next bin spans 1M to 1.1M. Each power of 10 contains 90 bins evenly spaced.
Why not 100 bins you might think? Because the lower bound isn’t zero. In the case of a lower bound of 1 and an upper bound of 10, 10-1 = 9, and spacing those in 0.1 increments yields 90 bins.
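The 90-bins-per-decade scheme just described can be sketched in a few lines. This is a simplified Python illustration (the released implementations are in C and Go); the function name is mine, and it ignores zero and negative samples, which the real library handles.

```python
import math

def bin_of(value):
    """Map a positive sample to its (mantissa, exponent) log-linear bin.

    Each power of ten holds 90 evenly spaced bins: the mantissa runs
    10..99, so [990k, 1M) is bin (99, 5) and [1M, 1.1M) is bin (10, 6).
    Simplified sketch: positive values only.
    """
    exponent = math.floor(math.log10(value))
    mantissa = int(value / 10.0 ** (exponent - 1))  # in 10..99
    return mantissa, exponent

print(bin_of(995_000))    # (99, 5): the bin spanning 990,000-1,000,000
print(bin_of(1_050_000))  # (10, 6): the bin spanning 1,000,000-1,100,000
```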
Here’s an alternate look at the bin size transition. You can see the transition to larger bins at the 1,000 sample value boundary.
The C implementation of each bin is a struct containing a value and exponent struct paired with the count of samples in that bin. This diagram shows the overall memory footprint of each bin. The value is one byte representing the value of the data sample times 10. The Exponent is the power of 10 of the bin, and ranges from -128 to +127.
The sample count is an unsigned 64 bit integer. This field is variable bit encoded, and as such occupies a maximum of 8 bytes.
Many of the existing time series data stores out there store a single data point as an average using the value as a uint64. That’s one value for every eight bytes - this structure can store virtually an infinite count of samples in this bin range with the same data storage requirements.
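The 10 bytes/bin figure used in the storage slides follows from this layout. Here is a quick sanity check of the worst-case footprint; a sketch only, not the actual on-disk encoding, since the var-bit count lets sparse bins take less than the full eight bytes.

```python
import struct

# Worst-case bin record: one signed byte for the value (sample mantissa
# times 10), one signed byte for the exponent (-128..127), and a full
# eight bytes for the uint64 sample count.
bin_record = struct.pack("<bbQ", 99, 5, 2**64 - 1)
print(len(bin_record))  # 10 bytes/bin
```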
Let’s take a look at what the storage footprint is in practice for this type of bin data structure. In practice, we have not seen more than a 300 bin span for operational sample sets. Bins are not pre-allocated linearly, they are allocated as samples are recorded for that span, so the histogram data structure can have gaps between bins containing data.
These calculations show that we can store 30 days of one minute distributions in a maximum space of 123.6 megabytes. Less than a second of disk read operations if that.
Now, 30 days of one minute averages only takes about a third of a megabyte - but that data is essentially useless for any sort of analysis.
Let’s examine what a year’s worth of data looks like in five minute windows.
The same calculation with different values yields a maximum of about 300 megabytes to represent a year's worth of data in five minute windows.
Note that this is invariant to the total number of samples; the biggest factors in the actual size of the data are the span of bins covered, and the compression factor per bin.
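The two storage figures can be reproduced with a few lines of arithmetic (Python here purely as a calculator):

```python
# Reproduce the storage arithmetic from the slides.
BYTES_PER_BIN = 10   # 2-byte bin key plus up-to-8-byte count
BIN_SPAN = 300       # widest bin span seen in practice

# 30 days of one-minute histograms.
month = 30 * 24 * 60 * BIN_SPAN * BYTES_PER_BIN
# 365 days of five-minute histograms.
year = 365 * 24 * 12 * BIN_SPAN * BYTES_PER_BIN

print(f"{month / 1024 / 1024:.1f} MB")  # 123.6 MB
print(f"{year / 1024 / 1024:.1f} MB")   # 300.8 MB
```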
Let’s talk about performing quantile calculations. The quantile notation q of X is just a fancy way of specifying a percentile. The 90th percentile would be q of 0.9, the 95th would be q of 0.95, and the maximum would be q of 1.
So say we wanted to calculate the median, q of 0.5. First we iterate over the histogram and sum up the counts in all of the bins. Remember that cumulative histogram we talked about earlier? Guess what - the far right bin already contains the total count, so if you are using a cumulative histogram variation, that part is already done for you. That's a small optimization for quantile calculation, since you have an O(1) operation instead of O(n).
Now we multiply that count by the quantile value. If we say our count is 1,000 and our median quantile is 0.5, we get a Q of 500.
Next iterate over the bins, summing the left and right bin boundary counts, until Q is between those counts. If the count Q matches the left bin boundary count, our quantile value is that bin boundary value. If Q falls in between the left and right boundary counts, we use linear interpolation to calculate the sample value.
Let’s go over the linear interpolation part of quantile calculation. Once we have walked the bins to where Q is located in a certain bin, we use the formula shown here.
Little q of X is the left bin boundary value, plus big Q minus the count at the left bin boundary, divided by the difference between the counts at the left and right sides of the bin, times the bin width.
Using this approach we can determine the quantile for a log linear histogram to a high degree of accuracy.
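The walk-and-interpolate procedure can be sketched as follows. This is a simplified Python version; the bin layout and names are my illustration, not the optimized C/Go implementation.

```python
def quantile(bins, x):
    """q(x) for a histogram given as (left_edge, right_edge, count) tuples.

    Sum the counts, multiply by x to get the target count Q, walk the
    bins until Q falls inside one, then linearly interpolate.
    """
    total = sum(count for _, _, count in bins)   # C
    target = x * total                           # Q
    seen = 0.0                                   # count at the left boundary
    for left, right, count in bins:
        if seen + count >= target:
            # Fraction of this bin's count needed to reach Q, times width.
            return left + (target - seen) / count * (right - left)
        seen += count
    return bins[-1][1]  # q(1): the maximum bin boundary

bins = [(0, 10, 400), (10, 20, 300), (20, 30, 300)]
print(quantile(bins, 0.5))  # ~13.33: sample 500 sits a third into bin 10-20
```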
In terms of the error levels we can experience in quantile calculation: with one sample in a bin, it is possible to see a worst case 5% error if the value is 109.9 in a bin spanning 100-110; the reported value would be 105. The worst case error for one sample shrinks to about 0.5% at the top of a decade: a value in a bin spanning 950-960 is reported as 955, at most 5 off.
However, our use of histograms is geared towards very large sets of data. With bins that contain dozens, hundreds, or more samples, accuracy should be expected to 3 or 4 nines.
So let’s review what we’ve just gone over. We looked at a few different implementations of histograms. We got an overview of Circonus’ open source implementation of a log linear histogram. What data structures it uses to codify bin data. The algorithm used to calculate quantiles (minus some implementation specific optimizations).
I highly recommend reading the code. The points I mentioned previously about avoiding locks, memory allocations, waits, and copies are a big part of making the calculations highly optimized. Good algorithms are still limited by how they are implemented on the metal itself.
From what we’ve looked at, most folks should be able to implement their own time series data store based on the open source library here, if you are into that sort of thing.
One problem to solve might be to collect all of the syscall latency data on your systems via eBPF, and compare the 99th percentile syscall latencies once you applied the Spectre patch, which restricted CPU branch prediction. You might find that your 99th percentile syscall overhead has gone from 10 nanoseconds to 30!
While percentiles are a good first step for analyzing your service level objectives, what if we could look at the change in the number of requests that exceeded that objective? Say 10% of your syscalls exceeded 10 nanoseconds before the Spectre patch, but after patching, 50% of them exceeded 10 nanoseconds.
We can also use histograms to calculate inverse quantiles. Humans can reason about thresholds more naturally than they can about quantiles.
If my web service gets a surge in requests and my 99th percentile response time doesn't change, I not only just got a bunch of angry users whose requests took too long; even worse, I don't know by how much those requests exceeded that percentile. I don't know how bad the bleeding is.
Inverse quantiles allow me to set a threshold sample value, then calculate what percentage of values exceeded it.
To calculate the inverse quantile, we start with the target sample value and work backwards towards the target count Q.
Given the previous equation we had, we can use some middle school level algebra (well, it was when I was in school) and solve for Q
Solving for Q, we get 700, which is expected for our value of 1.05.
Now that we know Q, we can add up the counts to the left of it, subtract that from the total and then divide by the total to get the percentage of sample values which exceeded our sample value X.
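Putting the steps together, here is a Python sketch of the inverse quantile, using the same numbers as the example above (Qleft = 700 of Qtotal = 1,000 for X = 1.05); the bin layout and names are my illustration:

```python
def inverse_quantile(bins, x):
    """Fraction of samples whose value exceeded x.

    bins: (left_edge, right_edge, count) tuples in ascending order.
    """
    total = sum(c for _, _, c in bins)           # Qtotal
    q_left = 0.0                                 # Qleft
    for left, right, count in bins:
        if x < right:
            # Solve the linear interpolation equation for Q in this bin.
            q_left += (x - left) / (right - left) * count
            break
        q_left += count
    return (total - q_left) / total

bins = [(0.9, 1.0, 500), (1.0, 1.1, 400), (1.1, 1.2, 100)]
print(inverse_quantile(bins, 1.05))  # ~0.3: 30% of samples exceeded 1.05
```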
So if we are running a website, and we know from industry research that we’ll lose users if our requests take longer than three seconds, we can set X at 3 seconds, calculate the inverse quantile for our request times, and figure out what percentages of our users are getting mad at us. Let’s take a look at how we have been doing that in real time with Circonus.
This is a heatmap of one year's worth of latency data for a web service. It contains about 300 million samples. Each histogram window in the heatmap is one day's worth of data; five minute histograms are merged together to create that window. The overlay window shows the distribution for the day the mouse is hovering over.
Here we added a 99th percentile overlay, which you can see implemented as the green lines. It's pretty easy to spot the monthly recurring rises in the percentile to around 10 seconds. Looks like a network timeout issue; those usually default to around 10 seconds. For most of the time the 99th percentile is relatively low, a few hundred milliseconds.
Here we can see the inverse quantile shown for 500 millisecond request times. As you can see, for most of the graph, at least 90% of the requests are finishing within the 500 millisecond service level objective. We can still see the monthly increases which we believe are related to network client timeouts, but when they spike, they only affect about 25% of requests - not great, but at least we know the extent to which our SLO is exceeded.
We can take that percentage of requests that violated our SLO of 500 milliseconds and multiply it by the total number of requests to get the number of requests which exceeded 500 milliseconds. This has direct bearing on your business if each of these failed requests is costing you money.
Note that we’ve dropped the range here to a month to get a closer look at the data presented by these calculations.
What if we sum up over time the number of requests that exceeded 500 milliseconds? Here we integrate the number of requests that exceeded the SLO over time, and plot that as the increasing line. You can clearly see now where things get bad with our service by the increase in slope of the blue line. What if we had a way to automatically detect when that is happening and then page whoever is on call?
Here we can see the red line on the graph is the output of a constant anomaly detection algorithm. When the number of SLO violations increases, the code identifies it as an anomaly and rates it on a 0-100 scale. There are several anomaly detection algorithms available out there, and most don’t use a lot of complicated machine learning, just statistics.