Short Version
Welford's Online Algorithm lets you keep a running value for variance - meaning you don't have to keep all the values (e.g. in a memory constraned system).
Is there something similar for Interquartile Range (IQR)? An online algorithm that lets me know the middle 50% range without having to keep all historical values?
Long Version
Keeping a running average of data, where you are memory constrainted, is pretty easy:
Double sumInt64 count
And from this you can compute the mean:
mean = sum / count
This allows hours, or years, of observations to quietly be collected, but only take up 16-bytes.
Welford's Algorithm for Variance
Normally when you want the variance (or standard deviation), you have to have all your readings, because you have to compute reading-mean for all previous readings:
Double sumOfSquaredError = 0;
foreach (Double reading in Readings)
sumOfSquaredError += Math.Square(reading - mean);
Double variance = sumOfSquaredError / count
Which can add up to terrabytes of data over time, rather than taking up something like 16 bytes.
Which is why it was nice when Welford came up with an online algorithm for computing variance of a stream of readings:
It is often useful to be able to compute the variance in a single pass, inspecting each value xi only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.
The algorithm for adding a new value to the running variance is:
void addValue(Double newValue) {
Double oldMean = sum / count;
sum += newValue;
count += 1;
Double newMean = sum / count;
if (count > 1)
variance = ((count-2)*variance + (newValue-oldMean)*(newValue-newMean)) / (count-1);
else
variance = 0;
}
How about an online algorithm for Interquartile Range (IQR)?
Interquartile Range (IRQ) is another method of getting the spread of data. It tells you how wide the middle 50% of the data is:
And from that people then generally draw a IQR BoxPlot:
Or at the very least, have the values Q1 and Q3.
Is there a way to calculate the Interquartile Range without having to keep all the recorded values?
In other words:
Is there something like Welford's online variance algorithm, but for Interquartile Range?
Knuth, Seminumerical Algorithms
You can find Welford's algorithm explained in Knuth's 2nd volume Seminumerical Algorithms:
(just in case anyone thought this isn't computer science or programming related)
Research Effort
- Stackoverflow: Simple algorithm for online outlier detection of a generic time series
- Stats: Simple algorithm for online outlier detection of a generic time series
- Online outlier detection for data streams (IDEAS '11: Proceedings of the 15th Symposium on International Database Engineering & Applications, September 2011, Pages 88–96)
- Stats: Robust outlier detection in financial timeseries
- Stats: Online outlier detection
- Distance-based outlier detection in data streams (Proceedings of the VLDB Endowment, Volume 9, Issue 12, August 2016, pp 1089–1100) pdf
- Online Outlier Detection Over Data Streams (Hongyin Cui, Masters Thesis, 2005)

















