Histograms are useful, but their storage requirements can be large: each bin must be wide enough to hold the largest count it might see, so a histogram with N-byte bins costs (number of bins × N) bytes.
For many applications (imaging, profiling), the exact count is not important.
Here's a way of reducing the storage required for each bin. Instead of holding a simple count, the bin contains an approximation to the logarithm to base b of the count. This logarithm is a much smaller number than the count, so it can be stored in fewer bits.
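To get a feel for the savings, consider an exact count of one million (a minimal illustration; the specific numbers are mine, not from the scheme itself):

import math

# An exact count of 1,000,000 needs 20 bits, but its base-2 logarithm
# (about 19.9) fits in 5 bits. More generally, an 8-bit bin holding
# log2(count) can represent counts up to 2**255.
count = 1000000
exact_bits = math.ceil(math.log2(count + 1))            # 20
log_bits = math.ceil(math.log2(math.log2(count) + 1))   # 5
print(exact_bits, log_bits)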
In this scheme a bin with stored value n represents a count of b^n. Advancing the stored value from n to n + 1 raises the represented count by b^(n+1) − b^n = b^n(b − 1), so to keep the estimate unbiased, each real increment should advance the bin with probability:

p = 1 / (b^n (b − 1))
For b = 2 this simplifies to:

p = 1 / 2^n
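As a quick numeric check (a minimal sketch; the variable names are mine, not part of the demonstration below):

# For b = 2, a bin holding n = 3 represents a count of 2**3 = 8,
# so the next real increment should advance it with probability 1/8.
b, n = 2, 3
prob = 1.0 / (b ** n * (b - 1))
print(prob)  # 0.125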
A simple demonstration using bases of 1.1, 2 and 10:
import random


class LinearHistogram:
    """Exact histogram: each bin holds the full count."""

    def __init__(self):
        self.bins = [0 for i in range(10)]

    def increment(self, i):
        self.bins[i] += 1

    def counts(self):
        return self.bins


class StochasticHistogram:
    """Approximate histogram: each bin holds log_base(count)."""

    def __init__(self, base):
        self.bins = [0 for i in range(10)]
        self.base = base

    def increment(self, i):
        # A bin holding n represents a count of base**n, so advance it
        # with probability 1 / (base**n * (base - 1)).
        prob = 1.0 / ((self.base ** self.bins[i]) * (self.base - 1))
        if random.random() < prob:
            self.bins[i] += 1

    def counts(self):
        return [int(self.base ** n) for n in self.bins]


# Build a uniform data set and compare the exact and stochastic counts.
data = [random.randrange(10) for i in range(1000000)]

lh = LinearHistogram()
for d in data:
    lh.increment(d)
print('linear ', lh.counts())

for base in [1.1, 2, 10]:
    sh = StochasticHistogram(base)
    for d in data:
        sh.increment(d)
    print('stochastic %g:' % base, sh.counts())
Gives:
linear  [100398, 99812, 99592, 100007, 100130, 100025, 99895, 100272, 99970, 99899]
stochastic 1.1: [101979, 76619, 84280, 84280, 84280, 101979, 135735, 84280, 92709, 123395]
stochastic 2: [131072, 65536, 65536, 262144, 131072, 65536, 131072, 131072, 131072, 131072]
stochastic 10: [100000, 100000, 100000, 100000, 10000, 100000, 100000, 10000, 100000, 100000]
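The output illustrates the trade-off: base 1.1 tracks the true counts of roughly 100,000 fairly closely, base 2 is within a small factor, and base 10 can only ever report powers of ten, so some bins are out by 10×. A base closer to 1 gives better accuracy, but produces larger stored values and so needs more bits per bin.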