Stochastic Histograms

Histograms are useful, but their storage requirements can be large: if each bin is N bytes, then the storage is (number of bins  ×  N) bytes.

For many applications (imaging, profiling), the exact count is not important.

Here's a way of reducing the storage required for each bin. Instead of holding a simple count, the bin contains an approximation to the logarithm to base b of the count. This logarithm is a smaller number than the count, so can use less storage.

In this scheme, when incrementing a bin with value n, which represents bn, the probability that it should increment is given by:

P(increment) = (1)/(bn + 1 − bn) = (1)/(bn(b − 1))

For b = 2 then the probability is:

P(increment) = (1)/(2n)

A simple demonstration using a bases of 1.1, 2.0 and 1.0:

import random

class LinearHistogram:

    def __init__(self):
        self.bins = [ 0 for i in range(10) ]

    def increment(self, i):
        self.bins[i] += 1

    def counts(self):
        return self.bins

class StochasticHistogram:

    def __init__(self, base):
        self.bins = [ 0 for i in range(10) ]
        self.base = base

    def increment(self, i):
        prob = 1.0 / ((self.base ** self.bins[i]) * (self.base - 1))
        if random.random() < prob:
            self.bins[i] += 1

    def counts(self):
        return [int(self.base ** n) for n in self.bins]

data = [random.randrange(10) for i in range(1000000)]

lh = LinearHistogram()
for d in data:
    lh.increment(d)
print 'linear      ', lh.counts()

for base in [1.1, 2, 10]:
    sh = StochasticHistogram(base)
    for d in data:
        sh.increment(d)
    print 'stochastic %g:' % base, sh.counts()

Gives:

linear       [100398, 99812, 99592, 100007, 100130, 100025, 99895, 100272, 99970, 99899]

stochastic 1.1: [101979, 76619, 84280, 84280, 84280, 101979, 135735, 84280, 92709, 123395]

stochastic 2: [131072, 65536, 65536, 262144, 131072, 65536, 131072, 131072, 131072, 131072]

stochastic 10: [100000, 100000, 100000, 100000, 10000, 100000, 100000, 10000, 100000, 100000]