Class MergingDigest

  • All Implemented Interfaces:
    java.io.Serializable

    public class MergingDigest
    extends AbstractTDigest
    Maintains a t-digest by collecting new points in a buffer that is then sorted occasionally and merged into a sorted array that contains previously computed centroids.

    This can be very fast because the cost of sorting and merging is amortized over several insertion. If we keep N centroids total and have the input array is k long, then the amortized cost is something like

    N/k + log k

    These costs even out when N/k = log k. Balancing costs is often a good place to start in optimizing an algorithm. For different values of compression factor, the following table shows estimated asymptotic values of N and suggested values of k:

    CompressionNk
    507825
    10015742
    20031473

    The virtues of this kind of t-digest implementation include:

    • No allocation is required after initialization
    • The data structure automatically compresses existing centroids when possible
    • No Java object overhead is incurred for centroids since data is kept in primitive arrays

    The current implementation takes the liberty of using ping-pong buffers for implementing the merge resulting in a substantial memory penalty, but the complexity of an in place merge was not considered as worthwhile since even with the overhead, the memory cost is less than 40 bytes per centroid which is much less than half what the AVLTreeDigest uses. Speed tests are still not complete so it is uncertain whether the merge strategy is faster than the tree strategy.

    See Also:
    Serialized Form
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  MergingDigest.Encoding  
    • Constructor Summary

      Constructors 
      Constructor Description
      MergingDigest​(double compression)
      Allocates a buffer merging t-digest.
      MergingDigest​(double compression, int bufferSize)
      If you know the size of the temporary buffer for incoming points, you can use this entry point.
      MergingDigest​(double compression, int bufferSize, int size)
      Fully specified constructor.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private void add​(double[] m, double[] w, int count, java.util.List<java.util.List<java.lang.Double>> data)  
      void add​(double x, int w)
      Adds a sample to a histogram.
      (package private) void add​(double x, int w, Centroid base)  
      private void add​(double x, int w, java.util.List<java.lang.Double> history)  
      void add​(java.util.List<? extends TDigest> others)  
      void asBytes​(java.nio.ByteBuffer buf)
      Serialize this TDigest into a byte buffer.
      (package private) static double asinApproximation​(double x)  
      void asSmallBytes​(java.nio.ByteBuffer buf)
      Serialize this TDigest into a byte buffer.
      private static double bound​(double v)  
      int byteSize()
      Returns the number of bytes required to encode this TDigest using #asBytes().
      double cdf​(double x)
      Returns the fraction of all points added which are <= x.
      int centroidCount()  
      java.util.Collection<Centroid> centroids()
      A Collection that lets you go through the centroids in ascending order by mean.
      (package private) int checkWeights()
      Exposed for testing.
      private int checkWeights​(double[] w, double total, int last)  
      void compress()
      Re-examines a t-digest to determine whether some centroids are redundant.
      double compression()
      Returns the current compression factor.
      private static double eval​(double[] model, double[] vars)  
      static MergingDigest fromBytes​(java.nio.ByteBuffer buf)  
      private double integratedLocation​(double q)
      Converts a quantile into a centroid scale value.
      private double integratedQ​(double k)  
      private void merge​(double[] incomingMean, double[] incomingWeight, int incomingCount, java.util.List<java.util.List<java.lang.Double>> incomingData, int[] incomingOrder, double unmergedWeight)  
      private void mergeNewValues()  
      double quantile​(double q)
      Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
      TDigest recordAllData()
      Turns on internal data recording.
      long size()
      Returns the number of points that have been added to this TDigest.
      int smallByteSize()
      Returns the number of bytes required to encode this TDigest using #asSmallBytes().
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • compression

        private final double compression
      • lastUsedCell

        private int lastUsedCell
      • totalWeight

        private double totalWeight
      • weight

        private final double[] weight
      • mean

        private final double[] mean
      • data

        private java.util.List<java.util.List<java.lang.Double>> data
      • unmergedWeight

        private double unmergedWeight
      • tempUsed

        private int tempUsed
      • tempWeight

        private final double[] tempWeight
      • tempMean

        private final double[] tempMean
      • tempData

        private java.util.List<java.util.List<java.lang.Double>> tempData
      • order

        private final int[] order
      • usePieceWiseApproximation

        private static boolean usePieceWiseApproximation
      • useWeightLimit

        private static boolean useWeightLimit
    • Constructor Detail

      • MergingDigest

        public MergingDigest​(double compression)
        Allocates a buffer merging t-digest. This is the normally used constructor that allocates default sized internal arrays. Other versions are available, but should only be used for special cases.
        Parameters:
        compression - The compression factor
      • MergingDigest

        public MergingDigest​(double compression,
                             int bufferSize)
        If you know the size of the temporary buffer for incoming points, you can use this entry point.
        Parameters:
        compression - Compression factor for t-digest. Same as 1/\delta in the paper.
        bufferSize - How many samples to retain before merging.
      • MergingDigest

        public MergingDigest​(double compression,
                             int bufferSize,
                             int size)
        Fully specified constructor. Normally only used for deserializing a buffer t-digest.
        Parameters:
        compression - Compression factor
        bufferSize - Number of temporary centroids
        size - Size of main buffer
    • Method Detail

      • recordAllData

        public TDigest recordAllData()
        Turns on internal data recording.
        Overrides:
        recordAllData in class AbstractTDigest
        Returns:
        This TDigest so that configurations can be done in fluent style.
      • add

        public void add​(double x,
                        int w)
        Description copied from class: TDigest
        Adds a sample to a histogram.
        Specified by:
        add in class TDigest
        Parameters:
        x - The value to add.
        w - The weight of this point.
      • add

        private void add​(double x,
                         int w,
                         java.util.List<java.lang.Double> history)
      • add

        private void add​(double[] m,
                         double[] w,
                         int count,
                         java.util.List<java.util.List<java.lang.Double>> data)
      • add

        public void add​(java.util.List<? extends TDigest> others)
        Specified by:
        add in class TDigest
      • mergeNewValues

        private void mergeNewValues()
      • merge

        private void merge​(double[] incomingMean,
                           double[] incomingWeight,
                           int incomingCount,
                           java.util.List<java.util.List<java.lang.Double>> incomingData,
                           int[] incomingOrder,
                           double unmergedWeight)
      • checkWeights

        int checkWeights()
        Exposed for testing.
      • checkWeights

        private int checkWeights​(double[] w,
                                 double total,
                                 int last)
      • integratedLocation

        private double integratedLocation​(double q)
        Converts a quantile into a centroid scale value. The centroid scale is nominally the number k of the centroid that a quantile point q should belong to. Due to round-offs, however, we can't align things perfectly without splitting points and centroids. We don't want to do that, so we have to allow for offsets. In the end, the criterion is that any quantile range that spans a centroid scale range more than one should be split across more than one centroid if possible. This won't be possible if the quantile range refers to a single point or an already existing centroid.

        This mapping is steep near q=0 or q=1 so each centroid there will correspond to less q range. Near q=0.5, the mapping is flatter so that centroids there will represent a larger chunk of quantiles.

        Parameters:
        q - The quantile scale value to be mapped.
        Returns:
        The centroid scale value corresponding to q.
      • integratedQ

        private double integratedQ​(double k)
      • asinApproximation

        static double asinApproximation​(double x)
      • eval

        private static double eval​(double[] model,
                                   double[] vars)
      • bound

        private static double bound​(double v)
      • compress

        public void compress()
        Description copied from class: TDigest
        Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space. The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression. This is a destructive operation that is not thread-safe.
        Specified by:
        compress in class TDigest
      • size

        public long size()
        Description copied from class: TDigest
        Returns the number of points that have been added to this TDigest.
        Specified by:
        size in class TDigest
        Returns:
        The sum of the weights on all centroids.
      • cdf

        public double cdf​(double x)
        Description copied from class: TDigest
        Returns the fraction of all points added which are <= x.
        Specified by:
        cdf in class TDigest
      • quantile

        public double quantile​(double q)
        Description copied from class: TDigest
        Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
        Specified by:
        quantile in class TDigest
        Parameters:
        q - The desired fraction
        Returns:
        The value x such that cdf(x) == q
      • centroids

        public java.util.Collection<Centroid> centroids()
        Description copied from class: TDigest
        A Collection that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
        Specified by:
        centroids in class TDigest
        Returns:
        The centroids in the form of a Collection.
      • compression

        public double compression()
        Description copied from class: TDigest
        Returns the current compression factor.
        Specified by:
        compression in class TDigest
        Returns:
        The compression factor originally used to set up the TDigest.
      • byteSize

        public int byteSize()
        Description copied from class: TDigest
        Returns the number of bytes required to encode this TDigest using #asBytes().
        Specified by:
        byteSize in class TDigest
        Returns:
        The number of bytes required.
      • smallByteSize

        public int smallByteSize()
        Description copied from class: TDigest
        Returns the number of bytes required to encode this TDigest using #asSmallBytes(). Note that this is just as expensive as actually compressing the digest. If you don't care about time, but want to never over-allocate, this is fine. If you care about compression and speed, you pretty much just have to overallocate by using allocating #byteSize() bytes.
        Specified by:
        smallByteSize in class TDigest
        Returns:
        The number of bytes required.
      • asBytes

        public void asBytes​(java.nio.ByteBuffer buf)
        Description copied from class: TDigest
        Serialize this TDigest into a byte buffer. Note that the serialization used is very straightforward and is considerably larger than strictly necessary.
        Specified by:
        asBytes in class TDigest
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • asSmallBytes

        public void asSmallBytes​(java.nio.ByteBuffer buf)
        Description copied from class: TDigest
        Serialize this TDigest into a byte buffer. Some simple compression is used such as using variable byte representation to store the centroid weights and using delta-encoding on the centroid means so that floats can be reasonably used to store the centroid means.
        Specified by:
        asSmallBytes in class TDigest
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • fromBytes

        public static MergingDigest fromBytes​(java.nio.ByteBuffer buf)