5 SAMPLING, HISTOGRAMS, AND RANDOM VARIATE GENERATION

J. S. Vitter. “Faster Methods for Random Sampling,” Communications of the ACM, 27(7), July 1984, 703-718.
Several new methods are presented for selecting $n$ records at random without replacement from a file containing $N$ records. Each algorithm selects the records for the sample in a sequential manner -- in the same order the records appear in the file. The algorithms are online in that the records for the sample are selected iteratively with no preprocessing. The algorithms require a constant amount of space and are short and easy to implement. The main result of this paper is the design and analysis of Algorithm D, which does the sampling in $O(n)$ time, on the average; roughly $n$ uniform random variates are generated, and approximately $n$ exponentiation operations (of the form $a^b$, for real numbers $a$ and $b$) are performed during the sampling. This solves an open problem in the literature. CPU timings on a large mainframe computer indicate that Algorithm D is significantly faster than the sampling algorithms in use today.
For an improved and optimized version of the random sampling method, see a related paper. For reservoir methods, where $N$ is not known in advance, see a related paper.
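As a point of reference, the sequential setting can be illustrated with the straightforward record-by-record baseline, often called selection sampling. This is a minimal sketch and not Algorithm D: it draws one uniform variate for every record it reads, whereas the abstract describes Algorithm D as needing far fewer. The function name `sequential_sample` and its parameters are illustrative assumptions; `n` is the sample size and `N` the number of records.

```python
import random

def sequential_sample(records, n, rng=random):
    """Select n of the records, in file order, with every subset equally likely.

    Baseline selection sampling (one uniform variate per record read).
    This is NOT Algorithm D; it only illustrates the sequential setting.
    """
    N = len(records)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    sample = []
    for seen, record in enumerate(records):
        still_needed = n - len(sample)
        if still_needed == 0:
            break
        # Select this record with probability still_needed / (records not yet read).
        if rng.random() * (N - seen) < still_needed:
            sample.append(record)
    return sample
```

For example, `sequential_sample(list(range(100)), 10)` returns ten distinct values in increasing order. Because the selection probability reaches 1 once the unread records are all needed, the sample always has exactly `n` entries.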
J. S. Vitter. “Random Sampling with a Reservoir,” ACM Transactions on Mathematical Software, 11(1), March 1985, 37-57.
We introduce fast algorithms for selecting a random sample of $n$ records without replacement from a pool of $N$ records, where the value of $N$ is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in $O(n(1 + \log(N/n)))$ expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.
For sampling methods where $N$ is known in advance, see a related paper.
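The reservoir idea can be sketched with the basic one-pass scheme commonly called Algorithm R. This is an illustration under stated assumptions, not Algorithm Z: it draws a random number for every record after the first `n`, rather than skipping over records. The name `reservoir_sample` is hypothetical.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """Return a uniform random sample of n records from a stream of unknown length.

    Basic reservoir scheme (Algorithm R), shown for illustration only.
    It examines every record; it is NOT the faster Algorithm Z.
    """
    reservoir = []
    for t, record in enumerate(stream):
        if t < n:
            # The first n records fill the reservoir.
            reservoir.append(record)
        else:
            # Record t (0-based) replaces a random slot with probability n / (t + 1).
            j = rng.randrange(t + 1)
            if j < n:
                reservoir[j] = record
    return reservoir
```

The input may be any iterable, including a generator whose length is not known in advance. If the stream holds fewer than `n` records, all of them are returned.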
J. S. Vitter. “An Efficient Algorithm for Sequential Random Sampling,” ACM Transactions on Mathematical Software, 13(1), March 1987, 58-67.
This paper presents an improved and optimized version of the random sampling method from J. S. Vitter, “Faster Methods for Random Sampling,” Communications of the ACM, 27(7), July 1984, 703-718. The object is to choose in sequential online fashion a random sample of size $n$ from a universe of size $N$. For reservoir methods, where $N$ is not known in advance, see a related paper.
Y. Matias, J. S. Vitter and W.-C. Ni. “Dynamic Generation of Discrete Random Variates,” Theory of Computing Systems, 36(4), 2003, 329-358. A shorter version appears in Proceedings of the 4th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '93), Austin, TX, January 1993, 361-370.
We present and analyze efficient new algorithms for generating a random variate distributed according to a dynamically changing set of $N$ weights. The base version of each algorithm generates the discrete random variate in $O(\log^* N)$ expected time and updates a weight in $O(2^{\log^* N})$ expected time in the worst case. We then show how to reduce the update time to $O(\log^* N)$ amortized expected time. We show how to apply our techniques to a recent lookup table technique in order to obtain an expected constant time in the worst case for generation and update. The algorithms are conceptually simple. We give parallel algorithms for parallel generation and update having optimal processors-time product. We also give an efficient dynamic algorithm for maintaining approximate heaps of $N$ elements; each query is required to return an element whose value is within an $\epsilon$ factor of the maximal element value. For $\epsilon= 1/{\mathop{\rm polylog}}(N)$, each query, insertion, or deletion takes $O(\log\log\log N)$ time.
Keywords: random number generator, random variate, alias, bucket, rejection, dynamic data structure, update, approximate priority queue.
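To make the problem statement concrete, the sketch below supports both operations (generate an index with probability proportional to its weight, and update a weight) using a Fenwick tree of partial sums. This is a deliberately simpler, swapped-in technique with $O(\log N)$ time per operation; it is not the paper's method and does not achieve its bounds. The class name `DynamicSampler` and its methods are illustrative assumptions.

```python
import random

class DynamicSampler:
    """Weighted sampling with weight updates, each in O(log N) time.

    Uses a Fenwick (binary indexed) tree; a simple baseline only,
    NOT the algorithms analyzed in the paper.
    """

    def __init__(self, weights):
        self.n = len(weights)
        self.weights = [0.0] * self.n
        self.tree = [0.0] * (self.n + 1)  # 1-based Fenwick tree
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, w):
        """Set the weight of item i to w (w >= 0)."""
        delta = w - self.weights[i]
        self.weights[i] = w
        k = i + 1
        while k <= self.n:
            self.tree[k] += delta
            k += k & -k

    def total(self):
        """Sum of all current weights."""
        s, k = 0.0, self.n
        while k > 0:
            s += self.tree[k]
            k -= k & -k
        return s

    def generate(self, rng=random):
        """Return index i with probability weights[i] / total()."""
        total = self.total()
        if total <= 0:
            raise ValueError("all weights are zero")
        target = rng.random() * total
        # Descend the tree to find the largest prefix whose sum is <= target.
        pos, step = 0, 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= target:
                target -= self.tree[nxt]
                pos = nxt
            step >>= 1
        return min(pos, self.n - 1)  # clamp guards against float round-off
```

With weights `[0, 3, 0, 1]`, `generate()` returns index 1 about three times as often as index 3 and never returns 0 or 2; after `update(1, 0)` it returns only 3.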
Y. Matias, J. S. Vitter, and N. Young. “Approximate Data Structures with Applications,” Proceedings of the 5th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '94), Alexandria, VA, January 1994. The work is the basis of a patent by the authors, “Implementation of Approximate Data Structures,” United States Patent No. 5,519,840, Bell Laboratories, May 21, 1996.
In this paper we introduce the notion of approximate data structures, in which a small amount of error is tolerated in the output. Approximate data structures trade error of approximation for faster operation, leading to theoretical and practical speedups for a wide variety of algorithms. We give approximate variants of the van Emde Boas data structure, which support the same dynamic operations as the standard van Emde Boas data structure, except that answers to queries are approximate. The variants support all operations in constant time provided the error of approximation is $1/{\mathop{\rm polylog}}(n)$, and in $O(\log\log n)$ time provided the error is $1/{\mathop{\rm polynomial}}(n)$, for $n$ elements in the data structure.
We consider the tolerance of prototypical algorithms to approximate data structures. We study in particular Prim's minimum spanning tree algorithm, Dijkstra's single-source shortest paths algorithm, and an on-line variant of Graham's convex hull algorithm. To obtain output which approximates the desired output with the error of approximation tending to zero, Prim's algorithm requires only linear time, Dijkstra's algorithm requires $O(m\log\log n)$ time, and the on-line variant of Graham's algorithm requires constant amortized time per operation.
Y. Matias, J. S. Vitter, and M. Wang. “Wavelet-Based Histograms for Selectivity Estimation,” Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, Washington, June 1998.
Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query $q$, we need to estimate the fraction of records in the database that satisfy $q$. Many commercial database systems maintain histograms to approximate the frequency distribution of values in the attributes of relations.
In this paper, we present a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation. Histograms built on the cumulative data distributions give very good approximations with limited space usage. We give fast algorithms for constructing histograms and using them in an on-line fashion for selectivity estimation. Our histograms also provide quick approximate answers to OLAP queries when the exact answers are not required. Our method captures the joint distribution of multiple attributes effectively, even when the attributes are correlated. Experiments confirm that our histograms offer substantial improvements in accuracy over random sampling and other previous approaches.
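The general idea of a wavelet-based histogram on a cumulative distribution can be sketched as follows. The sketch assumes an unnormalized Haar transform, a domain whose size is a power of two, and naive thresholding that keeps the `m` coefficients of largest magnitude; the paper's actual choice of basis, normalization, and pruning may differ. All function names are hypothetical.

```python
def haar_decompose(values):
    """Unnormalized Haar transform; len(values) must be a power of two."""
    coeffs = list(values)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avgs = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = avgs + diffs  # averages first, detail coefficients after
        n = half
    return coeffs

def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    values = list(coeffs)
    n = 1
    while n < len(values):
        merged = []
        for a, d in zip(values[:n], values[n:2 * n]):
            merged += [a + d, a - d]
        values[:2 * n] = merged
        n *= 2
    return values

def build_histogram(freqs, m):
    """Keep the m largest-magnitude Haar coefficients of the cumulative counts."""
    cum, running = [], 0
    for f in freqs:
        running += f
        cum.append(running)
    coeffs = haar_decompose(cum)
    keep = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:m]
    return len(coeffs), {i: coeffs[i] for i in keep}

def estimate_range(hist, lo, hi):
    """Approximate number of records whose value v satisfies lo <= v <= hi."""
    size, kept = hist
    cum = haar_reconstruct([kept.get(i, 0.0) for i in range(size)])
    below = cum[lo - 1] if lo > 0 else 0.0
    return cum[hi] - below
```

Dividing the estimate by the total record count gives the selectivity as a fraction. When `m` equals the domain size nothing is discarded and the estimate is exact; smaller `m` trades accuracy for space.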
J. S. Vitter, M. Wang, and B. Iyer. “Data Cube Approximation and Histograms via Wavelets,” Proceedings of Seventh International Conference on Information and Knowledge Management (CIKM'98), Washington D.C., November 1998.
There has recently been an explosion of interest in the analysis of data in data warehouses in the field of On-Line Analytical Processing (OLAP). Data warehouses can be extremely large, yet obtaining quick answers to queries is important. In many situations, obtaining the exact answer to an OLAP query is prohibitively expensive in terms of time and/or storage space. It can be advantageous to have fast, approximate answers to queries.
In this paper, we present an I/O-efficient technique based upon a multiresolution wavelet decomposition that yields an approximate and space-efficient representation of the data cube, which is one of the core OLAP operators. We build our compact data cube on the logarithms of the partial sums of the raw data values of a multidimensional array. We get excellent approximations for on-line range-sum queries with limited space usage and computational cost. Multiple data cubes can be handled simultaneously. Each query can generally be answered, depending upon the accuracy supported, in one I/O or a small number of I/Os. Experiments show that our method performs significantly better than other approximation techniques such as histograms and random sampling.
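The role of partial sums in answering range-sum queries can be illustrated on a small two-dimensional array. The sketch below shows only the exact partial-sum step with inclusion-exclusion; it omits the logarithm, the wavelet decomposition, and the I/O-efficient layout that the abstract describes, so it should be read as background rather than as the paper's method. The names `partial_sums` and `range_sum` are illustrative assumptions.

```python
def partial_sums(cube):
    """Return P with P[i][j] = sum of cube[0..i-1][0..j-1] (one extra zero row and column)."""
    rows, cols = len(cube), len(cube[0])
    P = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            P[i + 1][j + 1] = cube[i][j] + P[i][j + 1] + P[i + 1][j] - P[i][j]
    return P

def range_sum(P, r1, c1, r2, c2):
    """Sum of cube[r1..r2][c1..c2] (inclusive) from four partial-sum lookups."""
    return P[r2 + 1][c2 + 1] - P[r1][c2 + 1] - P[r2 + 1][c1] + P[r1][c1]
```

Any rectangular range sum needs only four entries of `P`, so a compact approximation of the partial sums yields an approximate answer with a small, fixed number of lookups.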
J. S. Vitter and M. Wang. “Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets,” Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, PA, June 1999.
Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries.
In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.
We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.
Y. Matias, J. S. Vitter, and M. Wang. “Dynamic Maintenance of Wavelet-Based Histograms,” Proceedings of the 26th International Conference on Very Large Databases (VLDB '00), Cairo, Egypt, September 2000.
In this paper, we introduce an efficient method for the dynamic maintenance of wavelet-based histograms (and other transform-based histograms). Previous work has shown that wavelet-based histograms provide more accurate selectivity estimation than traditional histograms, such as equi-depth histograms. But since wavelet-based histograms are built by a nontrivial mathematical procedure, namely, wavelet transform decomposition, it is hard to maintain the accuracy of the histogram when the underlying data distribution changes over time. In particular, simple techniques, such as split and merge, which work well for equi-depth histograms, and updating a fixed set of wavelet coefficients, are not suitable here.
We propose a novel approach based upon probabilistic counting and sampling to maintain wavelet-based histograms with very little online time and space costs. The accuracy of our method is robust to changing data distributions, and we get a considerable improvement over previous methods for updating transform-based histograms. A very nice feature of our method is that it can be extended naturally to maintain multidimensional wavelet-based histograms, while traditional multidimensional histograms can be less accurate and prohibitively expensive to build and maintain.
L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Parr. “XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation,” Proceedings of the 28th International Conference on Very Large Databases (VLDB '02), Hong Kong, China, August 2002.
The extensible mark-up language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions to optimize query execution plans. Selectivity estimation of XML path expressions is usually done based on summary statistics about the structure of the underlying XML repository. All previous methods require an off-line scan of the XML repository to collect the statistics.
In this paper, we propose XPathLearner, a method for estimating selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the statistics using query feedback in an on-line manner and is especially suited to queries in Internet scale applications since the underlying XML repositories are likely to be inaccessible or too large to be scanned entirely. Besides the on-line property, our method also has two other novel features: (a) XPathLearner is workload aware in collecting the statistics and thus can be dramatically more accurate than the more costly off-line method under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change. We show empirically the estimation accuracy of our method using several real data sets.
L. Lim, M. Wang, and J. S. Vitter. “SASH: A Self-Adaptive Histogram Set for Dynamically Changing Workloads,” Proceedings of the 29th International Conference on Very Large Databases (VLDB '03), Berlin, Germany, September 2003.
Most RDBMSs maintain a set of histograms for estimating the selectivities of given queries. These selectivities are typically used for cost-based query optimization. While the problem of building an accurate histogram for a given attribute or attribute set has been well-studied, little attention has been given to the problem of building and tuning a set of histograms collectively for multidimensional queries in a self-managed manner based only on query feedback.
In this paper, we present SASH, a Self-Adaptive Set of Histograms that addresses the problem of building and maintaining a set of histograms. SASH uses a novel two-phase method to automatically build and maintain itself using query feedback information only. In the online tuning phase, the current set of histograms is tuned in response to the estimation error of each query in an online manner. In the restructuring phase, a new and more accurate set of histograms replaces the current set of histograms. The new set of histograms (attribute sets and memory distribution) is found using information from a batch of query feedback. We present experimental results that show the effectiveness and accuracy of our approach.
L. Lim, M. Wang, and J. S. Vitter. “CXHist: An On-line Classification-based Histogram for XML String Selectivity Estimation,” Proceedings of the 31st International Conference on Very Large Databases (VLDB '05), Trondheim, Norway, August-September 2005.
Query optimization in IBM's System RX, the first truly hybrid relational-XML data management system, requires accurate selectivity estimation of path-value pairs, i.e., the number of nodes in the XML tree reachable by a given path with the given text value. Previous techniques have been inadequate, because they have focused mainly on the tag-labeled paths (tree structure) of the XML data. For most real XML data, the number of distinct string values at the leaf nodes is orders of magnitude larger than the set of distinct rooted tag paths. Hence, the real challenge lies in accurate selectivity estimation of the string predicates on the leaf values reachable via a given path.
In this paper, we present CXHist, a novel workload-aware histogram technique that provides accurate selectivity estimation on a broad class of XML string-based queries. CXHist builds a histogram in an on-line manner by grouping queries into buckets using their true selectivity obtained from query feedback. The set of queries associated with each bucket is summarized into feature distributions. These feature distributions mimic a Bayesian classifier that is used to route a query to its associated bucket during selectivity estimation. We show how CXHist can be used for two general types of (path,string) queries: exact match queries and substring match queries. Experiments using a prototype show that CXHist provides accurate selectivity estimation for both exact match queries and substring match queries.