ABSTRACT
An ∈-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of ∈N.
We present a new online algorithm for computing∈-approximate quantile summaries of very large data sequences. The algorithm has a worst-case space requirement of Ο(1÷∈ log(∈N)). This improves upon the previous best result of Ο(1÷∈ log2(∈N)). Moreover, in contrast to earlier deterministic algorithms, our algorithm does not require a priori knowledge of the length of the input sequence.
Finally, the actual space bounds obtained on experimental data are significantly better than the worst case guarantees of our algorithm as well as the observed space requirements of earlier algorithms.
- 1.Rakesh Agrawal and Arun Swami. A one-pass space-efficient algorithm for finding quantiles. Proc. 7th Int. Conf. Management of Data, COMAD, 28-30 December 1995.Google Scholar
- 2.Khaled Alsabti, Sanjay Ranka, and Vineet Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. Proceedings of the 23rd Intl. Conference onVery Large Data Bases, Athens, Greece, 26-29 August 1997, pages 346-355, Los Altos, CA 94022, USA, 1997. Morgan Kaufmann Publishers. Google ScholarDigital Library
- 3.Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. Random sampling for histogram construction: how much is enough? In ACM SIGMOD '98, volume 28, pages 436-447, Seattle, WA, June 1-4, 1998. Google ScholarDigital Library
- 4.Phillip B. Gibbons, Yossi Matias, and Viswanath Poosala. Fast incremental maintenance of approximate histograms. In Proceedings of the 23rd Intl. Conf. Very Large Data Bases, VLDB, pages 466-475. Morgan Kaufmann, 25-27 August 1997. Google ScholarDigital Library
- 5.Michael B. Greenwald. Practical algorithms for self scaling histograms or better than average data collection. Performance Evaluation, 27&28:19-40, October 1996. Google ScholarDigital Library
- 6.R. Jain and I. Chlamtac. The P2 algorithm for dynamic calculation of quantile and histograms without storing observations. Communications of the ACM, 28(10):1076-1085, October 1986. Google ScholarDigital Library
- 7.I. Pohl. A minimum storage algorithm for computing the median. IBM Research Report RC 2701, November 1969.Google Scholar
- 8.Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. ACM SIGMOD '98, volume 28, pages 426-435, Seattle, WA, June 1998. Google ScholarDigital Library
- 9.Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In ACM SIGMOD '99, volume 29, pages 251-262. Philadelphia, PA, June 1999. Google ScholarDigital Library
- 10.J. I. Munro and M.S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, vol. 12: 315-323; 1980.Google ScholarCross Ref
- 11.M.S. Paterson. Progress in selection. Technical Report, University ofWarwick, Coventry, UK, 1997.Google Scholar
- 12.Viswanath Poosala, Venkatesh Ganti, and Yannis E. Ioannidis. Approximate query answering using histograms. Bulletin of the IEEE Technical Committee on Data Engineering, 22(4):6-15, December 1999.Google Scholar
- 13.Viswanath Poosala, Peter J. Haas, Yannis E. Ioannidis, and Eugene J. Shekita. Improved histograms for selectivity estimation of range predicates. In ACM SIGMOD 96, volume 26, pages 294-305, Montreal, Quebec, Canada, June 4-6, 1996. Google ScholarDigital Library
Index Terms
- Space-efficient online computation of quantile summaries
Recommendations
Space-efficient online computation of quantile summaries
An ∈-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of ∈N.
We present a new online algorithm for computing∈-approximate quantile summaries of very ...
Super-linear time-space tradeoff lower bounds for randomized computation
FOCS '00: Proceedings of the 41st Annual Symposium on Foundations of Computer ScienceWe prove the first time-space lower bound tradeoffs for randomized computation of decision problems. The bounds hold even in the case that the computation is allowed to have arbitrary probability of error on a small fraction of inputs. Our techniques ...
Comments