A good introduction on external memory algorithms and data structures is my book on the subject.

- A. Aggarwal and J. S. Vitter.
“The Input/Output Complexity of Sorting and Related
Problems,”
*Communications of the ACM*,**31**(9), September 1988, 1116-1127. A shorter version appears in “The I/O Complexity of Sorting and Related Problems,”*Proceedings of the 14th Annual International Colloquium on Automata, Languages, and Programming (ICALP '87)*, Karlsruhe, West Germany, July 1987, published in Lecture Notes in Computer Science,**267**, Springer-Verlag, Berlin.We provide tight upper and lower bounds, up to a constant factor, for the number of inputs and outputs (I/Os) between internal memory and secondary storage required for five sorting-related problems: sorting, the fast Fourier transform (FFT), permutation networks, permuting, and matrix transposition. The bounds hold both in the worst case and in the average case, and in several situations the constant factors match.

Secondary storage is modeled as a magnetic disk capable of transferring blocks each containing records in a single time unit; the records in each block must be input from or output to contiguous locations on the disk. We give two optimal algorithms for the problems, which are variants of merge sorting and distribution sorting. In particular we show for that the standard merge sorting algorithm is an optimal external sorting method, up to a constant factor in the number of I/Os. Our sorting algorithms use the same number of I/Os as does the permutation phase of key sorting, except when the internal memory size is extremely small, thus affirming the popular adage that key sorting is not faster. We also give a simpler and more direct derivation of Hong and Kung's lower bound for the FFT for the special case .

- M. H. Nodine, D. P. Lopresti and J. S. Vitter.
“I/O Overhead and Parallel VLSI Architectures for Lattice
Computations,”
*IEEE Transactions on Computers*,**40**(7), July 1991, 843-852.In this paper we introduce input/output (I/O) overhead as a complexity measure for VLSI implementations of two-dimensional lattice computations of the type arising in the simulation of physical systems. We show by pebbling arguments that when there are processing elements available. If the results are required to be observed at every generation, and no on-chip storage is allowed, we show the lower bound is the constant 2. We then examine four VLSI architectures and show that one of them, the multi-generation sweep architecture, also has I/O overhead proportional to . We compare the constants of proportionality between the lower bound and the architecture. Finally, we prove a closed-form for the discrete minimization equation giving the optimal number of generations to compute for the multi-generation sweep architecture.

- J. S. Vitter and E. A. M. Shriver.
“Algorithms for Parallel
Memory I: Two-Level Memories,”
double special issue on large-scale
memories in
*Algorithmica*,**12**(2-3), 1994, 110-147. A shorter version appears in “Optimal Disk I/O with Parallel Block Transfer,”*Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (STOC '90)*, Baltimore, MD, May 1990, 159-169.We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the secondary storage devices can simultaneously transfer a contiguous block of records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into separate physical devices. Our algorithms' performance can be significantly better than those obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than times the optimal number of I/Os is exponentially small in , where is the internal memory size.

- J. S. Vitter and E. A. M. Shriver.
“Algorithms for Parallel
Memory II: Hierarchical Multilevel Memories,”
double special issue on
large-scale memories in
*Algorithmica*,**12**(2-3), 1994, 148-169. A shorter version appears in “Optimal Disk I/O with Parallel Block Transfer,”*Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (STOC '90)*, Baltimore, MD, May 1990, 159-169.In this paper we introduce parallel versions of two hierarchical memory models and give optimal algorithms in these models for sorting, FFT, and matrix multiplication. In our parallel models, there are memory hierarchies operating simultaneously; communication among the hierarchies takes place at a base memory level. Our optimal sorting algorithm is randomized and is based upon the probabilistic partitioning technique developed in the companion paper for optimal disk sorting in a two-level memory with parallel block transfer. The probability of using times the optimal running time is exponentially small in .

- M. H. Nodine and J. S. Vitter.
“Greed Sort: An Optimal Sorting Algorithm on
Parallel Disks,”
*Journal of the ACM*,**42**(4), July 1995, 919-933. An extended abstract appears in “Large-Scale Sorting in Parallel Memories,”*Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '91)*, Hilton Head, SC, July 1991, 29-39.We present an algorithm for sorting efficiently with parallel two-level memories. Our main result is an elegant, easy-to-implement, optimal, detemzinistic algorithm for external sorting with disk drives. This result answers in the affirmative the open problem posed by Vitter and Shriver of whether an optimal algorithm exists that is deterministic. Our measure of performance is the number of parallel input/output (I/0) operations, in which each of the disks can simultaneously transfer a block of contiguous records. We assume that internal memory can hold records. Our algorithm sorts records in the optimal bound of deterministically, and thus it improves upon Vitter and Shriver’s optimal randomized algorithm as well as the well-known deterministic but nonoptimal technique of disk striping. It is also practical to implement.

- J. S. Vitter and M. H. Nodine.
“Large-Scale Sorting in Uniform Memory
Hierarchies,”
special issue on parallel I/O systems
in
*Journal of Parallel and Distributed Computing*, 17, January 1993, 107-114. A shorter version appears in “Large-Scale Sorting in Parallel Memories,”*Proceedings of the 3rd Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '91)*, Hilton Head, SC, July 1991, 29-39.We present several efficient algorithms for sorting on the uniform memory hierarchy (UMH), introduced by Alpern, Carter, and Feig, and its parallelization P-UMH. We give optimal and nearly-optimal algorithms for a wide range of bandwidth degradations, including a parsimonious algorithm for constant bandwidth. We also develop optimal sorting algorithms for all bandwidths for other versions of UMH and P-UMH, including natural restrictions we introduce called RUMH and P-RUMH, which more closely correspond to current programming languages.

- M. H. Nodine and J. S. Vitter.
“Optimal Deterministic Sorting on
Parallel Disks.”
A shorter version appears in
*Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '93)*, Velen, Germany, June-July 1993, 120-129.We present a load balancing technique that leads to an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks. Our measure of performance is the number of input/output (I/O) operations. In each I/O, each of the disks can simultaneously transfer a block of data. Our algorithm improves upon the randomized optimal algorithm of Vitter and Shriver as well as the (non-optimal) commonly-used technique of disk striping. It also improves upon our earlier merge-based sorting algorithm in that it has smaller constants hidden in the big-oh notation, and it is possible to implement using only striped writes (but independent reads). In a companion paper, we show how to modify the algorithm to achieve optimal CPU time, even on parallel processors and parallel memory hierarchies.

- M. H. Nodine and J. S. Vitter.
“Optimal Deterministic Sorting on
Parallel Processors and Parallel Memory Hierarchies,”
A shorter version appears in
*Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '93)*, Velen, Germany, June-July 1993, 120-129.We present a practical deterministic load balancing strategy for distribution sort that is applicable to parallel disks and parallel memory hierarchies with both single and parallel processors. The simplest application of the strategy is an optimal deterministic algorithm called Balance Sort for external sorting on multiple disks with a single CPU, as described in the companion paper. However, the internal processing of Balance Sort does not seem parallelizable. In this paper, we develop an elegant variation that achieves full parallel speedup. The algorithms so derived are optimal for all parallel memory hierarchies with any type of a PRAM base-level interconnection and are either optimal or best-known for a hypercube interconnection. We show how to achieve optimal internal processing time as well as optimal number of I/Os in parallel two-level memories.

- J. S. Vitter.
“Communication Issues in Large-Scale Geometric Computation,”
*ACM Computing Surveys*,**28**(4es), December 1996.Large-scale problems involving geometric data arise in numerous settings, and severe communication bottlenecks can arise in solving them. Work is needed in the development of I/O-efficient algorithms, as well as those that effectively utilize hierarchical memory. In order for new algorithms to be implemented efficiently in practice, the machines they run on must support fundamental external-memory operations. We discuss several advantages offered by TPIE (Transparent Parallel I/O Programming Environment) to enable I/O-efficient implementations.

General Terms: Algorithms, Design, Languages, Performance, Theory. Additional Key Words and Phrases: computational geometry, I/O, external memory, secondary memory, communication, disk drive, parallel disks.

- M. T. Goodrich, J.-J. Tsay, D. E. Vengroff, and J. S. Vitter.
“External-Memory Computational Geometry,”
*Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science (FOCS '93)*, Palo Alto, CA, November 1993.In this paper, we give new techniques for designing efficient algorithms for computational geometry problems that are too large to be solved in internal memory, and we use these techniques to develop optimal and practical algorithms for a number of important large-scale problems in computational geometry. Our algorithms are optimal for a wide range of two-level and hierarchical multilevel memory models, including parallel models. The algorithms are optimal in terms of both I/O cost and internal computation.

Our results are built on four fundamental techniques:

*distribution sweeping*, a generic method for externalizing plane-sweep algorithms;*persistent B-trees*, for which we have both on-line and off-line methods;*batch filtering*, a general method for performing simultaneous external-memory searches in any data structure that can be modeled as a planar layered dag; and*external marriage-before-conquest*, an external-memory analog of the well-known technique of Kirkpatrick and Seidel. Using these techniques we are able to solve a very large number of problems in computational geometry, including batched range queries, 2-d and 3-d convex hull construction, planar point location, range queries, finding all nearest neighbors for a set of planar points, rectangle intersection/union reporting, computing the visibility of segments from a point, performing ray-shooting queries in constructive solid geometry (CSG) models, as well as several geometric dominance problems.These results are significant because large-scale problems involving geometric data are ubiquitous in spatial databases, geographic information systems (GIS), constraint logic programming, object oriented databases, statistics, virtual reality systems, and graphics. This work makes a big step, both theoretically and in practice, towards the effective management and manipulation of geometric data in external memory, which is an essential component of these applications.

- M. H. Nodine, M. T. Goodrich, and J. S. Vitter.
“Blocking for External
Graph Searching,”
*Algorithmica*,**16**(2), August 1996, 181-214. A shorter version appears in*Proceedings of the 12th Annual ACM Symposium on Principles of Database Systems (PODS '93)*, Washington, D. C, May 1993.In this paper, we consider the problem of using disk blocks efficiently in searching graphs that are too large to fit in internal memory. Our model allows a vertex to be represented any number of times on the disk in order to take advantage of redundancy. We give matching upper and lower bounds for complete -ary trees and -dimensional grid graphs, as well as for classes of general graphs that intuitively speaking have a close to uniform number of neighbors around each vertex.

- P. C. Kanellakis, S. Ramaswamy, D. E. Vengroff, and J. S. Vitter.
“Indexing for Data Models with Constraints and Classes,”
*Journal of Computer and System Sciences*,**52**(3), 1996, 589-612. A shorter version appeared in*Proceedings of the 12th Annual ACM Symposium on Principles of Database Systems (PODS '93)*, Washington, D. C, May 1993.We examine I/O-efficient data structures that provide indexing support for new data models. The database languages of these models include concepts from constraint programming (e.g., relational tuples are generalized to conjunctions of constraints) and from object-oriented programming (e.g., objects are organized in class hierarchies). Let be the size of the database, the number of classes, the page size on secondary storage, and the size of the output of a query. (1) Indexing by one attribute in many constraint data models is equivalent to external dynamic interval management, which is a special case of external dynamic 2-dimensional range searching. We present a semi-dynamic data structure for this problem that has worst-case space pages, query I/O time and amortized insert I/O time. Note that, for the static version of this problem, this is the first worst-case optimal solution. (2) Indexing by one attribute and by class name in an object-oriented model, where objects are organized as a forest hierarchy of classes, is also a special case of external dynamic 2-dimensional range searching. Based on this observation, we first identify a simple algorithm with good worst-case performance, query I/O time , update I/O time and space pages for the class indexing problem. Using the forest structure of the class hierarchy and techniques from the constraint indexing problem, we improve its query I/O time to .

- Y.-J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter.
“External-Memory Graph
Algorithms,”
*Proceedings of the 6th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '95)*, San Francisco, CA, January 1995.We present a collection of new techniques for designing and analyzing efficient external-memory algorithms for graph problems and illustrate how these techniques can be applied to a wide variety of specific problems. Our results include:

- Proximate-neighboring. We present a simple method for deriving external-memory lower bounds via reductions from a problem we call the “proximate neighbors” problem. We use this technique to derive non-trivial lower bounds for such problems as list ranking, expression tree evaluation, and connected components.
- PRAM simulation. We give methods for efficiently simulating PRAM computations in external memory, even for some cases in which the PRAM algorithm is not work-optimal. We apply this to derive a number of optimal (and simple) external-memory graph algorithms.
- Time-forward processing. We present a general technique for evaluating circuits (or “circuit-like” computations) in external memory. We also use this in a deterministic list ranking algorithm.
- Deterministic 3-coloring of a cycle. We give several optimal methods for 3-coloring a cycle, which can be used as a subroutine for finding large independent sets for list ranking. Our ideas go beyond a straightforward PRAM simulation, and may be of independent interest.
- External depth-first search. We discuss a method for performing depth first search and solving related problems efficiently in external memory. Our technique can be used in conjunction with ideas due to Ullman and Yannakakis in order to solve graph problems involving closed semi-ring computations even when their assumption that vertices fit in main memory does not hold.

Our techniques apply to a number of problems, including list ranking, which we discuss in detail, finding Euler tours, expression-tree evaluation, centroid decomposition of a tree, least-common ancestors, minimum spanning tree verification, connected and biconnected components, minimum spanning forest, ear decomposition, topological sorting, reachability, graph drawing, and visibility representation.

- L. Arge, D. E. Vengroff, and J. S. Vitter.
“External-Memory Algorithms for Processing Line Segments in Geographic
Information Systems,”
*Algorithmica*,**47**(1), January 2007, 1-25. A shorter version appears in*Proceedings of the 3rd Annual European Symposium on Algorithms (ESA '95)*, September 1995, published in Lecture Notes in Computer Science,**979**, Springer-Verlag, Berlin, 295-310.In the design of algorithms for large-scale applications it is essential to consider the problem of minimizing I/O communication. Geographical information systems (GIS) are good examples of such large-scale applications as they frequently handle huge amounts of spatial data. In this paper we develop efficient new external-memory algorithms for a number of important problems involving line segments in the plane, including trapezoid decomposition, batched planar point location, triangulation, red-blue line segment intersection reporting, and general line segment intersection reporting. In GIS systems, the first three problems are useful for rendering and modeling, and the latter two are frequently used for overlaying maps and extracting information from them.

To solve these problems, we combine and modify in novel ways several of the previously known techniques for designing efficient algorithms for external memory. We also develop a powerful new technique that can be regarded as a practical external memory version of fractional cascading. Except for the batched planar point location problem, no algorithms specifically designed for external memory were previously known for these problems. Our algorithms for triangulation and line segment intersection partially answer previously posed open problems, while the batched planar point location algorithm improves on the previously known solution, which applied only to monotone decompositions. Our algorithm for the red-blue line segment intersection problem is provably optimal.

- TPIE: Transparent Parallel I/O Programming Environment.
The TPIE software project, initially begun by Darren Vengroff as part of
his PhD dissertation work, is being carried on at Duke
University and Aarhus University.
TPIE provides a high-level implementation platform for applications
that require efficient external memory access. The manual and
distribution information can be found on the web site.
- D. E. Vengroff and J. S. Vitter.
“I/O-Efficient Scientific Computation using TPIE,”
*Proceedings of the Goddard Conference on Mass Storage Systems and Technologies*, College Park, MD, September 1996, published in NASA Conference Publication 3340, Volume II, 553-570.In recent years, I/O-efficient algorithms for a wide variety of problems have appeared in the literature. Thus far, however, systems specifically designed to assist programmers in implementing such algorithms have remained scarce. TPIE is a system designed to fill this void. It supports I/O-efficient paradigms for problems from a variety of domains, including computational geometry, graph algorithms, and scientific computation. The TPIE interface frees programmers from having to deal not only of explicit read and write calls, but also the complex memory management that must be performed for I/O-efficient computation.

In this paper, we discuss applications of TPIE to problems in scientific computation. We discuss algorithmic issues underlying the design and implementation of the relevant components of TPIE and present performance results of programs written to solve a series of benchmark problems using our current TPIE prototype. Some of the benchmarks we present are based on the NAS parallel benchmarks, while others are of our own creation.

We demonstrate that the CPU overhead required to manage I/O is small and that even with just a single disk the I/O overhead of I/O-efficient computation ranges from negligible to the same order of magnitude as CPU time. We conjecture that if we use a number of disks in parallel this overhead can be all but eliminated.

- D. E. Vengroff, and J. S. Vitter.
“Efficient 3-D Range Searching in
External Memory,”
*Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC '96)*, Philadelphia, PA, May 1996.We present a new approach to designing data structures for the important problem of external-memory range searching in two and three dimensions. We construct data structures for answering range queries in I/O operations, where is the number of points in the data structure, is the I/O block size, and is the number of points in the answer to the query. We base our data structures on the novel concept of -approximate boundaries, which are manifolds that partition space into regions based on the output size of queries at points within the space.

Our data structures answer a longstanding open problem by providing three dimensional results comparable to those provided by Sairam and Ramaswamy for the two dimensional case, though completely new techniques are used. Ours is the first 3-D range search data structure that simultaneously achieves both a base- logarithmic search overhead (namely, ) and a fully blocked output component (namely, ). This gives us an overall I/O complexity extremely close to the well-known lower bound of . The space usage is more than linear by a logarithmic or polylogarithmic factor, depending on type of range search.

- R. D. Barve, E. F. Grove and J. S. Vitter.
“Simple Randomized Mergesort on Parallel Disks,”
special issue on
parallel I/O in
*Parallel Computing*,**23**(4), 1997. A shorter version appears in*Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96)*, Padua, Italy, June 1996, 109-118.We consider the problem of sorting a file of records on the -disk model of parallel I/O in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of contiguous records. In each I/O operation, up to one block can be transferred to or from each of the disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the “right” block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the “right” blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs.

The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values , , and . Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number of parallel disks is small.

- G. Gibson, J. S. Vitter, and J. Wilkes, editors.
“Strategic Directions in Storage I/O Issues in Large-Scale
Computing,”
Strategic Directions in Computing Research,
*ACM Computing Surveys*,**28**(4), December 1996.We discuss the strategic directions and challenges in the management and use of storage systems--those components of computer systems responsible for the storage and retrieval of data. The performance gap between main and secondary memories shows no imminent sign of vanishing, and thus continuing research into storage I/O will be essential to reap the full benefit from the advances occurring in many other areas of computer science. In this report we identify a few strategic research goals and possible thrusts to meet those goals.

- D. E. Vengroff and J. S. Vitter.
“I/O-Efficient Algorithms and Environments,”
*ACM Computing Surveys*,**28**(4es), December 1996.There has recently been much productive work in the algorithms community on techniques for efficient use of external memory in large-scale applications. In order to implement I/O-optimal algorithms efficiently, the machines they run on must support fundamental external-memory operations. Unfortunately, existing file systems generally do not support the necessary semantics or provide useful tools. There are three basic approaches to supporting development of I/O-efficient code: array-oriented systems (such as PASSION and ViC*), access-oriented systems (such as the UNIX file system and Panda), and framework-oriented systems (such as TPIE, a Transparent Parallel I/O Programming Environment). In this position statement, we discuss the advantages and potential of the TPIE approach in enabling I/O-efficient computation.

General Terms: Algorithms, Design, Languages, Performance, Theory. Additional Key Words and Phrases: I/O, external memory, secondary memory, communication, disk drive, parallel disks, sorting.

- L. Arge, P. Ferragina, R. Grossi, and J. S. Vitter.
“On Sorting Strings in External Memory,”
*Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC '97)*, El Paso, TX, May 1997, 540-548, and in “Sequence Sorting in Secondary Storage,”*Proceedings of the 1997 International Conference on Compression and Complexity of Sequences (SEQUENCES '97)*, Positano, Italy, June 1997.In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting strings of total length is . By analogy, in the external memory (or I/O) model, where the internal memory has size and the block transfer size is , it would be natural to guess that the I/O complexity of sorting strings is , but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we show that the I/O complexity of string sorting in this model is , where is the total length of all strings shorter than and is the number of strings longer than . We then consider two more general I/O comparison models in which string breaking is allowed. We obtain improved algorithms and in several cases lower bounds that match their I/O bounds. Finally, we develop more practical algorithms without assuming the comparison model.

- P. K. Agarwal, L. Arge, T. M. Murali, K. R. Varadarajan, and J. S. Vitter.
“I/O-Efficient Algorithms for Contour-line Extraction and
Planar Graph Blocking,”
*Proceedings of the 9th Annual SIAM/ACM Symposium on Discrete Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '98)*, San Francisco, CA, January 1998.For a polyhedral terrain, the contour at -coordinate is defined to be the intersection of the plane with the terrain. In this paper, we study the contour-line extraction problem, where we want to preprocess the terrain into a data structure so that given a query -coordinate , we can report the -contour quickly. This problem is central to geographic information systems (GIS), where terrains are often stored as Triangular Irregular Networks (TINs). We present an I/O-optimal algorithm for this problem which stores a terrain with vertices using blocks, where is the size of a disk block, so that for any query , the -contour can be computed using I/O operations, where denotes the size of the -contour.

We also present an improved algorithm for a more general problem of blocking bounded-degree planar graphs such as TINs (i.e., storing them on disk so that any graph traversal algorithm can traverse the graph in an I/O-efficient manner). We apply it to two problems that arise in GIS.

- L. A. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter.
“Theory and Practice of I/O-Efficient Algorithms
for Multidimensional Batched Searching Problems,”
*Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '98)*, San Francisco, CA, January 1998.We describe a powerful framework for designing efficient batch algorithms for certain large-scale dynamic problems that must be solved using external memory. The class of problems we consider, which we call

*colorable external-decomposable problems*, include rectangle intersection, orthogonal line segment intersection, range searching, and point location. We are particularly interested in these problems in two and higher dimensions. They have numerous applications in geographic information systems (GIS), spatial databases, and VLSI and CAD design. We present simplified algorithms for problems previously solved by more complicated approaches (such as rectangle intersection), and we present efficient algorithms for problems not previously solved in an efficient way (such as point location and higher-dimensional versions of range searching and rectangle intersection).We give experimental results concerning the running time for our approach applied to the red-blue rectangle intersection problem, which is a key component of the extremely important database operation spatial join. Our algorithm scales well with the problem size, and for large problems sizes it greatly outperforms the well-known sweepline approach.

- L. A. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, and J. S. Vitter.
“Scalable Sweep-Based Spatial Join,”
*Proceedings of the 1998 International Conference on Very Large Databases (VLDB '98)*, New York, August 1998, 570-581. This work forms the basis of two patent applications currently filed and pending with the patent office.In this paper, we examine the spatial join problem. In particular, we focus on the case when neither of the inputs is indexed. We present a new algorithm, Scalable Sweep-based Spatial Join (SSSJ), that is based on the distribution-sweeping technique recently proposed in computational geometry, and that is the first to achieve theoretically optimal bounds on internal computation time as well as I/O transfers. We present experimental results based on an efficient implementation of the SSSJ algorithm, and compare it to the state-of-the-art Partition-Based Spatial-Merge (PBSM) algorithm of Patel and DeWitt.

Our SSSJ algorithm performs an initial sorting step along the vertical axis, after which we use the distribution-sweeping technique to partition the input into a number of vertical strips, such that the data in each strip can be efficiently processed by an internal-memory sweepline algorithm. A key observation that allowed us to greatly improve the practical performance of our algorithm is that in most sweepline algorithms not all input data is needed in main memory at the same time. In our initial experiments, we observed that on real-life two-dimensional spatial data sets of size , the internal-memory sweepline algorithm requires only memory space. This behavior (also known as the square-root rule in the VLSI literature) implies that for real-life two-dimensional data sets, we can bypass the vertical partitioning step and directly perform the sweepline algorithm after the initial external sorting step. We implemented SSSJ such that partitioning is only done when it is detected that that the sweepline algorithm exhausts the internal memory. This results in an algorithm that not only is extremely efficient for real-life data but also offers guaranteed worst-case bounds and predictable behavior on skewed and/or bad input data: Our experiments show that SSSJ performs at least better than PBSM on real-life data sets, and that it robustly handles skewed data on which PBSM suffers a serious performance degeneration.

As part of our experimental work we experimented with a number of different techniques for performing the internal sweepline. By using an efficient partitioning heuristic, we were able to speed up the internal sweeping used by PBSM by a factor of over on the average for real-life data sets. The resulting improved PBSM then performs approximately better than SSSJ on the real-life data we used, and it is thus a good choice of algorithm when the data is known not to be too skewed.

- R. D. Barve, M. Kallahalla, P. J. Varman, and J. S. Vitter.
“Competitive Analysis of Buffer Management Algorithms
for Parallel I/O Systems,”
*Journal of Algorithms*,**36**, August 2000. A shorter version appeared as “Competitive Parallel Disk Prefetching and Buffer Management,”*Proceedings of the Fifth Annual Workshop on I/O in Parallel and Distributed Systems (IOPADS), 1998*, San Jose, California.We provide a competitive analysis framework for online prefetching and buffer management algorithms in parallel I/O systems, using a read-once model of block references. This has widespread applicability to key I/O-bound applications such as external merging and concurrent playback of multiple video streams. Two realistic lookahead models, global lookahead and local lookahead, are defined. Algorithms NOM and GREED based on these two forms of lookahead are analyzed for shared buffer and distributed buffer configurations, both of which occur frequently in existing systems. An important aspect of our work is that we show how to implement both the models of lookahead in practice using the simple techniques of forecasting and flushing.

Given a -disk parallel I/O system and a globally shared I/O buffer that can hold up to disk blocks, we derive a lower bound of on the competitive ratio of

*any*deterministic online prefetching algorithm with lookahead. NOM is shown to match the lower bound using global -block lookahead. In contrast, using only local lookahead results in an competitive ratio. When the buffer is distributed into portions of blocks each, the algorithm GREED based on local lookahead is shown to be optimal, and NOM is within a constant factor of optimal. Thus we provide a theoretical basis for the intuition that global lookahead is more valuable for prefetching in the case of a shared buffer configuration whereas it is enough to provide local lookahead in case of the distributed configuration. Finally, we analyze the performance of these algorithms for reference strings generated by a uniformly-random stochastic process and we show that they achieve the minimal expected number of I/Os. These results also give bounds on the worst-case expected performance of algorithms which employ randomization in the data layout. - R. D. Barve, E. A. M. Shriver, P. Gibbons, B. Hillyer, Y. Matias, and J. S. Vitter.
“Modeling and Optimizing I/O
Throughput of Multiple Disks on a Bus,”
*Proceedings of the Joint International ACM Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '99)*, Atlanta, GA, May 1999, 83-92. An abstract appears in*Proceedings of the Joint ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98/PERFORMANCE '98)*, Madison, WI, June 1998, 264-265. Our modeling and optimization work forms the basis of two patent applications currently filed and pending with the patent office.Slides for talk (gzip-compressed postscript)

For a wide variety of computational tasks, disk I/O continues to be a serious obstacle to high performance. To meet demanding I/O requirements, systems are designed to use multiple disk drives that share one or more I/O ports to form a disk farm or RAID array. The focus of the present paper is on systems that use multiple disks per SCSI bus. We measured the performance of concurrent random I/Os for three types of SCSI disk drives and three types of computers. The measurements enable us to study bus-related phenomena that impair performance. We describe these phenomena, and present a new I/O performance model that incorporates bus effects to predict the average throughput achieved by concurrent random I/Os that share a SCSI bus. This model, although relatively simple, predicts performance on these platforms to within 11% for fixed I/O sizes in the range 16-128 KB/s. We then describe a technique to improve the I/O throughput. This technique increases the percentage of disk head positioning time that is overlapped with data transfers, and increases the percentage of transfers that occur at bus bandwidth, rather than at disk-head bandwidth. Our technique is most effective for large I/Os and high concurrency--an important performance region for large-scale computing--our improvements are 10-20% better than the naive method for random workloads.

- J. S. Vitter, M. Wang, and B. Iyer.
“Data Cube Approximation and Histograms via Wavelets,”
*Proceedings of Seventh International Conference on Information and Knowledge Management (CIKM'98)*, Washington D.C., November 1998.There has recently been an explosion of interest in the analysis of data in data warehouses in the field of On-Line Analytical Processing (OLAP). Data warehouses can be extremely large, yet obtaining quick answers to queries is important. In many situations, obtaining the exact answer to an OLAP query is prohibitively expensive in terms of time and/or storage space. It can be advantageous to have fast, approximate answers to queries.

In this paper, we present an I/O-efficient technique based upon a multiresolution wavelet decomposition that yields an approximate and space-efficient representation of the data cube, which is one of the core OLAP operators. We build our compact data cube on the logarithms of the partial sums of the raw data values of a multidimensional array. We get excellent approximations for on-line range-sum queries with limited space usage and computational cost. Multiple data cubes can be handled simultaneously. Each query can generally be answered, depending upon the accuracy supported, in one I/O or a small number of I/Os. Experiments show that our method performs significantly better than other approximation techniques such as histograms and random sampling.

- P. K. Agarwal, L. A. Arge, J. Erickson, P. G. Franciosa, and J. S. Vitter.
“Efficient Searching with Linear Constraints,”
*Journal of Computer and System Sciences*,**61**(2), October 2000, 194-216. An extended abstract appears in*Proceedings of the 17th Annual ACM Symposium on Principles of Database Systems (PODS '98)*, Seattle, WA, June 1998, 169-178.We show how to preprocess a set of points in -dimensional Euclidean space to get an external memory data structure that efficiently supports linear-constraint queries. Each query is in the form of a linear constraint ; the data structure must report all the points of that satisfy the query. (This problem is called halfspace range searching in the computational geometry literature.) Our goal is to minimize the number of disk blocks required to store the data structure and the number of disk accesses (I/Os) required to answer a query. For , we present the first near-linear size data structures that can answer linear-constraint queries using an optimal number of I/Os. We also present a linear-size data structure that can answer queries efficiently in the worst case. We combine these two approaches to obtain tradeoffs between space and query time. Finally, we show that some of our techniques extend to higher dimensions.

- Y. Matias, J. S. Vitter, and M. Wang.
“Wavelet-Based Histograms for Selectivity Estimation,”
*Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98)*, Seattle, Washington, June 1998.Query optimization is an integral part of relational database management systems. One important task in query optimization is selectivity estimation, that is, given a query , we need to estimate the fraction of records in the database that satisfy . Many commercial database systems maintain histograms to approximate the frequency distribution of values in the attributes of relations.

In this paper, we present a technique based upon a multiresolution wavelet decomposition for building histograms on the underlying data distributions, with applications to databases, statistics, and simulation. Histograms built on the cumulative data distributions give very good approximations with limited space usage. We give fast algorithms for constructing histograms and using them in an on-line fashion for selectivity estimation. Our histograms also provide quick approximate answers to OLAP queries when the exact answers are not required. Our method captures the joint distribution of multiple attributes effectively, even when the attributes are correlated. Experiments confirm that our histograms offer substantial improvements in accuracy over random sampling and other previous approaches.

- J. S. Vitter.
“External Memory Algorithms and Data Structures:
Dealing with Massive Data,”
*ACM Computing Surveys*,**33**(2), June 2001, 209-271. Shorter versions appear in “External Memory Algorithms,” invited keynote paper in*Proceedings of the 6th Annual European Symposium on Algorithms (ESA '98)*, Venice, August 1998, published in Lecture Notes in Computer Science,**1461**, Springer, Berlin, Germany, 1-25, and in an invited tutorial in*Proceedings of the 17th Annual ACM Symposium on Principles of Database Systems (PODS '98)*, Seattle, WA, June 1998, 119-128.Slides for a talk (Adobe pdf format)

This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.

- L. Arge, K. H. Hinrichs, J. Vahrenhold, and J. S. Vitter.
“Efficient Bulk Operations on Dynamic R-trees,”
special issue on experimental algorithmics in
*Algorithmica*,**33**(1), May 2002, 104-128. A shorter version appears in*Proceedings of the 1st Workshop on Algorithm Engineering and Experimentation (ALENEX '99)*, Baltimore, MD, January 1999.In recent years there has been an upsurge of interest in spatial databases. A major issue is how to efficiently manipulate massive amounts of spatial data stored on disk in multidimensional

*spatial indexes*(data structures). Construction of spatial indexes (*bulk loading*) has been researched intensively in the database community. The continuous arrival of massive amounts of new data make it important to efficiently update existing indexes (*bulk updating*).In this article we present a simple technique for performing bulk update and query operations on multidimensional indexes. We present our technique in terms of the so-called R-tree and its variants, as they have emerged as practically efficient indexing methods for spatial data. Our method uses ideas from the buffer tree lazy buffering technique and fully utilizes the available internal memory and the page size of the operating system. We give a theoretical analysis of our technique, showing that it is efficient both in terms of I/O communication, disk storage, and internal computation time. We also present the results of an extensive set of experiments showing that in practice our approach performs better than the previously best known bulk update methods with respect to update time, and that it produces a better quality index in terms of query performance. One important novel feature of our technique is that in most cases it allows us to perform a batch of updates and queries simultaneously. To be able to do so is essential in environments where queries have to be answered even while the index is being updated and reorganized.

- P. K. Agarwal, L. Arge, G. Brodal, and J. S. Vitter.
“I/O-Efficient Dynamic Point Location in Monotone
Subdivisions,”
*Proceedings of the 10th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '99)*, Baltimore, MD, January 1999.We present an efficient external-memory dynamic data structure for point location in monotone planar subdivisions. Our data structure uses disk blocks to store a monotone subdivision of size , where is the size of a disk block. It supports queries in I/Os (worst-case) and updates in I/Os (amortized).

We also propose a new variant of -trees, called level-balanced -trees, which allow insert, delete, merge, and split operations in I/Os (amortized), , even if each node stores a pointer to its parent. Here is the size of main memory. Besides being essential to our point-location data structure, we believe that

*level-balanced B-trees*are of significant independent interest. They can, for example, be used to dynamically maintain a planar st-graph using I/Os (amortized) per update, so that reachability queries can be answered in I/Os (worst case). - J. S. Vitter and M. Wang.
“Approximate Computation of
Multidimensional Aggregates of Sparse Data Using Wavelets,”
*Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99)*, Philadelphia, PA, June 1999.Computing multidimensional aggregates in high dimensions is a performance bottleneck for many OLAP applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. It is advantageous to have fast, approximate answers to OLAP aggregation queries.

In this paper, we present a novel method that provides approximate answers to high-dimensional OLAP aggregation queries in massive sparse data sets in a time-efficient and space-efficient manner. We construct a compact data cube, which is an approximate and space-efficient representation of the underlying multidimensional array, based upon a multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.

We present two I/O-efficient algorithms to construct the compact data cube for the important case of sparse high-dimensional arrays, which often arise in practice. The traditional histogram methods are infeasible for the massive high-dimensional data sets in OLAP applications. Previously developed wavelet techniques are efficient only for dense data. Our on-line query processing algorithm is very fast and capable of refining answers as the user demands more accuracy. Experiments on real data show that our method provides significantly more accurate results for typical OLAP aggregation queries than other efficient approximation techniques such as random sampling.

- L. Arge, V. Samoladas, and J. S. Vitter.
“On Two-Dimensional Indexability and Optimal Range Search
Indexing,”
*Proceedings of the 18th Annual ACM Symposium on Principles of Database Systems (PODS '99)*, Philadelphia, PA, May-June 1999.In this paper we settle several longstanding open problems in theory of indexability and external orthogonal range searching. In the first part of the paper, we apply the theory of indexability to the problem of two-dimensional range searching. We show that the special case of 3-sided querying can be solved with constant redundancy and access overhead. From this, we derive indexing schemes for general 4-sided range queries that exhibit an optimal tradeoff between redundancy and access overhead.

In the second part of the paper, we develop dynamic external memory data structures for the two query types. Our structure for 3-sided queries occupies disk blocks, and it supports insertions and deletions in I/Os and queries in I/Os, where is the disk block size, is the number of points, and is the query output size. These bounds are optimal. Our structure for general (4-sided) range searching occupies disk blocks and answers queries in I/Os, which are optimal. It also supports updates in I/Os.

- R. D. Barve and J. S. Vitter.
“A Simple and Efficient Parallel Disk
Mergesort,”
*Theory of Computing Systems*,**35**(2), March/April 2002, 189-215. A shorter version appears in*Proceedings of the 11th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '99)*, St. Malo, France, June 1999.Slides for talk plus extra foils on dynamic memory allocation (gzip-compressed postscript)

External sorting is a fundamental operation in many large scale data processing systems not only for producing sorted output but also as a core subroutine in many operations. Technology trends indicate that developing techniques that effectively use multiple disks in parallel in order to speed up the performance of external sorting is of prime importance. The simple randomized merging (SRM) mergesort algorithm proposed in our earlier work is the first parallel disk sorting algorithm that requires a provably optimal number of passes and that is fast in practice. Knuth (in the new edition of

*The Art of Computer Programming*, Vol. 3:*Sorting and Searching*) recently identified SRM (which he calls “randomized striping”) as the method of choice for sorting with parallel disks.In this paper, we present an efficient implementation of SRM, based upon novel data structures. We give a new implementation for SRM's lookahead forecasting technique for parallel prefetching and its forecast and flush technique for buffer management. Our techniques amount to a significant improvement in the way SRM carries out the parallel, independent disk accesses necessary to efficiently read blocks of input runs during external merging.

We present the performance of SRM over a wide range of input sizes and compare its performance with that of disk-striped mergesort (DSM), the commonly used technique to implement external mergesort on parallel disks. DSM consists of using a standard mergesort algorithm in conjunction with striped I/O for parallel disk access. SRM merges together significantly more runs at a time compared with DSM, and thus it requires fewer merge passes. We demonstrate in practical scenarios that even though the streaming speeds for merging with DSM are a little higher than those for SRM (since DSM merges fewer runs at a time), sorting using SRM is significantly faster than with DSM, since SRM requires fewer passes.

The techniques in this paper can be generalized to meet the load-balancing requirements of other applications using parallel disks, including distribution sort, multiway partitioning of a file into several other files. and some potential multimedia streaming applications.

- J. S. Vitter.
“Online Data Structures in External Memory,”
invited keynote paper in
*Proceedings of the 26th Annual International Colloquium on Automata, Languages, and Programming (ICALP '99)*, Prague, Czech Republic, July 1999, published in Lecture Notes in Computer Science,**1644**, Springer-Verlag, Berlin, 119-133.This survey article is superseded by a more comprehensive book. The book is available online and is recommended as the preferable reference.

Slides for ICALP '99 talk (gzip-compressed postscript)

The data sets for many of today's computer applications are too large to fit within the computer's internal memory and must instead be stored on external storage devices such as disks. A major performance bottleneck can be the input/output communication (or I/O) between the external and internal memories. In this paper we discuss a variety of online data structures for external memory, some very old and some very new, such as hashing (for dictionaries), B-trees (for dictionaries and 1-D range search), buffer trees (for batched dynamic problems), interval trees with weight-balanced B-trees (for stabbing queries), priority search trees (for 3-sided 2-D range search), and R-trees and other spatial structures. We also discuss several open problems along the way.

- R. D. Barve and J. S. Vitter.
“External Memory Algorithms with
Dynamically Changing Memory Allocations,”
Report CS-1998-09, June 1998. A shorter version of the paper
appeared as
“A Theoretical Framework for Memory-Adaptive Algorithms,”
*Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (FOCS '99)*, New York, NY, October 1999, 273-284.We consider the problem of devising external memory algorithms whose memory allocations can change dynamically and unpredictably at run-time. The investigation of “memory-adaptive” algorithms, which are designed to adapt to dynamically changing memory allocations, can be considered a natural extension of the investigation of traditional, non-adaptive external memory algorithms. Our study is motivated by high performance database systems and operating systems in which applications are prioritized and internal memory is dynamically allocated in accordance with the priorities. In such situations, external memory applications are expected to perform as well as possible for the current memory allocation. The computation must be reorganized to adapt to the sequence of memory allocations in an online manner.

In this paper we present a simple and natural dynamic memory allocation model. We define memory-adaptive external memory algorithms and specify what is needed for them to be dynamically optimal. Using novel techniques, we design and analyze dynamically optimal memory-adaptive algorithms for the problems of sorting, permuting, FFT, permutation networks, (standard) matrix multiplication and LU decomposition. We also present a dynamically optimal (in an amortized sense) memory-adaptive version of the buffer tree, a generic external memory data structure for a large number of batched dynamic applications. We show that a previously devised approach to memory-adaptive external mergesort is provably nonoptimal because of fundamental drawbacks. The lower bound proof techniques for sorting and matrix multiplication are fundamentally distinct techniques, and they are invoked by most other external memory lower bounds; hence we anticipate that the techniques presented here will apply to many external memory problems.

- L. Arge, O. Procopiuc, S. Ramaswamy, T. Suel, J. Vahrenhold, and J. S. Vitter.
“A Unified Approach for Indexed and Non-Indexed Spatial
Joins,”
*Proceedings of the 7th International Conference on Extending Database Technology (EDBT '00)*, Konstanz, Germany, March 2000.Most spatial join algorithms either assume the existence of a spatial index structure that is traversed during the join process, or solve the problem by sorting, partitioning, or on-the-fly index construction. In this paper, we develop a simple plane-sweeping algorithm that unifies the index-based and non-index based approaches. This algorithm processes indexed as well as non-indexed inputs, extends naturally to multi-way joins, and can be built easily from a few standard operations. We present the results of a comparative study of the new algorithm with several index-based and non-index based spatial join algorithms. We consider a number of factors, including the relative performance of CPU and disk, the quality of the spatial indexes, and the sizes of the input relations. An important conclusion from our work is that using an index-based approach whenever indexes are available does not always lead to the best execution time, and hence we propose the use of a simple cost model to decide when to follow an index-based approach.

- L. Arge, L. Toma, and J. S. Vitter.
“I/O-Efficient Algorithms for Problems on Grid-Based
Terrains”
*ACM Journal of Experimental Algorithmics*,**6**(1), 2001. An extended abstract appears in*Proceedings of the 2nd Workshop on Algorithm Engineering and Experimentation (ALENEX '00)*, San Francisco, CA, January 2000.The potential and use of Geographic Information Systems (GIS) is rapidly increasing due to the increasing availability of massive amounts of geospatial data from projects like NASA's Mission to Planet Earth. However, the use of these massive datasets also exposes scalability problems with existing GIS algorithms. These scalability problems are mainly due to the fact that most GIS algorithms have been designed to minimize internal computation time, while I/O communication often is the bottleneck when processing massive amounts of data.

In this paper, we consider I/O-efficient algorithms for problems on grid-based terrains. Detailed grid-based terrain data is rapidly becoming available for much of the earth's surface. We describe I/O algorithms for several problems on by grids for which only algorithms were previously known. Here denotes the size of the main memory and the size of a disk block.

We demonstrate the practical merits of our work by comparing the empirical performance of our new algorithm for the flow accumulation problem with that of the previously best known algorithm. Flow accumulation, which models flow of water through a terrain, is one of the most basic hydrologic attributes of a terrain. We present the results of an extensive set of experiments on real-life terrain datasets of different sizes and characteristics. Our experiments show that while our new algorithm scales nicely with dataset size, the previously known algorithm “breaks down” once the size of the dataset becomes bigger than the available main memory. For example, while our algorithm computes the flow accumulation for the Appalachian Mountains in about three hours, the previously known algorithm takes several weeks.

- Y. Matias, E. Segal, and J. S. Vitter.
“Efficient Bundle Sorting,”
*SIAM Journal on Computing*,**36**(2), 2006, 394-410. A shorter version appeared in*Proceedings of the 11th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '00)*, San Francisco, CA, January 2000, 839-848.Many data sets to be sorted consist of a limited number of distinct keys. Sorting such data sets can be thought of as bundling together identical keys and having the bundles placed in order; we therefore denote this as

*bundle sorting*. We describe an efficient algorithm for bundle sorting in external memory that requires at most disk accesses, where is the number of keys, is the size of internal memory, is the number of distinct keys, is the transfer block size, and . For moderately sized , this bound circumvents the I/O lower bound known for general sorting. We show that our algorithm is optimal by proving a matching lower bound for bundle sorting. The improved running time of bundle sorting over general sorting can be significant in practice, as demonstrated by experimentation. An important feature of the new algorithm is that it is executed “in-place”, requiring no additional disk space. - A. Natsev, J. R. Smith, Y. C. Chang, C.-S. Li, and J. S. Vitter.
“Constrained Querying of Multimedia Databases: Issues and
Approaches,”
*Proceedings of SPIE Electronic Imaging 2001: Storage and Retrieval for Image and Video Databases*, San Jose, CA, January 2001.This paper investigates the problem of high-level querying of multimedia data by imposing arbitrary domain-specific constraints among multimedia objects. We argue that the current structured query model, and the query-by-content model, are insufficient for many important applications, and we propose an alternative query framework that unifies and extends the previous two models. The proposed framework is based on the querying-by-concept paradigm, where the query is expressed simply in terms of concepts, regardless of the complexity of the underlying multimedia search engines. The query-by-concept paradigm was previously illustrated by the CAMEL system. The present paper builds upon and extends that work by adding arbitrary constraints and multiple levels of hierarchy in the concept representation model.

We consider queries simply as descriptions of virtual data sets, and that allows us to use the same unifying concept representation for query specification, as well as for data annotation purposes. We also identify some key issues and challenges presented by the new framework, and we outline possible approaches for overcoming them. In particular, we study the problems of concept representation, extraction, refinement, storage, and matching.

- A. Natsev, A. Chadha, B. Soetarman, and J. S. Vitter.
“CAMEL: Concept Annotated iMagE Libraries,”
*Proceedings of SPIE Electronic Imaging 2001: Storage and Retrieval for Image and Video Databases*, San Jose, CA, January 2001.The problem of content-based image searching has received considerable attention in the last few years. Thousands of images are now available on the internet, and many important applications require searching of images in domains such as E-commerce, medical imaging, weather prediction, satellite imagery, and so on. Yet, content-based image querying is still largely unestablished as a mainstream field, nor is it widely used by search engines. We believe that two of the major hurdles for this poor acceptance are poor retrieval quality and usability.

In this paper, we introduce the CAMEL system--an acronym for Concept Annotated iMagE Libraries--as an effort to address both of the above problems. The CAMEL system provides and easy-to-use, and yet powerful, text-only query interface, which allows users to search for images based on

*visual concepts*, identified by specifying relevant keywords. Conceptually, CAMEL annotates images with the visual concepts that are relevant to them. In practice, CAMEL defines visual concepts by looking at sample images off-line and extracting their relevant visual features. Once defined, such visual concepts can be used to search for relevant images on the fly, using content-based search methods. The visual concepts are stored in a Concept Library and are represented by an associated set of wavelet features, which in our implementation were extracted by the WALRUS image querying system. Even though the CAMEL framework applies independently of the underlying query engine, for our prototype we have chosen WALRUS as a back-end, due to its ability to extract and query with image region features.CAMEL improves retrieval quality because it allows experts to build very accurate representations of visual concepts that can be used even by novice users. At the same time, CAMEL improves usability by supporting the familiar text-only interface currently used by most search engines on the web. Both improvements represent a departure from traditional approaches to improving image query systems--instead of focusing on

**query execution**, we emphasize**query specification**by allowing simpler and yet more precise query specification. - P. K. Agarwal, L. Arge, O. Procopiuc, and J. S. Vitter.
“A Framework for Index Bulk Loading and Dynamization,”
*Proceedings of the 28th Annual International Colloquium on Automata, Languages, and Programming (ICALP '01)*, Crete, Greece, July 2001, published in Lecture Notes in Computer Science,**2076**, Springer-Verlag, Berlin, Germany.We investigate automated methods for externalizing internal memory data structures. We consider a class of balanced trees that we call weight-balanced partitioning trees (or wp-trees) for indexing a set of points in -dimensional space. Well-known examples of wp-trees include d-trees, BBD-trees, pseudo quad trees, and BAR trees. These trees are defined with fixed degree and are thus suited for internal memory implementations. Given an efficient wp-tree construction algorithm, we present a general framework for automatically obtaining a new dynamic external tree data structure. Using this framework together with a new general construction (bulk loading) technique of independent interest, we obtain data structures with guaranteed good update performance in terms of I/O transfers. Our approach gives considerably improved construction and update I/O bounds of d-trees and BBD trees.

- D. A. Hutchinson, P. Sanders, and J. S. Vitter.
“Duality Between Prefetching and Queued Writing
with Parallel Disks,”
*SIAM Journal on Computing*,**34**(6), 1443-1463, June 2005. An extended abstract appears in*Proceedings of the 9th Annual European Symposium on Algorithms (ESA '01)*, Århus, Denmark, August 2001, published in Lecture Notes in Computer Science,**2161**, Springer-Verlag, Berlin, Germany.Parallel disks promise to be a cost effective means for achieving high bandwidth in applications involving massive data sets, but algorithms for parallel disks can be difficult to devise. To combat this problem, we define a useful and natural duality between writing to parallel disks and the seemingly more difficult problem of prefetching. We first explore this duality for applications involving read-once accesses using parallel disks. We get a simple linear time algorithm for computing optimal prefetch schedules and analyze the efficiency of the resulting schedules for randomly placed data and for arbitrary interleaved accesses to striped sequences. Duality also provides an optimal schedule for the integrated caching and prefetching problem, in which blocks can be accessed multiple times. Another application of this duality gives us the first parallel disk sorting algorithms that are provably optimal up to lower order terms. One of these algorithms is a simple and practical variant of multiway merge sort, addressing a question that has been open for some time.

- A. Natsev, Y. C. Chang, J. R. Smith, C.-S. Li, and J. S. Vitter,
“Supporting Incremental Join Queries on Ranked Inputs,”
*Proceedings of the 27th International Conference on Very Large Databases (VLDB '01)*, Rome, Italy, September 2001.This paper investigates the problem of incremental joins of multiple ranked data sets when the join condition is a list of arbitrary user-defined predicates on the input tuples. This problem arises in many important applications dealing with ordered inputs and multiple ranked data sets, and requiring the top solutions. We use multimedia applications as the motivating examples but the problem is equally applicable to traditional database applications involving optimal resource allocation, scheduling, decision making, ranking, etc.

We propose an algorithm that enables querying of ordered data sets by imposing arbitrary

*user-defined join predicates*. The basic version of the algorithm does not use any random access but a variation can exploit available indexes for efficient random access based on the join predicates. A special case includes the join scenario considered by Fagin for joins based on identical keys, and in that case, our algorithms perform as efficiently as Fagin's. Our main contribution, however, is the generalization to join scenarios that were previously unsupported, including cases where random access in the algorithm is not possible due to lack of unique keys. In addition, can support*multiple join levels*, or nested join hierarchies, which are the norm for modeling multimedia data. We also give -approximation versions of both of the above algorithms. Finally, we give strong optimality results for some of the proposed algorithms, and we study their performance empirically. - A. Natsev, G. Fuh, W. Chen, C.-H. Chiu, and J. S. Vitter.
“Aggregate Predicate Support in DBMS,”
*Proceedings of the 13th Australasian Database Conference (ADC '02)*, Melbourne, Australia, January 2002, published in Conferences in Research and Practice in Information Technology, Vol. 5, Xiaofang Zhou, Ed.In this paper we consider aggregate predicates and their support in database systems. Aggregate predicates are the predicate equivalent to aggregate functions in that they can be used to search for tuples that satisfy some aggregate property over a set of tuples (as opposed to simply computing an aggregate property over a set of tuples). The importance of aggregate predicates is exemplified by many modern applications that require ranked search, or top- queries. Such queries are the norm in multimedia and spatial databases.

In order to support the concept of aggregate predicates in DBMS, we introduce several extensions in the query language and the database engine. Specifically, we extend the SQL syntax to handle aggregate predicates and work out the semantics of such extensions so that they behave correctly in the existing database model. We also propose a a new rk_SORT operator into the database engine, and study relevant indexing and query optimization issues.

Our approach provides several advantages, including enhanced usability and improved performance. By supporting aggregate predicates natively in the database engine, we are able to reuse existing indexing and query optimization techniques, without sacrificing generality or incurring the runtime overhead of database-external approaches. To the best of our knowledge, the proposed framework is the first to support user-defined indexing with aggregate predicates and search based upon user-defined ranking. We also provide empirical results from a simulation study that validates the effectiveness of our approach.

- L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Parr.
“XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path
Selectivity Estimation,”
*Proceedings of the 28th International Conference on Very Large Databases (VLDB '02)*, Hong Kong, China, August 2002.The extensible mark-up language (XML) is gaining widespread use as a format for data exchange and storage on the World Wide Web. Queries over XML data require accurate selectivity estimation of path expressions to optimize query execution plans. Selectivity estimation of XML path expression is usually done based on summary statistics about the structure of the underlying XML repository. All previous methods require an off-line scan of the XML repository to collect the statistics.

In this paper, we propose XPathLearner, a method for estimating selectivity of the most commonly used types of path expressions without looking at the XML data. XPathLearner gathers and refines the statistics using query feedback in an on-line manner and is especially suited to queries in Internet scale applications since the underlying XML repositories are likely to be inaccessible or too large to be scanned entirely. Besides the on-line property, our method also has two other novel features: (a) XPathLearner is workload aware in collecting the statistics and thus can be dramatically more accurate than the more costly off-line method under tight memory constraints, and (b) XPathLearner automatically adjusts the statistics using query feedback when the underlying XML data change. We show empirically the estimation accuracy of our method using several real data sets.

- L. Arge, O. Procopiuc, and J. S. Vitter.
“Implementing I/O-Efficient Data Structures Using
TPIE,”
*Proceedings of the 10th Annual European Symposium on Algorithms (ESA '02)*, Rome, Italy, September 2002, published in Lecture Notes in Computer Science, Springer-Verlag,**2461**, Berlin, Germany, 88-100.In recent years, many theoretically I/O-efficient algorithms and data structures have been developed. The TPIE project at Duke University was started to investigate the practical importance of these theoretical results. The goal of this ongoing project is to provide a portable, extensible, flexible, and easy to use C++ programming environment for efficiently implementing I/O-algorithms and data structures. The TPIE library has been developed in two phases. The first phase focused on supporting algorithms with a sequential I/O pattern, while the recently developed second phase has focused on supporting on-line I/O-efficient data structures, which exhibit a more random I/O pattern. This paper describes the design and implementation of the second phase of TPIE.

- L. Arge and J. S. Vitter.
“Optimal External Memory Interval Management,”
*SIAM Journal on Computing*,**32**(6), 2003, 1488-1508. An extended abstract appears in “Optimal Dynamic Interval Management in External Memory,”*Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science (FOCS '96)*, Burlington, VT, October 1996, 560-569. Also appeared in Abstracts of the 1st CGC Workshop on Computational Geometry, Center for Geometric Computing, Johns Hopkins University, Baltimore, MD, October 1996.We present a space- and I/O-optimal external-memory data structure for answering stabbing queries on a set of dynamically maintained intervals. Our data structure settles an open problem in databases and I/O algorithms by providing the first optimal external-memory solution to the dynamic interval management problem, which is a special case of 2-dimensional range searching and a central problem for object-oriented and temporal databases and for constraint logic programming. Our data structure simultaneously uses optimal linear space (that is, blocks of disk space) and achieves the optimal I/O query bound and I/O update bound, where is the I/O block size and the number of elements in the answer to a query. Our structure is also the first optimal external data structure for a 2-dimensional range searching problem that has worst-case as opposed to amortized update bounds. Part of the data structure uses a novel balancing technique for efficient worst-case manipulation of balanced trees, which is of independent interest.

- M. Wang, B. Iyer, and J. S. Vitter.
“Scalable Mining for Classification Rules in Relational
Databases,”
*Herman Rubin Festschrift*, Lecture Notes Monograph Series,**45**, Institute of Mathematical Statistics, Hayward, CA, Fall 2004.Classification is a key function of many “business intelligence” toolkits and a fundamental building block in data mining. Immense data may be needed to train a classifier for good accuracy. The state-of-art classifiers need an in-memory data structure of size , where is the size of the training data, to achieve efficiency. For large data sets, such a data structure will not fit in the internal memory. The best previously known classifier does a quadratic number of I/Os for large .

In this paper, we propose a novel classification algorithm (classifier) called MIND (MINing in Databases). MIND can be phrased in such a way that its implementation is very easy using the extended relational calculus SQL, and this in turn allows the classifier to be built into a relational database system directly. MIND is truly scalable with respect to I/O efficiency, which is important since scalability is a key requirement for any data mining algorithm.

We built a prototype of MIND in the relational database manager DB2 and benchmarked its performance. We describe the working prototype and report the measured performance with respect to the previous method of choice. MIND scales not only with the size of the datasets but also with the number of processors on an IBM SP2 computer system. Even on uniprocessors, MIND scales well beyond the dataset sizes previously published for classifiers. We also give some insights that may have an impact on the evolution of the extended relational calculus SQL.

- L. Arge, J. S. Chase, L. Toma, J. S. Vitter, R. Wickremesinghe, P. Halpin, and D. Urban,
“Efficient Flow Computation on Massive Grid Terrain
Datasets,”
*Geoinformatica*,**7**(4), December 2003, 283-313. An extended abstract appears in “Flow Computation on Massive Grids,”*Proceedings of the 9th ACM International Symposium on Advances in Geographic Information Systems (ACM-GIS '01)*Atlanta, GA, November 2001, 82-87.As detailed terrain data becomes available, GIS terrain applications target larger geographic areas at finer resolutions. Processing the massive data involved in such applications presents significant challenges to GIS systems and demands algorithms that are optimized both for data movement and computation. In this paper we develop efficient algorithms for flow routing on massive terrains, extending our previous work on flow accumulation. We have implemented these algorithms in the Terraflow system, which is the first comprehensive terrain flow software system designed and optimized for massive data. We compare the performance of Terraflow with that of state of the art commercial and open-source GIS systems. On large terrains, Terraflow outperforms existing systems by a factor of 2 to 1000, and is capable of solving problems no system was previously able to solve.

- J. S. Vitter and D. A. Hutchinson.
“Distribution Sort with Randomized Cycling,”
*Journal of the ACM*,**53**(7), July 2006, 656-680. A shorter version appeared in*Proceedings of the 12th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '01)*, Washington, DC, January 2001.Parallel independent disks can enhance the performance of external memory (EM) algorithms, but the programming task is often difficult. In this paper we develop randomized variants of distribution sort for use with parallel independent disks. We propose a simple variant called randomized cycling distribution sort (RCD) and prove that it has optimal expected I/O complexity. The analysis uses a novel reduction to a model with significantly fewer probabilistic interdependencies. Experimental evidence is provided to support its practicality. Other simple variants are also examined experimentally and appear to offer similar advantages to RCD. Based upon ideas in RCD we propose general techniques that transparently simulate algorithms developed for the unrealistic multihead disk model so that they can be run on the realistic parallel disk model. The simulation is optimal for two important classes of algorithms: the class of multipass algorithms, which make a complete pass through their data before accessing any element a second time, and the algorithms based upon the well-known distribution paradigm of EM computation.

- L. Lim, M. Wang, and J. S. Vitter.
“SASH: A Self-Adaptive Histogram Set
for Dynamically Changing Workloads,”
*Proceedings of the 29th International Conference on Very Large Databases (VLDB '03)*, Berlin, Germany, September 2003.Most RDBMSs maintain a set of histograms for estimating the selectivities of given queries. These selectivities are typically used for cost-based query optimization. While the problem of building an accurate histogram for a given attribute or attribute set has been well-studied, little attention has been given to the problem of building and tuning a set of histograms collectively for multidimensional queries in a self-managed manner based only on query feedback.

In this paper, we present SASH, a Self-Adaptive Set of Histograms that addresses the problem of building and maintaining a set of histograms. SASH uses a novel two-phase method to automatically build and maintain itself using query feedback information only. In the online tuning phase, the current set of histograms is tuned in response to the estimation error of each query in an online manner. In the restructuring phase, a new and more accurate set of histograms replaces the current set of histograms. The new set of histograms (attribute sets and memory distribution) is found using information from a batch of query feedback. We present experimental results that show the effectiveness and accuracy of our approach.

- R. Grossi and J. S. Vitter.
“Compressed Suffix Arrays and Suffix Trees, with Applications to
Text Indexing and String Matching,”
*SIAM Journal on Computing*,**35**(2), 2005, 378-407. An extended abstract of the first appears in*Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC '00)*, Portland, OR, May 2000, 397-406.Slides for talk (Adobe pdf format)

The proliferation of online text, such as on the World Wide Web and in databases, motivates the need for space-efficient index methods that support fast search. Consider a text of binary symbols to index. Given any query pattern of binary symbols, the goal is to search for in quickly, with being fully scanned only once, namely, when the index is created. All indexing schemes published in the last thirty years support searching in worst-case time and require memory words (or bits), which is significantly larger than the text itself. In this paper we provide a breakthrough both in searching time and index space under the same model of computation as the one adopted in previous work. Based upon new compressed representations of suffix arrays and suffix trees, we construct an index structure that occupies only bits and compares favorably with inverted lists in space. We can search any binary pattern , stored in words, in only time.

Specifically, searching takes time for , and time for and any fixed . That is, we achieve optimal search time for sufficiently large . We can list all the pattern occurrences in optimal additional time when or when ; otherwise, listing takes additional time.

- R. Grossi, A. Gupta, and J. S. Vitter.
“High-Order Entropy-Compressed Text Indexes,”
*Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '03)*, Baltimore, MD, January 2003, 841-850. For an updated and more detailed version, see “An Algorithmic Framework for Compression and Text Indexing.”We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of symbols over an alphabet , where each symbol is encoded by bits. We show that compressed suffix arrays use just bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length in time. The term denotes the th-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that and the alphabet size is small, we obtain a text index with search time that requires only bits. We also report further results and tradeoffs on high-order entropy-compressed text indexes.

- L. Foschini, R. Grossi, A. Gupta, and J. S. Vitter.
“When Indexing Equals Compression: Experiments on Suffix Arrays
and Trees,”
*ACM Transactions on Algorithms*,**2**(4), 2006, 611-639. An extended abstract appears in R. Grossi, A. Gupta, and J. S. Vitter, “When Indexing Equals Compression: Experiments with Compressing Suffix Arrays and Applications,”*Proceedings of the 15th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '04)*, New Orleans, LA, January 2004, 636-645..We report on a new and improved version of high-order entropy-compressed suffix arrays, which has theoretical performance guarantees comparable to previous work, yet represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size -- without requiring a separate instance of the text -- and support fast and powerful searches. To our knowledge, this is the best known method in terms of space for fast searching. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows-Wheeler transform), achieving a slightly better compression ratio than

`bzip2`. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with`gzip`. - L. Foschini, R. Grossi, A. Gupta, and J. S. Vitter.
“Fast Compression with a Static Model in High-Order Entropy.”
*Proceedings of the 2004 IEEE Data Compression Conference (DCC '04)*, Snowbird, UT, March 2004, 23-25.We report on a simple encoding format called

`wzip`for decompressing block-sorting transforms, such as the Burrows-Wheeler Transform (BWT). Our compressor uses the simple notions of gamma encoding and RLE organized with a wavelet tree to achieve a slightly better compression ration than`bzip2`in less time. In fact, our compression/decompression time is dependent upon , the empirical th order entropy. Another key contribution of our compressor is its simplicity. Our compressor can also operate as a full-text index with a small amount of data, while still preserving backward compatibility with just the compressor. - L. Lim, M. Wang, S. Padmanabhan, J. S. Vitter, and R. Agarwal.
“Efficient Update of Indexes for Dynamically Changing Web
Documents,”
*World Wide Web*,**10**(1), March 2007, 37-69. An extended abstract appears in “Dynamic Maintenance of Web Indexes Using Landmarks,”*Proceedings of the 12th International World Wide Web Conference (WWW '03)*, Budapest, May 2003, 102-111.Recent work on incremental crawling has enabled the indexed document collection of a search engine to be more synchronized with the changing World Wide Web. However, this synchronized collection is not immediately searchable, because the keyword index is rebuilt from scratch less frequently than the collection can be refreshed. An inverted index is usually used to index documents crawled from the web. Complete index rebuild at high frequency is expensive. Previous work on incremental inverted index updates have been restricted to adding and removing documents. Updating the inverted index for previously indexed documents that have changed has not been addressed.

In this paper, we propose an efficient method to update the inverted index for previously indexed documents whose contents have changed. Our method uses the idea of landmarks together with the

`diff`algorithm to significantly reduce the number of postings in the inverted index that need to be updated. Our experiments verify that our landmark-diff method results in significant savings in the number of update operations on the inverted index. - A. Gupta, W. Hon, R. Shah, and J. S. Vitter.
“Compressed Data Structures: Dictionaries and the
Data-Aware Measures,”
*Theoretical Computer Science*,**387**(3), November 2007, 313-331.We propose measures for compressed data structures, in which space usage is measured in a data-aware manner. In particular, we consider the fundamental dictionary problem on set data, where the task is to construct a data structure to represent a set of items out of a universe and support various queries on . We use a well-known data-aware measure for set data called gap to bound the space of our data structures. We describe a novel dictionary structure taking bits. Under the RAM model, our dictionary supports membership, rank, select, and predecessor queries in nearly optimal time, matching the time bound of Andersson and Thorup's predecessor structure, while simultaneously improving upon their space usage. Our dictionary structure uses exactly gap bits in the leading term (i.e., the constant factor is ) and answers queries in near-optimal time. When seen from the worst case perspective, we present the first -bit dictionary structure which supports these queries in near-optimal time under RAM model. We also build a dictionary which requires the same space and supports membership, select, and partial rank queries even more quickly in time. To the best of our knowledge, this is the first of a kind result which achieves data-aware space usage and retains near-optimal time.

- R. Grossi, A. Gupta, and J. S. Vitter.
“An Algorithmic Framework for Compression and Text Indexing,”
manuscript.
This work is an extension of two separate pieces of
research in conference form:
“High-Order Entropy-Compressed Text Indexes,”
*Proceedings of the 14th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA '03)*, Baltimore, MD, January 2003, 841-850; and “Nearly Tight Bounds on the Encoding Length of the Burrows-Wheeler Transform,”*Proceedings of the 5th Workshop on Analytical Algorithmics and Combinatorics (ANALCO '08)*, San Francisco, CA, January 2008, 191-202.We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text of symbols drawn from an alphabet , our bounds are stated in terms of the th-order empirical entropy of the text, . In particular, we provide a tight analysis of the Burrows-Wheeler transform (BWT) establishing a bound of bits, where denotes the asymptotic number of bits required to store the empirical statistical model for contexts of order up to appearing in . Using the same framework, we also obtain an implementation of the compressed suffix array (CSA) which achieves bits of space while still retaining competitive full-text indexing functionality.

The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds since , where and do not depend on the text length , while is the modified th-order empirical entropy of . Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.

- R. Shah, P. J. Varman, and J. S. Vitter.
“Online Algorithms for Prefetching and Caching in Parallel
Disks,”
*Proceedings of the 16th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '04)*, Barcelona, Spain, June 2004.Parallel disks provide a cost effective way of speeding up I/Os in applications that work with large amounts of data. The main challenge is to achieve as much parallelism as possible, using prefetching to avoid bottlenecks in disk access. Efficient algorithms have been developed for some particular patterns of accessing the disk blocks. In this paper, we consider general request sequences. When the request sequence consists of unique block requests, the problem is called prefetching and is a well-solved problem for arbitrary request sequences. When the reference sequence can have repeated references to the same block, we need to devise an effective caching policy as well. While optimum offline algorithms have been recently designed for the problem, in the online case, no effective algorithm was previously known. Our main contribution is a deterministic online algorithm threshold-LRU which achieves competitive ratio and a randomized online algorithm threshold-MARK which achieves competitive ratio for the caching/prefetching problem on the parallel disk model (PDM), where is the number of disks, is the size of fast memory buffer, and is the amount of lookahead available in the request sequence. The best-known lower bound on the competitive ratio is for lookahead in both models. We also show that if the deterministic online algorithm is allowed to have twice the memory of the offline then a tight competitive ratio of can be achieved. This problem generalizes the well-known paging problem on a single disk to the parallel disk model.

- T. M. Ghanem, R. Shah, M. F. Mokbel, W. G. Aref, and J. S. Vitter.
“Bulk Operations for Space-Partitioning Trees,”
*Proceedings of the 20th Annual IEEE International Conference on Data Engineering (ICDE '04)*, Boston, March-April 2004.The emergence of extensible index structures, e.g., GiST (Generalized Search Tree) and SP-GiST (Space-Partitioning Generalized Search Tree), calls for a set of extensible algorithms to support different operations (e.g., insertion, deletion, and search). Extensible bulk operations (e.g., bulk loading and bulk insertion) are of the same importance and need to be supported in these index engines. In this paper, we propose two extensible buffer-based algorithms for bulk operations in the class of space-partitioning trees; a class of hierarchical data structures that recursively decompose the space into disjoint partitions. The main idea of these algorithms is to build an in-memory tree of the target space-partitioning index. Then, data items are recursively partitioned into disk-based buffers using the in-memory tree. Although the second algorithm is designed for bulk insertion, it can be used in bulk loading as well. The proposed extensible algorithms are implemented inside SP-GiST; a framework for supporting the class of space-partitioning trees. Both algorithms have I/O bound , where is the number of data items to be bulk loaded/inserted, is the number of tree nodes that can fit in one disk page, is the tree height in terms of pages after applying a clustering algorithm. Experimental results are provided to show the scalability and applicability of the proposed algorithms for the class of space-partitioning trees. A comparison of the two proposed algorithms shows that the first algorithm performs better in case of bulk loading. However the second algorithm is more general and can be used for efficient bulk insertion.

- S. Muthukrishnan, R. Shah, and J. S. Vitter.
“Mining Deviants in Time Series Data Streams,”
*Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM '04)*, Santorini Island, Greece, June 2004.One of the central tasks in managing, monitoring and mining data streams is that of identifying outliers. There is a long history of study of various outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of deviants from Jagadish et al as outliers. Deviants are based on one of the most fundamental statistical concept of standard deviation (or variance). Formally, deviants are defined based on a representation sparsity metric, i.e., deviants are values whose removal from the dataset leads to an improved compressed representation of the remaining items. Thus, deviants are not global maxima/minima, but rather these are appropriate local aberrations. Deviants are known to be of great mining value in time series databases. We present first-known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time. For all versions of this problem--univariate vs multivariate time series, optimal vs nearoptimal vs heuristic solutions, offline vs streaming--our algorithms have the same framework of maintaining a hierarchical set of candidate deviants that are updated as the time series data gets progressively revealed. We show experimentally using real network traffic data (SNMP aggregate time series) as well as synthetic data that our algorithm is remarkably accurate in determining the deviants.

- I. Ilyas, R. Shah, W. G. Aref, J. S. Vitter, and A. Elmagarmid.
“Rank-aware Query Optimization,”
*Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD '04)*, Paris, France, June 2004.Ranking is an important property that needs to be fully supported by current relational query engines. Recently, several rank-join query operators have been proposed based on rank aggregation algorithms. Rank-join operators progressively rank the join results while performing the join operation. The new operators have a direct impact on traditional query processing and optimization. We introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this paper is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. We experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. The experiments show the validity of our framework and the accuracy of the proposed estimation model.

- R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter.
“Efficient Indexing Methods for Probabilistic Threshold
Queries over Uncertain Data,”
*Proceedings of the 30th International Conference on Very Large Databases (VLDB '04)*, Toronto, CA, August 2004.It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of uncertainty intervals that model each data item as a range and associated probability density function (pdf) rather than a single value has recently been proposed. Querying these uncertain data introduces imprecision into answers, in the form of probability values that specify the likeliness the answer satisfies the query. These queries are more expensive to evaluate than their traditional counterparts but are guaranteed to be correct and more informative due to the probabilities accompanying the answers. Although the answer probabilities are useful, for many applications, it is only necessary to know whether the probability exceeds a given threshold; we term these Probabilistic Threshold Queries (PTQ). In this paper we address the efficient computation of these types of queries.

In particular, we develop two index structures and associated algorithms to efficiently answer PTQs. The first index scheme is based on the idea of augmenting uncertainty information to an R-tree. We establish the difficulty of this problem by mapping one-dimensional intervals to a two-dimensional space, and show that the problem of interval indexing with probabilities is significantly harder than interval indexing which is considered a well-studied problem. To overcome the limitations of this R-tree based structure, we apply a technique we call variance-based clustering, where data points with similar degrees of uncertainty are clustered together. Our extensive index structure can answer the queries for various kinds of uncertainty pdfs, in an almost optimal sense. We conduct experiments to validate the superior performance of both indexing schemes.

- L. Lim, M. Wang, and J. S. Vitter.
“CXHist: An On-line Classification-based Histogram for XML
String Selectivity Estimation,”
*Proceedings of the 31st International Conference on Very Large Databases (VLDB '05)*, Trondheim, Norway, August-September 2005.Query optimization in IBM's System RX, the first truly hybrid relational-XML data management system, requires accurate selectivity estimation of path-value pairs, i.e., the number of nodes in the XML tree reachable by a given path with the given text value. Previous techniques have been inadequate, because they have focused mainly on the tag-labeled paths (tree structure) of the XML data. For most real XML data, the number of distinct string values at the leaf nodes is orders of magnitude larger than the set of distinct rooted tag paths. Hence, the real challenge lies in accurate selectivity estimation of the string predicates on the leaf values reachable via a given path.

In this paper, we present CXHist, a novel workload-aware histogram technique that provides accurate selectivity estimation on a broad class of XML string-based queries. CXHist builds a histogram in an on-line manner by grouping queries into buckets using their true selectivity obtained from query feedback. The set of queries associated with each bucket is summarized into feature distributions. These feature distributions mimic a Bayesian classifier that is used to route a query to its associated bucket during selectivity estimation. We show how CXHist can be used for two general types of (path,string) queries: exact match queries and substring match queries. Experiments using a prototype show that CXHist provides accurate selectivity estimation for both exact match queries and substring match queries.

- I. Ilyas, W. G. Aref, A. K. Elmagarmid, H. G. Elmongui, R. Shah, and J. S. Vitter.
“Adaptive Rank-aware Query Optimization in
Relational Databases,”
*ACM Transactions on Database Systems*,**31**(4), December 2006, 1257-1304.Rank-aware query processing has emerged as a key requirement in modern applications. In these applications, efficient and adaptive evaluation of top-k queries is an integral part of the application semantics. In this article, we introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. The framework is based on extending the System R dynamic programming algorithm in both enumeration and pruning. We define ranking as an interesting physical property that triggers the generation of rank-aware query plans. Unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators.We introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. To our knowledge, this is the first effort in estimating the needed input size for optimal rank aggregation algorithms. Costing ranking plans is key to the full integration of rank-join operators in real-world query processing engines.

Since optimal execution strategies picked by static query optimizers lose their optimality due to estimation errors and unexpected changes in the computing environment, we introduce several adaptive execution strategies for top-k queries that respond to these unexpected changes and costing errors. Our reactive reoptimization techniques change the execution plan at runtime to significantly enhance the performance of running queries. Since top-k query plans are usually pipelined and maintain a complex ranking state, altering the execution strategy of a running ranking query is an important and challenging task.

We conduct an extensive experimental study to evaluate the performance of the proposed framework. The experimental results are twofold: (1) we show the effectiveness of our cost-based approach of integrating ranking plans in dynamic programming cost-based optimizers; and (2) we show a significant speedup (up to 300%) when using our adaptive execution of ranking plans over the state-of-the-art mid-query reoptimization strategies.

- R. Cheng, Y. Xia, S. Prabhakar, R. Shah, and J. S. Vitter.
“Efficient Join Processing over Uncertain-Valued Attributes,”
*Proceedings of the 2006 ACM Conference on Information and Knowledge Management (CIKM '06)*, Arlington, VA, November 2006.In an uncertain database, each data item is modeled as a range associated with a probability density function. Previous works for this kind of data have focused on simple queries such as range and nearest-neighbor queries. Queries that join multiple relations have not been addressed in earlier work despite the significance of joins in databases. In this paper, we address probabilistic join over uncertain data, essentially a query that augments the results with probability guarantees to indicate the likelihood of each join tuple being part of the result. We extend the notion of join operators, such as equality and inequality, for uncertain data. We also study the performance of probabilistic join. We observe that a user may only need to know whether the probability of the results exceeds a given threshold, instead of the precise probability value. By incorporating this constraint, it is possible to achieve much better performance. In particular, we develop three sets of optimization techniques, namely item-level, page-level and index-level pruning, for different join operators. These techniques facilitate pruning with little space and time overhead, and are easily adapted to most join algorithms. We verify the performance of these techniques experimentally.

- W.-K. Hon, T.-W. Lam, R. Shah, S.-L. Tam, and J. S. Vitter.
“A Cache-Oblivious Index for Approximate String
Matching,”
*Theoretical Computer Science***412**(29), 2011, 3579-3588. A shorter version appears in*Proceedings of the 16th Annual Conference on Combinatorial Pattern Matching (CPM '07)*, London, Ontario, Canada, July 2007, published in Lecture Notes in Computer Science,**4580**Springer-Verlag, Berlin, Germany, 40-51.This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text of length and a positive integer , we want to construct an index of such that for any input pattern , we can find all its -error matches in efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies disk pages and finds all -error matches with I/Os, where denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require I/Os. The second index reduces the space to disk pages, and the I/O complexity is .

- M. Y. Eltabakh, W.-K. Hon, R. Shah, W. Aref, and J. S. Vitter.
“The SBC-tree: An Index for Run-Length Compressed Sequences,”
*Proceedings of the 11th International Conference on Extending Database Technology (EDBT '08)*, Nantes, France, March 2008, 523-534.Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological se- quence databases, multimedia, and facsimile transmission. One of the main challenges is how to operate, e.g., indexing, searching, and retrieval, on the compressed data without decompressing it. In this paper, we present the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of pages, where is the total length of the compressed sequences, and is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m takes I/O operations. Substring matching, prefix matching, and range search execute in an optimal I/O operations, where is the length of the compressed query pattern and is the query output size. We present also two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure, and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worst-case theoretical bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in the context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.

- A. Gupta, W.-K. Hon, R. Shah, and J. S. Vitter.
“Dynamic Rank/Select Dictionaries with Applications to XML Indexing,”
technical report.
We consider a central problem in text indexing: Given a text over an alphabet , construct a compressed data structure answering the queries , , and for a symbol . Many data structures consider these queries for static text . We consider the dynamic version of the problem, where we are allowed to insert and delete symbols at arbitrary positions of . This problem is a key challenge in compressed text indexing and has direct application to dynamic XML indexing structures that answer subpath queries [XBW].

We build on the results of [RRR, GMR] and give the best known query bounds for the dynamic version of this problem, supporting arbitrary insertions and deletions of symbols in . Specifically, with an amortized update time of , we suggest how to support , , and queries in time, for any . The best previous query times for this problem were , given by [Makinen Navarro]. Our bounds are competitive with state-of-the-art static structures [GMR]. Some applicable lower bounds for the partial sums problem [PD] show that our update/query tradeoff is also nearly optimal. In addition, our space bound is competitive with the corresponding static structures. For the special case of bitvectors (i.e., ), we also show the best tradeoffs for query/update time, improving upon the results of [Makinen Navarro, Hon, RRR].

Our focus on fast query/slower update is well-suited for a query-intensive XML indexing environment. Using the XBW transform [XBW], we also present a dynamic data structure that succinctly maintains an ordered labeled tree and supports a powerful set of queries on .

- A. Gupta, W.-K. Hon, R. Shah, and J. S. Vitter.
“A Framework for Dynamizing Succinct Data Structures,”
*Proceedings of the 34th Annual International Colloquium on Automata, Languages, and Programming (ICALP '07)*, Wrocaw, Poland, July 2007, published in Lecture Notes in Computer Science,**4596**Springer-Verlag, Berlin, Germany, 521-532.We present a framework to dynamize succinct data structures, to encourage their use over non-succinct versions in a wide variety of important application areas. Our framework can dynamize most state-of-the-art succinct data structures for dictionaries, ordinal trees, labeled trees, and text collections. Of particular note is its direct application to XML indexing structures that answer queries. Our framework focuses on achieving information-theoretically optimal space along with near-optimal update/query bounds.

As the main part of our work, we consider the following problem central to text indexing: Given a text over an alphabet , construct a compressed data structure answering the queries , , and for a symbol . Many data structures consider these queries for static text . We build on these results and give the best known query bounds for the dynamic version of this problem, supporting arbitrary insertions and deletions of symbols in .

Specifically, with an amortized update time of , any static succinct data structure for , taking time for queries, can be converted by our framework into a dynamic succinct data structure that supports , , and queries in time, for any constant . When , we achieve query times. Our update/query bounds are near-optimal with respect to the lower bounds.

- R. Grossi, A. Gupta, and J. S. Vitter.
“Nearly Tight Bounds on the Encoding Length of the
Burrows-Wheeler Transform,”
*Proceedings of the 5th Workshop on Analytical Algorithmics and Combinatorics (ANALCO '08)*, San Francisco, CA, January 2008, 191-202. For a more detailed version, see “An Algorithmic Framework for Compression and Text Indexing.”We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text of symbols drawn from an alphabet , our bounds are stated in terms of the th-order empirical entropy of the text, . In particular, we provide a tight analysis of the Burrows-Wheeler transform (BWT) establishing a bound of bits, where denotes the asymptotical number of bits required to store the empirical statistical model for contexts of order appearing in . Using the same framework, we also obtain an implementation of the compressed suffix array (CSA) that achieves bits of space while still retaining competitive full-text indexing functionality.

The novelty of the proposed framework lies in its use of the finite set model instead of the empirical probability model (as in previous work), giving us new insight into the design and analysis of our algorithms. For example, we show that our analysis gives improved bounds since , where and do not depend on the text length , while is the modified th-order empirical entropy of . Moreover, we show a strong relationship between a compressed full-text index and the succinct dictionary problem. We also examine the importance of lower-order terms, as these can dwarf any savings achieved by high-order entropy. We report further results and tradeoffs on high-order entropy-compressed text indexes in the paper.

- Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter.
“Geometric Burrows-Wheeler Transform: Linking Range
Searching and Text Indexing,”
*Proceedings of the 2008 IEEE Data Compression Conference (DCC '08)*, Snowbird, UT, March 2008.We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called Geometric Burrows-Wheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer to many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position restricted pattern matching? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the best known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching.

- W.-K. Hon, R. Shah, P. J. Varman, and J. S. Vitter.
“Tight Competitive Ratios for Parallel Disk Prefetching,”
*Proceedings of the 20th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '08)*, Munich, Germany, June 2008, 352-361.We consider the natural extension of the well-known single disk caching problem to the parallel disk I/O model (PDM) [17]. The main challenge is to achieve as much parallelism as possible and avoid I/O bottlenecks. We are given a fast memory (cache) of size memory blocks along with a request sequence where each block resides on one of disks. In each parallel I/O step, at most one block from each disk can be fetched. The task is to serve in the minimum number of parallel I/Os. Thus, each I/O is analogous to a page fault. The difference here is that during each page fault, up to blocks can be brought into memory, as long as all of the new blocks entering the memory reside on different disks. The problem has a long history. Note that this problem is non-trivial even if all requests in are unique. This restricted version is called read-once. Despite the progress in the online ver- sion and read-once version, the general online problem still remained open. Here, we provide comprehensive results with a full general solution for the problem with asymptotically tight competitive ratios.

To exploit parallelism, any parallel disk algorithm needs a certain amount of lookahead into future requests. To provide effective caching, an online algorithm must achieve competitive ratio. We show a lower bound that states, for lookahead , that any online algorithm must be -competitive. For lookahead greater than , where is a constant, the tight upper bound of on competitive ratio is achieved by our algorithm SKEW. The previous algorithm tLRU was competitive and this was also shown to be tight for an LRU-based strategy. We achieve the tight ratio using a fairly different strategy than LRU. We also show tight results for randomized algorithms against oblivious adversary and give an algorithm achieving better bounds in the resource augmentation model.

- J. S. Vitter.
*Algorithms and Data Structures for External Memory*--*main reference!*Series on Foundations and Trends in Series on Foundations and Trends in Theoretical Computer Science, now Publishers, Hanover, MA, 2008. Also published as Volume 2, Issue 4 of*Foundations and Trends in Theoretical Computer Science*. This reference book gives a good general introduction to algorithms and data structures for use in external memory, when I/O can be a bottleneck.Slides for a talk (Adobe pdf format)

Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. In this book we discuss the state of the art in the design and analysis of external memory (or EM) algorithms and data structures, where the goal is to exploit locality in order to reduce the I/O costs. We consider a variety of EM paradigms for solving batched and online problems efficiently in external memory.

For the batched problem of sorting and related problems like permuting and fast Fourier transform, the key paradigms include distribution and merging. The paradigm of disk striping offers an elegant way to use multiple disks in parallel. For sorting, however, disk striping can be nonoptimal with respect to I/O, so to gain further improvements we discuss prefetching, distribution, and merging techniques for using the disks independently. We also consider useful techniques for batched EM problems involving matrices (such as matrix multiplication and transposition), geometric data (such as finding intersections and constructing convex hulls) and graphs (such as list ranking, connected components, topological sorting, and shortest paths). In the online domain, canonical EM applications include dictionary lookup and range searching. The two important classes of indexed data structures are based upon extendible hashing and B-trees. The paradigms of filtering and bootstrapping provide a convenient means in online data structures to make effective use of the data accessed from disk. We also reexamine some of the above EM problems in slightly different settings, such as when the data items are moving, when the data items are variable-length (e.g., text strings), when the internal data representations are compressed, or when the allocated amount of internal memory can change dynamically.

Programming tools and environments are available for simplifying the EM programming task. During the course of the book, we report on some experiments in the domain of spatial databases using the TPIE system (Transparent Parallel I/O programming Environment). The newly developed EM algorithms and data structures that incorporate the paradigms we discuss are significantly faster than methods currently used in practice.

This book is an expanded version of a shorter survey article.

- P. Ferragina, R. Grossi, A. Gupta, R. Shah, and J. S. Vitter.
“On Searching Compressed String Collections Cache-Obliviously,”
*Proceedings of the 27th Annual ACM Symposium on Principles of Database Systems (PODS '08)*, Vancouver, Canada, June 2008.Current data structures for searching large string collections are limited in that they either fail to achieve minimum space or they cause too many cache misses. In this paper, we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and storable in compressed space. The widely known frontcoding scheme is one example of linearization; it is at the core of Prefix B-trees and many other disk-conscious compressed indexes for string collections. However, it is largely thought of as a space-effective heuristic without efficient search support.

In this paper, we introduce new insights on front-coding and other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. The second contribution of this paper engineers these linearizations to design a novel dictionary encoding scheme that achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries and obtain a succinct variant, whose space is close to the information-theoretic minimum.

- W.-K. Hon, R. Shah, P. J. Varman, and J. S. Vitter.
“Tight Competitive Ratios for Parallel Disk Prefetching,”
*Proceedings of the 20th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '08)*, Munich, Germany, June 2008, 352-361.We consider the natural extension of the well-known single disk caching problem to the parallel disk I/O model (PDM) [17]. The main challenge is to achieve as much parallelism as possible and avoid I/O bottlenecks. We are given a fast memory (cache) of size memory blocks along with a request sequence where each block resides on one of disks. In each parallel I/O step, at most one block from each disk can be fetched. The task is to serve in the minimum number of parallel I/Os. Thus, each I/O is analogous to a page fault. The difference here is that during each page fault, up to blocks can be brought into memory, as long as all of the new blocks entering the memory reside on different disks. The problem has a long history. Note that this problem is non-trivial even if all requests in are unique. This restricted version is called read-once. Despite the progress in the online ver- sion and read-once version, the general online problem still remained open. Here, we provide comprehensive results with a full general solution for the problem with asymptotically tight competitive ratios.

To exploit parallelism, any parallel disk algorithm needs a certain amount of lookahead into future requests. To provide effective caching, an online algorithm must achieve competitive ratio. We show a lower bound that states, for lookahead , that any online algorithm must be -competitive. For lookahead greater than , where is a constant, the tight upper bound of on competitive ratio is achieved by our algorithm SKEW. The previous algorithm tLRU was competitive and this was also shown to be tight for an LRU-based strategy. We achieve the tight ratio using a fairly different strategy than LRU. We also show tight results for randomized algorithms against oblivious adversary and give an algorithm achieving better bounds in the resource augmentation model.

- Y.-F. Chien, W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter,
“Geometric BWT: Compressed Text Indexing via Sparse Suffixes and Range Searching,”
*Algorithmica*,**71**, 2015, published online May 2013, 258-178. A shorter version of some results appeared in W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter, “On Entropy-Compressed Text Indexing in External Memory,”*Proceedings of the 16th International Conference on String Processing and Information Retrieval (SPIRE '09)*, Saariselkä, Finland, August 2009, published in Lecture Notes in Computer Science,**5721**, Springer, Berlin, Germany, 75-89.We introduce a new variant of the popular Burrows-Wheeler transform (BWT), called Geometric Burrows-Wheeler Transform (GBWT), which converts a text into a set of points in 2-dimensional geometry.We also introduce a reverse transform, called Points2Text, which converts a set of points into text. Using these two transforms, we show strong equivalence between data structural problems in geometric range searching and text pattern matching. This allows us to apply the lower bounds known in the field of orthogonal range searching to the problems in compressed text indexing. In addition, we give the first succinct (compact) index for I/O-efficient pattern matching in external memory, and show how this index can be further improved to achieve higher-order entropy compressed space.

- S.-Y. Chiu, W.-K. Hon, R. Shah, and J. S. Vitter.
“I/O-efficient Compressed Text Indexes: From Theory to Practice,”
*Proceedings of the 2010 IEEE Data Compression Conference (DCC '10)*, Snowbird, UT, March 2010.Slides for talk (gzip-compressed postscript)

Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular external-memory index structure is the string B-tree. However, the practical applicability of these indexes has been limited mainly because of their space consumption and I/O issues. These structures use a lot more space (almost 20 to 50 times more) than the original text data and are often disk-resident.

Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al (2008) presented a compact text index in the external memory based on the concept of Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in the external memory.

In this paper, we investigate this issue from a practical point of view. On the positive side we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude as compared to the standard String B-tree. While saving space, these structures also maintain a comparable I/O efficiency to that of String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.

- W.-K. Hon, R. Shah, and J. S. Vitter.
“Compression, Indexing, and Retrieval for Massive String Data,”
invited keynote paper in
*Proceedings of the 19th Annual Conference on Combinatorial Pattern Matching (CPM '10)*, New York, NY, June 2010, published in Lecture Notes in Computer Science,**4580**Springer, Berlin, Germany, 40-51. This keynote address from CPM 2010 gives a quick history of indexing using compressed data structures, as well as some of the current challenges to realize their full potential.Slides for CPM '10 keynote talk (Adobe pdf)

The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.

- J. S. Vitter.
“Compressed Data Structures with Relevance,"
invited keynote abstract,
*Proceedings of the 2012 ACM Conference on Information and Knowledge Management (CIKM '12)*, Maui, Hawaii, October-November 2012.Slides for CIKM '12 keynote talk (Adobe pdf)

We describe recent breakthroughs in the field of compressed data structures, in which the data structure is stored in a compressed representation that still allows fast answers to queries. We focus in particular on compressed data structures to support the important application of pattern matching on massive document collections. Given an arbitrary query pattern in textual form, the job of the data structure is to report all the locations where the pattern appears. Another variant is to report all the documents that contain at least one instance of the pattern. We are particularly interested in reporting only the most relevant documents, using a variety of notions of relevance. We discuss recently developed techniques that support fast search in these contexts as well as under additional positional and temporal constraints.

- S. V. Thankachan, W.-K. Hon, M. Patil, R. Shah, and J. S. Vitter.
“Indexes for Document Retrieval with Relevance,”
*Ian Munro Festschrift*, August 2013, published in Lecture Notes in Computer Science, Festschrift Series, Springer, Berlin, Germany.Document retrieval is a special type of pattern matching that is closely related to information retrieval and web searching. In this problem, the data consist of a collection of text documents, and given a query pattern , we are required to report all the documents (not all the occurrences) in which this pattern occurs. In addition, the notion of relevance is commonly applied to rank all the documents that satisfy the query, and only those documents with the highest relevance are returned. Such a concept of relevance has been central in the effectiveness and usability of present day search engines like Google, Bing, Yahoo, or Ask. When relevance is considered, the query has an additional input parameter , and the task is to report only the documents with the highest relevance to , instead of finding all the documents that contains . For example, one such relevance function could be the frequency of the query pattern in the document. In the information retrieval literature, this task is best achieved by using inverted indexes. However, if the query consists of an arbitrary string--which can be a partial word, multiword phrase, or more generally any sequence of characters--we cannot take advantages of the word boundaries and we need a different approach.

This leads to one of the active research topics in string matching and text indexing community in recent years, and various aspects of the problem have been studied, such as space-time tradeoffs, practical solutions, multipattern queries, and I/O-efficiency. In this article, we review some of the initial frameworks for designing such indexes and also summarize the developments in this area.

- R. Shah, C. Sheng, S. V. Thankachan, and J. S. Vitter.
“Top- Document Retrieval in External Memory,”
*Proceedings of the 21st Annual European Symposium on Algorithms (ESA '13)*, Sophia-Antipolis, France, September 2013, published in Lecture Notes in Computer Science, Springer, Berlin, Germany.Let be a given set of (string) documents of total length . The top- document retrieval problem is to index such that when a pattern of length , and a parameter come as a query, the index returns those documents which are most relevant to . Hon et al. [HSV09] proposed a linear space framework to solve this problem in time. This query time was improved to by Navarro and Nekrich [NN12]. These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. Despite of continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in external memory model have so far been elusive. In this paper, we propose the first external memory index supporting top- document retrieval queries (outputs unsorted) in optimal I/Os, where is the block size. The index space is almost linear words, where is the iterated logarithm of . We also improve the existing internal memory results. Specifically, we propose a linear space index for retrieving top- documents in time, once the locus of the pattern match is given.

- Y. Nekrich and J. S. Vitter.
“Optimal Color Range Reporting in One Dimension,”
*Proceedings of the 21st Annual European Symposium on Algorithms (ESA '13)*, Sophia-Antipolis, France, September 2013, published in Lecture Notes in Computer Science, Springer, Berlin, Germany.Color (or categorical) range reporting is a variant of the orthogonal range reporting problem in which every point in the input is assigned a color. While the answer to an orthogonal point reporting query contains all points in the query range , the answer to a color reporting query contains only distinct colors of points in . In this paper we describe an -space data structure that answers one-dimensional color reporting queries in optimal time, where is the number of colors in the answer and is the number of points in the data structure. Our result can be also dynamized and extended to the external memory model.

- Y. Nekrich, M. Patil, R. Shah, S. V. Thankachan, and J. S. Vitter.
“`Categorical Range Maxima Queries,”
*Proceedings of the 33rd Annual ACM Symposium on Principles of Database Systems (PODS '14)*, Snowbird, UT, June 2014.Given an array A[1...n] of n distinct elements from the set 1, 2, ..., n a range maximum query RMQ(a, b) returns the highest element in A[a...b] along with its position. In this paper, we study a generalization of this classical problem called Categorical Range Maxima Query (CRMQ) problem, in which each element A[i] in the array has an associated category (color) given by C[i] ∈ [σ]. A query then asks to report each distinct color c appearing in C[a...b] along with the highest element (and its position) in A[a...b] with color c. Let pc denote the position of the highest element in A[a...b] with color c. We investigate two variants of this problem: a threshold version and a top-k version. In threshold version, we only need to output the colors with A[pc] more than the input threshold τ, whereas top-k variant asks for k colors with the highest A[pc] values. In the word RAM model, we achieve linear space structure along with O(k) query time, that can report colors in sorted order of A[•]. In external memory, we present a data structure that answers queries in optimal O(1+k/B) I/O's using almost-linear O(n log* n) space, as well as a linear space data structure with O(log* n + k/B) query I/Os. Here k represents the output size, log* n is the iterated logarithm of n and B is the block size. CRMQ has applications to document retrieval and categorical range reporting - giving a one-shot framework to obtain improved results in both these problems. Our results for CRMQ not only improve the existing best known results for three-sided categorical range reporting but also overcome the hurdle of maintaining color uniqueness in the output set.

- X. Chen, H. Huo, J. Huan, and J. S. Vitter.
“Efficient Graph Similarity Search in External Memory,”
*IEEE Access*,**5**, March 14, 2017, 4551-4560,`dx.doi.org/10.1109/ACCESS.2017.2682107`.Many real-world applications, such as bioinformatics, data mining, pattern recognition, and social network analysis, benefit from efficient solutions for the graph similarity search problem. Existing methods have limited scalability when they handle the large graph databases, for example, those with millions or billions of graphs that cannot fit in main memory. In this paper, we study the problem of graph similarity search under the graph edit distance constraint in external memory. We present an efficient framework for arbitrary q-gram based representations of a graph. Specifically, we propose a q-gram matrix index stored in hybrid layout in external memory to achieve efficient query processing, by converting the q-gram counting filter into a sparse matrix-vector multiplication (SpMV) problem. Furthermore, we also boost the query performance by transforming the global filter to a two-dimensional query rectangle, which allows us to perform a query in a reduced region, significantly reducing the number of query I/Os in practice. Extensive experiments on real datastes confirm that: (1) our method can compete with the state-of-the-art in-memory methods in index size and filtering ability, and outperform them on scalability of coping with the PubChem dataset including 25 million chemical structure graphs. (2) compared with the popular q-gram-based external inverted index, our external index structure needs much fewer number of query I/Os on the PubChem dataset.

- H. Huo, X. Chen, Y. Zhao, X. Zhu, and J. S. Vitter.
“Practical Succinct Text Indexes in External Memory,”
*Proceedings of the 2018 Data Compression Conference (DCC '18)*, Snowbird, UT, March 2018, 217-226.Chien et al. [1, 2] introduced the geometric Burrows-Wheeler transform (GBWT) as the first succinct text index for I/O-efficient pattern matching in external memory; it operates by transforming a text into point set in the two-dimensional plane. In this paper we introduce a practical succinct external memory text index, called mKD-GBWT. We partition into subregions by partitioning the x-axis into intervals using the suffix ranges of characters of and partitioning the y-axis into intervals using characters of , where is the alphabet size of . In this way, we can represent a point using fewer bits and perform a query in a reduced region so as to improve the space usage and I/Os of GBWT in practice. In addition, we plug a crit-bit tree into each node of string B-trees to represent variable-length strings stored. Experimental results show that mKD-GBWT provides significant improvement in space usage compared with the state-of-the-art indexing techniques. The source code is available online [3].

- H. Huo, X. Chen, X. Guo, and J. S. Vitter.
“Efficient Compression and Indexing for Highly Repetitive DNA Sequence Collections,”
*IEEE/ACM Transactions on Computational Biology and Bioinformatics*, January 2020.The development of the next-generation, high-throughput sequencing technologies dramatically reduces the cost of the next-generation sequencing (NGS) data production, thereby leading to the explosive growth in the NGS data.

In this paper, we focus upon the important problem of indexing and searching highly repetitive DNA sequence collections. Given a collection of sequences of length each, we can represent succinctly in bits using time, where is the th-order empirical entropy of the sequence that is used as the reference sequence, is the total number of variations between and the sequences in , and is a small fixed constant. We can restore the length- substring of in time and report the occurrences where occurs in in time. In addition, we propose a method to find the variations between and the sequences in , with which we can build succinct structures to enable fast search. For highly repetitive sequences, experimental results on the tested data demonstrate that the proposed method has significant advantages in space usage and retrieval time over the current state-of-the-art methods.

The source code is available online.

- H. Huo, P. Liu, C. Wang, H. Jiang, and J. S. Vitter.
“CIndex: Compressed Indexes for Fast Retrieval of FASTQ Files,”
*Bioinformatics*, published electronically September 15, 2021,`doi.org/10.1093/bioinformatics/btab655`, 9 pages.We propose a compressed index for FASTQ files called CIndex. CIndex uses the Burrows-Wheeler transform and the wavelet tree, combined with hybrid encoding, succinct data structures, and special tables, to achieve minimal space usage and fast retrieval on the compressed FASTQ files. Experiments conducted over real publicly available datasets from various sequencing instruments demonstrate that our proposed index substantially outperforms existing state-of-the-art solutions. For count, locate, and extract queries on reads, our method uses 2.7-41.66 percentage points less space and provides a speedup of 70-167.16 times, 1.44-35.57 times, and 1.3-55.4 times. For extracting records in FASTQ files, our method uses 2.86-14.88 percentage points less space and provides a speedup of 3.13-20.1 times. CIndex has an additional advantage in that it can be readily adapted to work as a general-purpose text index; experiments show that it performs very well in practice.

The software is available on Github: https://github.com/Hongweihuo-Lab/CIndex.

- H. Huo, C. Hong, and J. S. Vitter.
“Practical High-order Entropy-compressed Text Indexing Schemes,”
*IEEE Transactions on Knowledge and Data Engineering*, 2021,`doi.org/10.1109/TKDE.2021.3114401`. Source code published in*Code Ocean*, November 2020,`codeocean.com/capsule/3554560/tree/v1`.Compressed self-indexes are used widely in string processing applications, such as information retrieval, genome analysis, data mining, and web searching. The index not only indexes the data, but also encodes the data, and it is in compressed form. Moreover, the index and the data it encodes can be operated upon directly, without need to uncompress the entire index, thus saving time while maintaining small storage space. In some applications, such as in genome analysis, existing methods do not exploit the full possibilities of compressed self-indexes, and thus we seek faster and more space-efficient indexes. In this paper, we propose a practical high-order entropy-compressed self-index for efficient pattern matching in a text. We give practical implementations of compressed suffix arrays using a hybrid encoding in the representation of the neighbor function. We analyze the performance in theory and practice of our recommended indexing method, called GeCSA. We can improve retrieval time further using an iterated version of the neighbor function. Experimental results on the tested data demonstrate that the proposed index GeCSA has good overall advantages in space usage and retrieval time over the state-of-the-art indexing methods, especially on the repetitive data.