We introduce and analyze a new one-pass algorithm for constructing
dynamic Huffman codes and also analyze the one-pass algorithm due to
Faller, Gallager, and Knuth. In each algorithm, both the sender and the
receiver maintain equivalent dynamically varying Huffman trees, and the
coding is done in real time. We show that the number of bits used by
the new algorithm to encode a message containing
letters
is
bits more than that used by the conventional two-pass Huffman scheme,
independent of the alphabet size. This is best possible in the worst
case, for any one-pass Huffman method. Tight upper and lower bounds are
derived. Empirical tests show that the encodings produced by the new
algorithm are shorter than those of the other one-pass algorithm and,
except for long messages, are shorter than those of the two-pass method.
The new algorithm is well-suited for online encoding/decoding in data
networks and for file compression.
We present a Pascal implementation of the one-pass algorithm for
constructing dynamic Huffman codes that is described and analyzed in a
companion paper [Vitter, 1987]. The program runs in real time; that is,
the processing time for each letter of the message is proportional to
the length of its codeword. The number of bits used to encode a message
of
letters is less than
bits more than that used by the
well-known two-pass algorithm. This is best possible for any one-pass
Huffman scheme. In practice it uses fewer bits than all other Huffman
schemes. The algorithm has applications in file compression and network
transmission.
We present a new method for error modeling applicable to the MLP algorithm for hierarchical lossless image compression. This method, based on a concept called the variability index, provides accurate models for pixel prediction errors without requiring explicit transmission of the models. We also use the variability index to show that prediction errors do not always follow the Laplace distribution, as is commonly assumed; replacing the Laplace distribution with a more general symmetric exponential distribution further improves compression. We describe a new compression measurement called compression gain, and we give experimental results showing that the MLP method using the variability index technique for error modeling gives significantly more compression gain than other methods in the literature.
Caching and prefetching are important mechanisms for speeding up access
time to data on secondary storage. Recent work in competitive online
algorithms has uncovered several promising new algorithms for caching.
In this paper, we apply a form of the competitive philosophy for the
first time to the problem of prefetching to develop an optimal universal
prefetcher in terms of fault ratio, with particular applications to
large-scale databases and hypertext systems. Our algorithms for
prefetching are novel in that they are based on data compression
techniques that are both theoretically optimal and good in practice.
Intuitively, in order to compress data effectively, you have to be able
to predict future data well, and thus good data compressors should be
able to predict well for purposes of prefetching. We show for powerful
models such as Markov sources and
th order Markov sources that the
page fault rates incurred by our prefetching algorithms are optimal in
the limit for almost all sequences of page accesses.
Response time delays caused by I/O is a major problem in many systems and database applications. Prefetching and cache-replacement methods are attracting renewed attention because of their success in avoiding costly I/Os. Prefetching can be looked upon as a type of online sequential prediction, where the predictions must be accurate as well as made in a computationally efficient way. Unlike other online problems, prefetching cannot admit a competitive analysis, since the optimal offline prefetcher incurs no cost when it knows the future page requests. Previous analytical work on prefetching by Vitter and Krishnan consisted of modeling the user as a probabilistic Markov source.
In this paper, we look at the much stronger form of worst-case analysis and derive a randomized algorithm that we prove analytically converges almost surely to the optimal fault rate in the worst case for every sequence of page request with respect to the important class of finite state prefetchers. In particular, we make no assumption about how the sequence of page requests is generated. This analysis model can be looked upon as a generalization of the competitive framework, in that it compares an online algorithm in a worst-case manner over all sequences against a powerful yet non-clairvoyant opponent. We simultaneously achieve the computational goal of implementing our prefetcher in optimal constant expected time per prefetched page, using the optimal dynamic discrete random variate generator of Matias, Vitter, and Ni.
We show that high-resolution images can be encoded and decoded efficiently in parallel. We present an algorithm based on the hierarchical MLP method, used either with Huffman coding or with a new variant of arithmetic coding called quasi-arithmetic coding. The coding step can be parallelized, even though the codes for different pixels are of different lengths; parallelization of the prediction and error modeling components is straightforward.
We present new vector quantization algorithms based on the theory developed by the authors. The new approach is to formulate a vector quantization problem as a 0-1 integer linear program. We first solve its relaxed linear program by linear programming techniques. Then we transform the linear program solution into a provably good solution for the vector quantization problem. These methods lead to the first known polynomial-time full-search vector quantization codebook design algorithm and tree pruning algorithm with provable worst-case performance guarantees. We also introduce the notion of pseudo-random pruned tree-structured vector quantizers. Initial experimental results on image compression are very encouraging.
We give a detailed algorithm for fast text compression. Our algorithm, related to the PPM method, simplifies the modeling phase by eliminating the escape mechanism, and speeds up coding by using a combination of quasi-arithmetic coding and Rice coding. We provide details of the use of quasi-arithmetic code tables, and analyze their compression performance. Our Fast PPM method is shown experimentally to be almost twice as fast as the PPMC method, while giving comparable compression.
We present a new method for lossless image compression that gives compression comparable to JPEG lossless mode with about five times the speed. Our method, called FELICS, is based on a novel use of two neighboring pixels for both prediction and error modeling. For coding we use single bits, adjusted binary codes, and Golomb or Rice codes. For the latter we present and analyze a provably good method for estimating the single coding parameter. (Note: This method is the foundation for the subsequently developed state-of-the-art methods now used for lossless image compression.)
We present a method for progressive lossless compression of still grayscale images that combines the speed of our earlier FELICS method with the progressivity of our earlier MLP method. We use MLP's pyramid-based pixel sequence, and image and error modeling and coding based on that of FELICS. In addition, we introduce a new prefix code with some advantages over the previously used Golomb and Rice codes. Our new progressive method gives compression ratios and speeds similar to those of non-progressive FELICS and those of JPEG lossless mode, also a non-progressive method.
The image model in Progressive FELICS is based on a simple function of four nearby pixels. We select two of the four nearest known pixels, using the two with the middle (non-extreme) values. Then we code the pixel's intensity relative to the selected pixels, using single bits, adjusted binary codes, and simple prefix codes like Golomb codes, Rice codes, or the new family of prefix codes introduced here. We estimate the coding parameter adaptively for each context, the context being the absolute value of the difference of the predicting pixels; we adjust the adaptation statistics at the beginning of each level in the progressive pixel sequence.
Arithmetic coding provides an effective mechanism for removing redundancy in the encoding of data. We show how arithmetic coding works and describe an efficient implementation that uses table lookup as a fast alternative to arithmetic operations. The reduced-precision arithmetic has a provably negligible effect on the amount of compression achieved. We can speed up the implementation further by use of parallel processing. We discuss the role of probability models and how they provide probability information to the arithmetic coder. We conclude with perspectives on the comparative advantages and disadvantages of arithmetic coding.
We compare methods for choosing motion vectors for motion-compensated
video compression. Our primary focus is on videophone and
videoconferencing applications, where very low bit rates are necessary,
where the motion is usually limited, and where the frames must be coded
in the order they are generated. We provide evidence, using established
benchmark videos of this type, that choosing motion vectors to minimize
codelength subject to (implicit) constraints on quality yields
substantially better rate-distortion tradeoffs than minimizing notions
of prediction error. We illustrate this point using an algorithm within
the
standard. We show that using quadtrees to code the motion
vectors in conjunction with explicit codelength minimization yields
further improvement. We describe a
dynamic-programming algorithm for choosing a quadtree to minimize the
codelength.
Motivated by the desire to find text compressors that compress better than existing dictionary methods, but run faster than PPM implementations, we describe methods for text compression using multiple dictionaries, one for each context of preceding characters, where the contexts have varying lengths. The context to be used is determined using an escape mechanism similar to that of PPM methods. We describe modifications of three popular dictionary coders along these lines and experiments evaluating their efficacy using the text files in the Calgary corpus. Our results suggest that modifying LZ77 along these lines yields an improvement in compression of about 4%, that modifying LZFG yields a compression improvement of about 8%, and that modifying LZW in this manner yields an average improvement on the order of 12%.
We present and compare methods for choosing motion vectors for
motion-compensated video coding. Our primary focus is on videophone and
videoconferencing applications, where very low bit rates are necessary,
where motion is usually limited, and where frames must be coded in the
order they are generated. We provide evidence, using established
benchmark videos typical of these applications, that choosing motion
vectors explicitly to minimize rate, subject to implicit constraints on
distortion, yields better rate-distortion tradeoffs than minimizing
notions of prediction error. Minimizing a linear combination of rate
and distortion results in further rate-distortion improvements. Using a
heuristic function of the prediction error and the motion vector
codelength results in compression performance comparable to the more
computationally intensive coders while running much faster. We
incorporate these ideas into coders that operate within the
standard.
We present and compare methods for choosing motion vectors for block-based motion-compensated video coding. The primary focus is on videophone and video-conferencing applications, where low bit rates are necessary, where motion is usually limited, and where the amount of computation is also limited. In a typical block-based motion-compensated video coding system, motion vectors are transmitted along with a lossy encoding of the residuals. As the bit rate decreases, the proportion required to transmit the motion vectors increases. We provide experimental evidence that choosing motion vectors explicitly to minimize rate (including motion vector coding), subject to implicit constraints on distortion, yields better rate-distortion tradeoffs than minimizing some measure of prediction error. Minimizing a combination of rate and distortion yields further improvements. Although these explicit-minimization schemes are computationally intensive, they provide invaluable insight which we use to develop practical algorithms. We show that minimizing a simple heuristic function of the prediction error and the motion vector code-length results in rate-distortion performance comparable to explicit-minimization schemes while being computationally feasible. Experimental results are provided for coders that operate within the H.261 standard.
We consider re-representing the alphabet so that a representation of a character reflects its properties as a predictor of future text. This enables us to use an estimator from a restricted class to map contexts to predictions of upcoming characters. We describe an algorithm that uses this idea in conjunction with neural networks. The performance of this implementation is compared to other compression methods, such as UNIX compress, gzip, PPMC, and an alternative neural network approach.
Video belongs to a class of information called continuous media. Continuous media is characterized by the essentially continuous manner in which the information is presented. This is in contrast to discrete media, in which there is no essential temporal component. Text, images, and graphics are examples of discrete media, while movies, sound, and computer animation are examples of continuous media. Even though a slide show is a time-based presentation of images, it is not a continuous medium since each image is viewed as an individual item. On the other hand, a video clip, while also consisting of a sequence of images, is a continuous medium since each image is perceived in the context of past and future images.
With continuous media, therefore, the temporal dimension becomes important. For example, a video sequence compressed with a constant image quality for every frame is often more desirable than one in which the image quality varies noticeably over time. However, because the compressibility of individual frames varies over time, maintaining a constant image quality results in a variation in coding rate over time. The process of controlling the coding rate to meet the requirements of a transmissions channel or storage device, while maintaining a desired level of quality, is called bit rate control. In this monograph, we focus on the rate control of compressed video. Specifically, we present a new framework for allocating bits to the compression of pictures in a video sequence.
Existing optimal rate control techniques typically regulate the coding rate to minimize a sum-distortion measure. While these techniques can leverage the wealth of tools from least-mean-square optimization theory, they do not guarantee constant-quality video, an objective often mentioned in the literature. In this book, we develop a framework that casts rate control as a resource allocation problem with continuous variables, nonlinear constraints, and a novel lexicographic optimality criterion that is motivated for uniform video quality. With the lexicographic criterion, we propose a new concept of coding efficiency to better reflect the constancy in quality that is generally desired from a video coder.
Rigorous analysis within this framework reveals a set of necessary and sufficient conditions for optimality for coding at both constant and variable bit rates. With these conditions, we are able to construct polynomial-time algorithms for optimal bit rate control. Experimental implementations of these algorithms confirm the theoretical analysis and produce encodings that are more uniform in quality than that achieved with existing rate control methods. As evidence of the generality and flexibility of the framework, we show how to extend the framework to allocate bits among multiple variable bit rate bitstreams that are to be transmitted over a common constant bit rate channel and to encompass the case of discrete variables.
Slides for talk (Adobe pdf format)
The proliferation of online text, such as on the World Wide Web and
in databases, motivates the need for space-efficient index methods
that support fast search. Consider a text
of
binary symbols
to index. Given any query pattern
of
binary symbols, the
goal is to search for
in
quickly, with
being fully
scanned only once, namely, when the index is created. All indexing
schemes published in the last thirty years support searching in
worst-case time and require
memory words (or
bits), which is significantly larger than the
text itself. In this paper we provide a breakthrough both in
searching time and index space under the same model of computation
as the one adopted in previous work. Based upon new compressed
representations of suffix arrays and suffix trees, we construct an
index structure that occupies only
bits and compares
favorably with inverted lists in space. We can search any binary
pattern
, stored in
words, in only
time.
Specifically, searching takes
time for
, and
time for
and any fixed
. That is, we achieve optimal
search time for sufficiently large
.
We can list all the
pattern occurrences in optimal
additional time when
or when
; otherwise, listing takes
additional time.
We present a novel implementation of compressed suffix arrays exhibiting
new tradeoffs between search time and space occupancy for a given text
(or sequence) of
symbols over an alphabet
, where each
symbol is encoded by
bits. We show that compressed
suffix arrays use just
bits,
while retaining full text indexing functionalities, such as searching
any pattern sequence of length
in
time. The term
denotes the
th-order empirical
entropy of the text, which means that our index is nearly optimal in
space apart from lower-order terms, achieving asymptotically the
empirical entropy of the text (with a multiplicative constant 1).
If the text is highly compressible so that
and the alphabet
size is small, we obtain a text index with
search time that
requires only
bits. We also report further results and tradeoffs
on high-order entropy-compressed text indexes.
We report on a new and improved version of high-order entropy-compressed suffix arrays, which has theoretical performance guarantees comparable to previous work, yet represents an improvement in practice. Our experiments indicate that the resulting text index offers state-of-the-art compression. In particular, we require roughly 20% of the original text size -- without requiring a separate instance of the text -- and support fast and powerful searches. To our knowledge, this is the best known method in terms of space for fast searching. We can additionally use a simple notion to encode and decode block-sorting transforms (such as the Burrows-Wheeler transform), achieving a slightly better compression ratio than bzip2. We also provide a compressed representation of suffix trees (and their associated text) in a total space that is comparable to that of the text alone compressed with gzip.
We report on a simple encoding format called wzip for
decompressing block-sorting transforms, such as the Burrows-Wheeler
Transform (BWT). Our compressor uses the simple notions of gamma
encoding and RLE organized with a wavelet tree to achieve a slightly
better compression ration than bzip2 in less time. In fact, our
compression/decompression time is dependent upon
, the empirical
th order entropy. Another key contribution of our compressor is its
simplicity. Our compressor can also operate as a full-text index with a
small amount of data, while still preserving backward compatibility with
just the compressor.
We propose measures for compressed data
structures, in which space usage is measured in a data-aware manner. In
particular, we consider the fundamental dictionary problem on
set data, where the task is to construct a data structure to
represent a set
of
items out of a universe
and support various queries on
. We use a well-known
data-aware measure for set data called gap to bound the space of
our data structures. We describe a novel dictionary structure taking
bits. Under the RAM
model, our dictionary supports membership, rank, select, and predecessor
queries in nearly optimal time, matching the time bound of Andersson and
Thorup's predecessor structure, while simultaneously improving upon their
space usage. Our dictionary structure uses exactly gap bits in the
leading term (i.e., the constant factor is
) and answers queries in
near-optimal time. When seen from the worst case perspective, we present
the first
-bit dictionary structure which supports these
queries in near-optimal time under RAM model. We also build a dictionary
which requires the same space and supports membership, select, and
partial rank queries even more quickly in
time. To the
best of our knowledge, this is the first of a kind result which achieves
data-aware space usage and retains near-optimal time.
We present an experimental study of the space-time tradeoffs for the dictionary problem. Our primary goal is to reduce the space requirement for storing a dictionary data structure. Many compression schemes have been developed for dictionaries, which fall generally in the categories of combinatorial encodings and data-aware methods and still support queries efficiently. We show that for many real-world datasets, data-aware methods lead to a worthwhile compression over combinatorial methods. Additionally, we design a new data-aware building block structure called BSGAP that presents improvements over other data-aware methods.
We present a unified algorithmic framework to obtain nearly
optimal space
bounds for text compression and compressed text indexing,
apart from
lower-order terms. For a text
of
symbols drawn from
an
alphabet
, our bounds are stated in terms of the
th-order
empirical entropy of the text,
. In particular, we
provide a tight
analysis of the Burrows-Wheeler transform (BWT)
establishing a bound of
bits, where
denotes
the
asymptotic number of bits required to store the empirical
statistical
model for contexts of order up to
appearing
in
. Using the same
framework, we also obtain an implementation of the
compressed suffix array
(CSA) which achieves
bits of space while still retaining competitive
full-text indexing
functionality.
The novelty of the proposed framework lies in its use of the
finite set model instead of the empirical probability model
(as in previous work), giving us new insight into the design
and analysis of our algorithms. For example, we show that
our analysis gives improved bounds since
,
where
and
do not depend on the
text length
, while
is the modified
th-order empirical entropy of
. Moreover, we show a
strong relationship between a compressed full-text index and
the succinct dictionary problem. We also examine the
importance of lower-order terms, as these can dwarf any
savings achieved by high-order entropy. We report further
results and tradeoffs on high-order entropy-compressed text
indexes in the paper.
Run-Length-Encoding (RLE) is a data compression technique
that is used in various applications, e.g., biological se-
quence databases, multimedia, and facsimile transmission.
One of the main challenges is how to operate, e.g.,
indexing, searching, and retrieval, on the compressed data
without decompressing it. In this paper, we present the
String B-tree for Compressed sequences, termed the SBC-tree,
for indexing and searching RLE-compressed sequences of
arbitrary length. The SBC-tree is a two-level index
structure based on the well-known String B-tree and a
3-sided range query structure. The SBC-tree supports
substring as well as prefix matching, and range search
operations over RLE-compressed sequences. The SBC-tree has
an optimal external-memory space complexity of
pages,
where
is the total length of the compressed sequences, and
is the disk page size. The insertion and deletion of all
suffixes of a compressed sequence of length m takes
I/O operations. Substring matching, prefix
matching, and range search execute in an optimal
I/O operations, where
is the length of the
compressed query pattern and
is the query output size. We
present also two variants of the SBC-tree: the SBC-tree that
is based on an R-tree instead of the 3-sided structure, and
the one-level SBC-tree that does not use a two-dimensional
index.
These variants do not have provable worst-case theoretical
bounds for search operations, but perform well in practice.
The SBC-tree index is realized inside PostgreSQL in the
context of a biological protein database
application. Performance results illustrate that using the
SBC-tree to index
RLE-compressed sequences achieves up to an order of
magnitude reduction in storage, up to 30% reduction in I/Os
for
the insertion operations, and retains the optimal search
performance achieved by the String B-tree over the uncompressed sequences.
We consider a central problem in text indexing: Given a
text
over an alphabet
, construct a compressed
data structure answering the queries
,
, and
for a symbol
.
Many data structures consider these queries for static
text
. We consider the dynamic version of the problem,
where we are allowed to insert and delete symbols at
arbitrary positions of
. This problem is a key challenge
in compressed text indexing and has direct application to
dynamic XML indexing structures that answer subpath
queries [XBW].
We build on the results of [RRR, GMR] and give the best
known query bounds for the dynamic version of this problem,
supporting arbitrary insertions and deletions of symbols
in
. Specifically, with an amortized update time of
, we suggest how to support
,
, and
queries in
time, for any
.
The best previous query times for this problem were
, given by [Makinen Navarro]. Our bounds
are competitive with state-of-the-art static structures
[GMR]. Some applicable lower bounds for the partial sums
problem [PD] show that our update/query tradeoff is also
nearly optimal. In addition, our space bound is competitive
with the corresponding static structures. For the special
case of bitvectors (i.e.,
), we also show the
best tradeoffs for query/update time, improving upon the
results of [Makinen Navarro, Hon, RRR].
Our focus on fast query/slower update is
well-suited for a query-intensive XML indexing environment.
Using the XBW transform [XBW], we also present a
dynamic data structure that succinctly maintains an ordered
labeled tree
and supports a powerful set of queries
on
.
We present a framework to dynamize succinct data structures,
to encourage their use over non-succinct versions in a wide
variety of important application areas. Our framework can
dynamize most state-of-the-art succinct data structures for
dictionaries, ordinal trees, labeled trees, and text
collections. Of particular note is its direct application to
XML indexing structures that answer
queries. Our
framework focuses on achieving information-theoretically
optimal space along with near-optimal update/query bounds.
As the main part of our work, we consider the following
problem central to text indexing: Given a text
over an
alphabet
, construct a compressed data structure
answering the queries
,
, and
for a symbol
. Many data
structures consider these queries for static text
. We
build on these results and give the best known query bounds
for the dynamic version of this problem, supporting
arbitrary insertions and deletions of symbols in
.
Specifically, with an amortized update time of
, any static succinct data structure
for
, taking
time for queries, can be converted by our
framework into a dynamic succinct data structure that
supports
,
,
and
queries in
time, for any constant
. When
, we achieve
query times. Our update/query bounds are near-optimal
with respect to the lower bounds.
We present a unified algorithmic framework to obtain nearly
optimal space bounds for text compression and compressed
text indexing, apart from lower-order terms. For a text
of
symbols drawn from an alphabet
, our bounds are stated in
terms of the
th-order empirical entropy of the text,
. In
particular, we provide a tight analysis of the
Burrows-Wheeler transform (BWT) establishing a bound of
bits, where
denotes the asymptotical
number of bits required to store the empirical statistical
model for contexts of order
appearing in
. Using the same
framework, we also obtain an implementation of the
compressed suffix array (CSA) that achieves
bits of space while still retaining
competitive full-text indexing functionality.
The novelty
of the proposed framework lies in its use of the finite set
model instead of the empirical probability model (as in
previous work), giving us new insight into the design and
analysis of our algorithms. For example, we show that our
analysis gives improved bounds since
, where
and
do not depend on
the text length
, while
is the modified
th-order empirical entropy of
. Moreover, we show a strong
relationship between a compressed full-text index and the
succinct dictionary problem. We also examine the importance
of lower-order terms, as these can dwarf any savings
achieved by high-order entropy. We report further results
and tradeoffs on high-order entropy-compressed text indexes
in the paper.
We introduce a new variant of the popular Burrows-Wheeler transform (BWT) called Geometric Burrows-Wheeler Transform (GBWT). Unlike BWT, which merely permutes the text, GBWT converts the text into a set of points in 2-dimensional geometry. Using this transform, we can answer to many open questions in compressed text indexing: (1) Can compressed data structures be designed in external memory with similar performance as the uncompressed counterparts? (2) Can compressed data structures be designed for position restricted pattern matching? We also introduce a reverse transform, called Points2Text, which converts a set of points into text. This transform allows us to derive the best known lower bounds in compressed text indexing. We show strong equivalence between data structural problems in geometric range searching and text pattern matching. This provides a way to derive new results in compressed text indexing by translating the results from range searching.
The past few years have witnessed several exciting results
on compressed representation of a string
that supports
efficient pattern matching, and the space complexity has been
reduced to
bits, where
denotes the
th-order empirical entropy of
, and
is the
size of the alphabet. In this paper we study compressed
representation for another classical problem of string
indexing, which is called dictionary matching in the
literature. Precisely, a collection
of strings (called
patterns) of total length
is to be indexed so that given a
text
, the occurrences of the patterns in
can be found
efficiently. In this paper we show how to exploit a sampling
technique to compress the existing O(n)-word index to an
-bit index with only a small sacrifice
in search time.
Current data structures for searching large string collections are limited in that they either fail to achieve minimum space or they cause too many cache misses. In this paper, we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and storable in compressed space. The widely known frontcoding scheme is one example of linearization; it is at the core of Prefix B-trees and many other disk-conscious compressed indexes for string collections. However, it is largely thought of as a space-effective heuristic without efficient search support.
In this paper, we introduce new insights on front-coding and other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. The second contribution of this paper engineers these linearizations to design a novel dictionary encoding scheme that achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries and obtain a succinct variant, whose space is close to the information-theoretic minimum.
A new trend in the field of pattern matching is to design
indexing data structures
which take the space very close to that required by the
indexed text (in
entropy-compressed form) and also simultaneously achieve
good query performance. Two
popular indexes, namely the FM-index of Ferragina and Manzini and the
CSA of Grossi and Vitter,
achieve this goal by exploiting the Burrows-Wheeler
transform (BWT).
However, due to the intricate permutation structure of BWT,
no locality
of reference can be guaranteed when we perform pattern
matching with these indexes.
Chien et al. gave an alternative text
index which is based
on sparsifying the traditional suffix tree and maintaining
an auxiliary 2-D range query structure.
Given a text
of length
drawn from a
-sized
alphabet set,
they achieved an
-bit index for
and
showed that this index can preserve locality in pattern
matching and hence
is amenable to be used in external-memory settings.
We improve upon this index and show how to apply entropy
compression to reduce index space.
Our index takes
bits of
space where
is the
th-order
empirical entropy of the text.
This is achieved by creating variable length blocks of text
using arithmetic coding.
Given a set
of
strings of
total length
, our task is to report the ``most
relevant'' strings for a given query pattern
. This
involves somewhat more advanced query functionality than the
usual pattern matching, as some notion of ``most relevant''
is involved. In information retrieval literature, this task
is best achieved by using inverted indexes. However,
inverted indexes work only for some predefined set of
patterns. In the pattern matching community, the most
popular pattern-matching data structures are suffix trees
and suffix arrays. However, a typical suffix tree search
involves going through all the occurrences of the pattern
over the entire string collection, which might be a lot more
than the required relevant documents.
The first formal framework to study such kind of retrieval
problems was given by Muthukrishnan. He considered two
metrics for relevance: frequency and proximity. He took a
threshold-based approach on these metrics and gave data
structures taking
words of space. We study
this problem in a slightly different framework of reporting
the top
most relevant documents (in sorted order) under
similar and more general relevance metrics. Our framework
gives linear space data structure with optimal query times
for arbitrary score functions. As a corollary, it improves
the space utilization for the problems considered by
Muthukrishnan while maintaining optimal query performance.
We also develop compressed variants of these data structures
for several specific relevance metrics.
Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are the suffix trees and the suffix arrays, and the most popular external-memory index structure is the string B-tree. However, the practical applicability of these indexes has been limited mainly because of their space consumption and I/O issues. These structures use a lot more space (almost 20 to 50 times more) than the original text data and are often disk-resident.
Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al (2008) presented a compact text index in the external memory based on the concept of Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggested that it may be hard to obtain a good index structure in the external memory.
In this paper, we investigate this issue from a practical point of view. On the positive side we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves space by about an order of magnitude as compared to the standard String B-tree. While saving space, these structures also maintain a comparable I/O efficiency to that of String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures.
Slides for CPM '10 keynote talk (Adobe pdf)
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
This study explores an alternative way of storing text files to answer
exact match queries faster. We decompose the original file into two
parts as filter and payload. The filter part contains the most
informative
bits of each byte, and the remaining bits of the bytes
are concatenated in order of appearance to generate the payload. We
refer to this structure as
-bit filtered format. When an input
pattern is to be searched on the
-bit filtered structure, the same
decomposition is performed on the pattern. The
bits from each byte
of the pattern form the pattern filter bit sequence, and the rest is
the payload. The pattern filter is first scanned on the filter part
of the file. At each match position detected in the filter part, the
pattern payload is verified against the corresponding location in the
payload part of the text. Thus, instead of searching an
-byte
pattern on an
-byte text, the first
bits are scanned on
bits, followed by a verification of
bits on
the respective locations of the matching positions. Experiments
conducted on natural language texts, plain ASCII DNA sequences, and
random byte sequences showed that the search performance with the
proposed scheme is on average two times faster than the tested best
exact pattern matching algorithms. The highest gain is obtained on
plain ASCII DNA sequences. We also developed an effective bitwise
pattern matching algorithm of possible independent interest within
this study.
Background: Genomic read alignment involves mapping (exactly or approximately) short reads from a particular individual onto a pre-sequenced reference genome of the same species. Because all individuals of the same species share the majority of their genomes, short reads alignment provides an alternative and much more efficient way to sequence the genome of a particular individual than does direct sequencing. Among many strategies proposed for this alignment process, indexing the reference genome and short read searching over the index is a dominant technique. Our goal is to design a space-efficient indexing structure with fast searching capability to catch the massive short reads produced by the next generation high-throughput DNA sequencing technology.
Results: We concentrate on indexing DNA sequences via sparse suffix arrays
(SSAs) and propose a new short read aligner named
-RA (PSI-RA:
parallel sparse index read aligner). The motivation in using SSAs is
the ability to trade memory against time. It is possible to fine tune
the space consumption of the index based on the available memory of
the machine and the minimum length of the arriving pattern queries.
Although SSAs have been studied before for exact matching of short
reads, an elegant way of approximate matching capability was missing.
We provide this by defining the rightmost mismatch criteria that
prioritize the errors towards the end of the reads, where errors are
more probable.
-RA supports any number of mismatches in aligning
reads. We give comparisons with some of the well-known short read
aligners, and show that indexing a genome with SSA is a good
alternative to the Burrows-Wheeler transform or seed-based solutions.
Conclusions:
-RA is expected to serve as a valuable tool in the alignment of
short reads generated by the next generation high-throughput
sequencing technology.
-RA is very fast in exact matching and
also supports rightmost approximate matching. The SSA structure that
-RA is built on naturally incorporates the modern multicore
architecture and thus further speed-up can be gained. All the
information, including the source code of
-RA, can be downloaded
at: http://www.busillis.com/o_kulekci/PSIRA.zip.
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8GB internal memory (actual internal memory usage is less than 6GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
Given a set
of
patterns of total length
, the dictionary matching problem
is to index
such that for any query text
, we can locate the occurrences of any
pattern within
efficiently. This problem can be solved in optimal
time by
the classical AC automaton (Aho and Corasick, 1975) where
denotes the number of
occurrences. The space requirement is
words. In the approximate dictionary match-
ing problem with one error, we consider a substring of
an occurrence of
whenever
the edit distance between
and
is at most 1. For this problem, the best known
indexes are by Cole et al. (2004), which requires
words of space and reports
all occurrences in
time, and by Ferragina et al. (1999), which
requires
words of space and reports all occurrences in
time.
Recently, there have been successes in compressing the dictionary matching index while
keeping the query time optimal (Belazzougui, 2010; Hon et al., 2010). However, a com-
pressed index for approximate dictionary matching problem is still open. In this paper,
we propose the rst such index which requires an optimal
-bit
index space, where
denotes the
th-order empirical entropy of
,
and is the size of
alphabet set from which all the characters in
and
are drawn. The query time of our
index is
.
The wavelet tree data structure is a space-efficient technique
for rank and select queries that generalizes from binary characters to
an arbitrary multicharacter alphabet. It has become a key tool in
modern full-text indexing and data compression because of its
capabilities in compressing, indexing, and searching. We present a
comparative study of its practical performance regarding a wide range
of options on the dimensions of different coding schemes and tree
shapes. Our results are both theoretical and experimental:
(1) We show that the run-length
coding size of wavelet trees
achieves the 0-order empirical entropy size of the original string
with leading constant 1, when the string's 0-order empirical entropy
is asymptotically less than the logarithm of the alphabet size. This
result complements the previous works that are dedicated to analyzing
run-length
-encoded wavelet trees. It also reveals the
scenarios when run-length
encoding becomes practical.
(2) We introduce a full generic package of wavelet trees for a wide
range of options on the dimensions of coding schemes and tree shapes.
Our experimental study reveals the practical performance of the
various modifications.
Inverted indexes are the most fundamental and widely used
data structures in information retrieval. For each unique
word occurring in a document collection, the inverted index
stores a list of the documents in which this word occurs.
Compression techniques are often applied to further reduce
the space requirement of these lists. However, the index
has a shortcoming, in that only predefined pattern queries
can be supported efficiently. In terms of string documents
where word boundaries are undefined, if we have to index all
the substrings of a given document, then the storage quickly
becomes quadratic in the data size. Also, if we want to apply
the same type of indexes for querying phrases or sequence of
words, then the inverted index will end up storing redundant
information. In this paper, we show the first set of inverted
indexes which work naturally for strings as well as phrase
searching. The central idea is to exclude document
in the
inverted list of a string
if every occurrence of
in
is
subsumed by another string of which
is a prefix. With this
we show that our space utilization is close to the optimal.
Techniques from succinct data structures are deployed to
achieve compression while allowing fast access in terms of
frequency and document id based retrieval. Compression
and speed tradeoffs are evaluated for different variants of
the proposed index. For phrase searching, we show that our
indexes compare favorably against a typical inverted index
deploying position-wise intersections. We also show efficient
top-
based retrieval under relevance metrics like frequency
and tf-idf.
T.-H. Ku, W.-K. Hon, R. Shah, S. V. Thankachan, and J. S. Vitter. ``Compressed Text Indexing With Wildcards,'' in preparation. An extended abstract appears in Proceedings of the 18th International Conference on String Processing and Information Retrieval (SPIRE '11), Pisa, Italy, October 2011, published in Lecture Notes in Computer Science, Springer, Berlin, Germany.
Let
be a text of
total length
, where characters of each
are chosen from an
alphabet
of size
, and
denotes a wildcard
symbol. The text indexing with wildcards problem is to index
such
that when we are given a query pattern
, we can locate the
occurrences of
in
efficiently. This problem has been applied
in indexing genomic sequences that contain single-nucleotide
polymorphisms (SNP) because SNP can be modeled as wildcards. Recently
Tam et al. (2009) and Thachuk (2011) have proposed succinct indexes
for this problem. In this paper, we present the first compressed
index for this problem, which takes only
bits space, where
is the
th-order
empirical entropy (
) of
.