Abstract
This thesis deals with space-efficient algorithms to compress and index texts. The aim
of compression is to reduce the size of a text by exploiting regularities such as repeti-
tions or skewed character distributions. Indexing, on the other hand, aims at accelerating
pattern matching queries (such as locating all occurrences of a pattern) with the help of
data structures (indexes) on the text. Despite being apparently distant concepts, com-
pression and indexing are deeply interconnected: both exploit the inner structure of the
text to, respectively, reduce its size and speed up pattern matching queries. It should not
be surprising, therefore, that compression and indexing can be “forced” (actually, quite
naturally) to coincide: compressed full-text indexes support fast pattern matching queries
while occupying almost the same space as the compressed text.
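As a toy illustration of this connection between repetitiveness and compressibility (an example for intuition, not taken from the thesis), a simple run-length encoder shrinks a highly repetitive string to a handful of runs:

```python
def rle(s):
    """Encode maximal runs of equal characters as [char, length] pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1][1] += 1          # extend the current run
        else:
            out.append([ch, 1])      # start a new run
    return out

text = "a" * 1000 + "b" * 500 + "a" * 1000
encoded = rle(text)
print(len(text), len(encoded))       # 2500 characters vs 3 runs
```

The same regularity that makes the text compressible is what a compressed index exploits to answer queries without storing the text verbatim.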
In the last two decades, several compressed text indexes based on the most efficient
text compressors have been designed. These indexes can be roughly classified into two main
categories: those based on suffix-sorting (Burrows-Wheeler transform indexes, compressed
suffix arrays/trees) and those based on the replacement of repetitions by pointers (Lempel-
Ziv indexes, grammar indexes, block trees). Indexes based on a combination of the two
techniques have also been proposed. In general, suffix sorting-based methods support very
fast queries at the expense of space usage. This is due to several factors, ranging from
weak compression methods (e.g. entropy compression, used in early FM-indexes, is not
able to capture long repetitions), to heavy structures (e.g. suffix array sampling) stored alongside the compressed text representation to speed up queries. The second class of indexes, on the other hand, offers strong compression rates, achieving up to exponential compression on very repetitive texts, at the expense of query times, which are often quadratic in the pattern length or, in the worst case, linear in the text length.
Among the most used compression techniques, run-length compression of the Burrows-
Wheeler transform and Lempel-Ziv parsing (LZ77) have proved superior at compressing
very repetitive datasets. In this thesis we show that these two tools
can be combined into a single index that gathers the best features of the above-discussed
indexes: fast queries (linear in the pattern length and logarithmic in the text length), and
strong compression rates (up to exponential compression can be achieved). We describe an
efficient implementation of our index and compare it with state-of-the-art alternatives. Our
solution turns out to be as space-efficient as the lightest index described in the literature
while supporting queries up to three orders of magnitude faster.
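To make the two compression measures concrete, the following sketch (an illustration, not the thesis's implementation: it builds the BWT by naive rotation sorting and uses a simplified, non-self-referential greedy LZ77 parse) counts BWT equal-letter runs and LZ77 factors on a repetitive string; both stay small as more copies are appended:

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations; '#' is assumed to be
    a unique terminator smaller than every text character (real
    implementations use suffix sorting instead)."""
    s += "#"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def count_runs(s):
    """Number of maximal equal-letter runs in s."""
    return 1 + sum(a != b for a, b in zip(s, s[1:]))

def count_lz77_factors(s):
    """Greedy LZ77 parse, simplified so that each factor's source lies
    strictly in the already-parsed prefix (no self-references). A factor
    is the longest previous match plus one fresh character."""
    i, z = 0, 0
    while i < len(s):
        l = 0
        while i + l < len(s) and s[i:i + l + 1] in s[:i]:
            l += 1
        i += l + 1
        z += 1
    return z

t = "abracadabra" * 50            # a very repetitive text
r = count_runs(bwt(t))
z = count_lz77_factors(t)
print(len(t), r, z)               # both r and z stay tiny as copies are added
```

The index described in the thesis is parameterized by exactly these two quantities: the number of runs r of the run-length BWT and the number of factors z of the LZ77 parse.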
Apart from index definition and design, a third concern regards index construction
complexity. Often, the input text is too big to be fully loaded into main memory. Even
when this is feasible, classic compression/indexing algorithms use heavy data structures
such as suffix trees/arrays which can easily take several times the space of the text. This is
unsatisfactory, especially in cases where (i) the text is streamed and not stored anywhere
(e.g. because of its size) and (ii) the compressed text is small enough to fit into main
memory. A further contribution of this thesis consists of five algorithms that compress text within compressed working space and two recompression techniques (i.e. algorithms to convert between different compression formats without full decompression). The complete picture we offer consists of a set of algorithms to space-efficiently convert among:
• the plain text,
• two compressed self-indexes, and
• three compressed-file formats (entropy, LZ77, and run-length BWT).
The general idea behind all our compression algorithms is to read text characters from
left to right and build a compressed dynamic Burrows-Wheeler transform of the reversed
text. This structure is augmented with a dynamic suffix array sampling to support fast
locate queries on text substrings. We employ three types of suffix array sampling: (i) evenly-spaced,
(ii) based on Burrows-Wheeler transform equal-letter runs, and (iii) based on Lempel-Ziv
factors. Strategy (i) allows us to achieve entropy-compressed working space. Strategies
(ii) and (iii) are novel and achieve a space usage proportional to the output size
(i.e. the compressed file/index).
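The left-to-right construction can be sketched as follows (a minimal illustration under simplifying assumptions: a plain Python string stands in for the dynamic compressed BWT, '#' is a terminator smaller than every text character, and no suffix array samples are maintained). Prepending each new character to the represented string amounts to one LF-mapping step: overwrite the terminator with the character, then reinsert the terminator at its mapped position.

```python
def extend(bwt, p, c):
    """From BWT(S#) with '#' at index p, return (BWT(cS#), new '#' index).

    Assumes '#' compares smaller than every text character."""
    b = bwt[:p] + c + bwt[p + 1:]          # overwrite '#' with c
    smaller = 1 + sum(x < c for x in b)    # '#' plus text symbols < c
    rank = b[:p + 1].count(c)              # occurrences of c in b[0..p]
    lf = smaller + rank - 1                # LF-mapping of position p
    return b[:lf] + "#" + b[lf:], lf

def online_bwt_of_reverse(text):
    """Read text left to right; each step prepends the next character to
    the represented string, so the maintained BWT is that of the reversed
    prefix read so far."""
    bwt, p = "#", 0
    for c in text:
        bwt, p = extend(bwt, p, c)
    return bwt

print(online_bwt_of_reverse("banana"))     # BWT of "ananab#" -> bnn#aaa
```

In the thesis, the string operations above are replaced by a dynamic run-length-compressed sequence supporting rank, select, and insert in logarithmic time, which is what keeps the working space proportional to the compressed output.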
As a last contribution of this thesis, we turn our attention to a practical and usable
implementation of our suite of algorithmic tools. We introduce DYNAMIC, an open-source
C++11 library implementing dynamic compressed data structures. We prove almost-optimal theoretical bounds for the resources used by our structures, and show that these bounds are tightly matched in practice. The implementation of
the compression algorithms described in this thesis using DYNAMIC meets our expectations:
on repetitive datasets our solutions turn out to be up to three orders of magnitude more
space-efficient than state-of-the-art algorithms performing the same tasks.
Rights: open access