Travis Gagie

Dalhousie University

H-index: 31

North America-Canada

About Travis Gagie

Travis Gagie is a researcher at Dalhousie University who specializes in data structures and data compression. He has an h-index of 31 overall and 21 since 2020.

Travis Gagie Information

University

Dalhousie University

Position

Associate Professor

Citations(all)

3353

Citations(since 2020)

1887

Cited By

2189

hIndex(all)

31

hIndex(since 2020)

21

i10Index(all)

81

i10Index(since 2020)

54

University Profile Page

Dalhousie University

Travis Gagie Skills & Research Interests

data structures

data compression

Top articles of Travis Gagie

Faster MEM-finding in space

Authors

Travis Gagie

Journal

arXiv preprint arXiv:2403.02008

Published Date

2024/3/4

Suppose we are given a text , a straight-line program with rules for and an assignment of tags to the characters in such that the Burrows-Wheeler Transform of has runs, the Burrows-Wheeler Transform of the reverse of has runs and the tag array -- the list of tags in the lexicographic order of the suffixes starting at the characters the tags are assigned to -- has runs. If the alphabet size is at most polylogarithmic in then there is an -space index for such that when we are given a pattern we can compute the maximal exact matches (MEMs) of with respect to in time plus time per MEM and then list the distinct tags assigned to the first characters of occurrences of that MEM in constant time per tag listed, all correctly with high probability.

Stronger compact representations of object trajectories

Authors

Adrián Gómez-Brandón,Gonzalo Navarro,José R Paramá,Nieves R Brisaboa,Travis Gagie

Journal

Geo-spatial Information Science

Published Date

2024/2/10

GraCT and ContaCT were the first compressed data structures to represent object trajectories, demonstrating that it was possible to use orders of magnitude less space than classical indexes while staying competitive in query times. In this paper we considerably enhance their space, query capabilities, and time performance with three contributions. (1) We design and evaluate algorithms for more sophisticated nearest neighbor queries, finding the trajectories closest to a given trajectory or to a given point during a time interval. (2) We modify the data structure used to sample the spatial positions of the objects along time. This improves the performance on the classic spatio-temporal and the nearest neighbor queries, by orders of magnitude in some cases. (3) We introduce RelaCT, a tradeoff between the faster and larger ContaCT and the smaller and slower GraCT, offering a new relevant space-time tradeoff for large …

Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

Authors

Dominika Draesslerová,Omar Ahmed,Travis Gagie,Jan Holub,Ben Langmead,Giovanni Manzini,Gonzalo Navarro

Journal

arXiv preprint arXiv:2402.06935

Published Date

2024/2/10

For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any -tuple, for a parameter ; a minimizer digest; a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.

Pfp-fm: an accelerated FM-index

Authors

Aaron Hong,Marco Oliva,Dominik Köppl,Hideo Bannai,Christina Boucher,Travis Gagie

Journal

Algorithms for Molecular Biology

Published Date

2024/4/10

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing—which takes parameters that let us tune the average length of the phrases—instead of induced suffix sorting, gives a significant speedup for patterns of only a few …
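
The remark that a search takes at least one random access per pattern character can be illustrated with a minimal character-based backward search. This is a sketch under stated assumptions, not the PFP-FM implementation: the function names `bwt`, `build_index` and `count` are hypothetical, the BWT is built by sorting rotations (quadratic, for illustration only), and the `occ` table is stored uncompressed.

```python
def bwt(text):
    """BWT of text + '$' via sorted rotations (quadratic, for illustration only)."""
    s = text + "$"  # assumes '$' is absent from text and sorts before every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_index(text):
    """Return (C, occ, n): the tables a character-based backward search needs."""
    last_column = bwt(text)
    alphabet = sorted(set(last_column))
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += last_column.count(c)
    # occ[c][i] = number of occurrences of c in last_column[:i].
    occ = {c: [0] for c in alphabet}
    for ch in last_column:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (c == ch))
    return C, occ, len(last_column)


def count(pattern, index):
    """Count occurrences of pattern, taking one backward step per pattern character."""
    C, occ, n = index
    lo, hi = 0, n  # half-open interval of suffix-array rows
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]  # each step reads two scattered occ entries
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("GATTACAGATTACA")
print(count("GATTACA", index))
```

Each loop iteration touches positions of `occ` that are unrelated to the previous ones, which is the per-character cost that word-based indexing tries to amortize over whole phrases.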

Wheeler Maps

Authors

Adrián Goga,Andrej Baláž,Travis Gagie,Simon Heumos,Gonzalo Navarro,Alessia Petescia,Jouni Sirén

Journal

Workshop on Bioinformatics and Computational Biology WBCB 2023

Published Date

2023/9/22

Motivated by the challenges in pangenomic read alignment, we propose a generalization of Wheeler graphs [2] that we call Wheeler maps. A Wheeler map stores a text T[1..n] and an assignment of tags to the characters of T such that we can preprocess a pattern P[1..m] and then, given i and j, quickly return all the distinct tags labelling the first characters of the occurrences of P[i..j] in T. For the applications that most interest us, characters with long common contexts are likely to have the same tag, so we consider the number t of runs in the list of tags sorted by their characters’ positions in the Burrows-Wheeler Transform (BWT) of T. We show how, given a straight-line program with g rules for T, we can build an O(g + r + t)-space Wheeler map, where r is the number of runs in the BWT of T, with which we can preprocess a pattern P[1..m] in O(m log n) time and then return the k distinct tags for P[i..j] in the optimal O(k) time for any given i and j. To this end, we combine the r-index machinery [4] for compressed text indexing with the document listing data structure of Muthukrishnan [3]. Furthermore, for a parameter f fixed at construction time, we show how we can efficiently report all the distinct tags that each label at least f occurrences of P[i..j] in T. In addition, we also provide techniques to efficiently count the number of distinct tags of P[i..j] and to list the top-k most frequent tags that label the occurrences of P[i..j].
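
The document-listing idea credited to Muthukrishnan above can be sketched as follows: store, for each position of the tag array, the previous position holding the same tag, and use range-minimum queries to report each distinct tag in a range exactly once. This is an uncompressed illustration, not the O(g + r + t)-space structure of the paper; the names `build_prev`, `SparseTableArgmin` and `distinct_tags` are hypothetical, and the sparse table used for range minima takes O(n log n) words.

```python
def build_prev(tags):
    """prev[i] = index of the previous occurrence of tags[i], or -1 if none."""
    last_seen, prev = {}, []
    for i, tag in enumerate(tags):
        prev.append(last_seen.get(tag, -1))
        last_seen[tag] = i
    return prev


class SparseTableArgmin:
    """Constant-time argmin over inclusive ranges of a static list."""

    def __init__(self, values):
        self.values = values
        n = len(values)
        self.table = [list(range(n))]
        span = 1
        while 2 * span <= n:
            below = self.table[-1]
            row = []
            for i in range(n - 2 * span + 1):
                a, b = below[i], below[i + span]
                row.append(a if values[a] <= values[b] else b)
            self.table.append(row)
            span *= 2

    def argmin(self, lo, hi):
        level = (hi - lo + 1).bit_length() - 1
        a = self.table[level][lo]
        b = self.table[level][hi - (1 << level) + 1]
        return a if self.values[a] <= self.values[b] else b


def distinct_tags(tags, prev, rmq, lo, hi):
    """List each distinct tag in tags[lo..hi] (inclusive) exactly once."""
    found = []
    stack = [(lo, hi)]
    while stack:
        a, b = stack.pop()
        if a > b:
            continue
        m = rmq.argmin(a, b)
        if prev[m] >= lo:
            continue  # every tag in tags[a..b] already occurs earlier in the query range
        found.append(tags[m])
        stack.append((a, m - 1))
        stack.append((m + 1, b))
    return found


tags = ["x", "x", "y", "x", "z", "z", "y"]
prev = build_prev(tags)
rmq = SparseTableArgmin(prev)
print(sorted(distinct_tags(tags, prev, rmq, 1, 5)))
```

In a Wheeler map the query range would be the BWT interval of P[i..j]; here the range is simply given as two indices into a toy tag array.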

Space-efficient conversions from SLPs

Authors

Travis Gagie,Adrián Goga,Artur Jeż,Gonzalo Navarro

Published Date

2024/3/6

We give algorithms that, given a straight-line program (SLP) with g rules that generates (only) a text T[1..n], build within O(g) space the Lempel-Ziv (LZ) parse of T (of z phrases) in time or in time . We also show how to build a locally consistent grammar (LCG) of optimal size from the SLP within space and in time, where is the substring complexity measure of T. Finally, we show how to build the LZ parse of T from such an LCG within space and in time . All our results hold with high probability.

Faster Maximal Exact Matches with Lazy LCP Evaluation

Authors

Adrián Goga,Lore Depuydt,Nathaniel K Brown,Jan Fostier,Travis Gagie,Gonzalo Navarro

Journal

arXiv preprint arXiv:2311.04538

Published Date

2023/11/8

MONI (Rossi et al., JCB 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the operations are constant-time LF-steps but most of the time is spent evaluating LCE queries. In this paper we show how (a variant of) the latter can be evaluated lazily, so as to bound the total time MONI needs to process the pattern in terms of the number of MEMs between the pattern and the text, while maintaining logarithmic latency.
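
To make the two outputs mentioned above concrete, the following brute-force sketch computes matching statistics and derives MEMs from them. It is not how MONI works (there are no LF-steps or LCE queries here), and the function names are hypothetical. It assumes the usual definition: MS[i] is the length of the longest prefix of P[i..] that occurs in T.

```python
def matching_statistics(pattern, text):
    """MS[i] = length of the longest prefix of pattern[i:] occurring in text."""
    ms = []
    length = 0
    for i in range(len(pattern)):
        # MS[i] >= MS[i-1] - 1, so the previous match minus its first character is reused.
        length = max(length - 1, 0)
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


def maximal_exact_matches(pattern, text):
    """Return (start, length) pairs of the MEMs of pattern with respect to text."""
    ms = matching_statistics(pattern, text)
    mems = []
    for i, length in enumerate(ms):
        # The match at i is right-maximal by definition of MS; it is also
        # left-maximal unless the match starting at i - 1 already covers it.
        if length > 0 and (i == 0 or ms[i - 1] <= length):
            mems.append((i, length))
    return mems


print(matching_statistics("TACAGATT", "GATTACA"))
print(maximal_exact_matches("TACAGATT", "GATTACA"))
```

For the pattern TACAGATT and text GATTACA this yields the matching statistics [4, 3, 2, 1, 4, 3, 2, 1] and the two MEMs TACA and GATT, reported as (0, 4) and (4, 4).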

Faster compressed quadtrees

Authors

Guillermo de Bernardo,Travis Gagie,Susana Ladra,Gonzalo Navarro,Diego Seco

Journal

Journal of Computer and System Sciences

Published Date

2023/2/1

Real-world point sets tend to be clustered, so using a machine word for each point is wasteful. In this paper we first show how a compact representation of quadtrees using O(1) bits per node can break this bound on clustered point sets, while offering efficient range searches. We then describe a new compact quadtree representation based on heavy-path decompositions, which supports queries faster than previous compact structures. We present experimental evidence showing that our structure is competitive in practice.

Acceleration of FM-Index Queries Through Prefix-Free Parsing

Authors

Aaron Hong,Marco Oliva,Dominik Köppl,Hideo Bannai,Christina Boucher,Travis Gagie

Journal

arXiv preprint arXiv:2305.05893

Published Date

2023/5/10

FM-indexes are a crucial data structure in DNA alignment, for example, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. Last year, Deng et al. proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer. In this paper we show that using prefix-free parsing -- which takes parameters that let us tune the average length of the phrases -- instead of induced suffix sorting, gives a significant speedup for patterns of only a few hundred characters. We implement our method and demonstrate it is between 3 and 18 times faster than competing methods on queries to GRCh38. It was also consistently faster on queries made to 25,000, 50,000 and 100,000 SARS-CoV-2 genomes. Hence, our method accelerates count queries over all state-of-the-art methods with a minor increase in memory. Our source code is available at https://github.com/marco-oliva/afm.

Dynamic Compact Planar Embeddings

Authors

Travis Gagie,Meng He,Michael St Denis

Journal

String Processing and Information Retrieval: 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26–28, 2023, Proceedings

Published Date

2023/9/19

This paper presents a way to compactly represent dynamic connected planar embeddings, which may contain self loops and multiedges, in 4m + o(m) bits, to support basic navigation in O(lg n) time and edge and vertex insertion and deletion in O(lg^{1+ε} n) time, where n and m are respectively the number of vertices and edges currently in the graph and ε is an arbitrary positive constant. Previous works on dynamic succinct planar graphs either consider decremental settings only or are restricted to triangulations where the outer face must be a simple polygon and all inner faces must be triangles. To the best of our knowledge, this paper presents the first representation of dynamic compact connected planar embeddings that supports a full set of dynamic operations without restrictions on the sizes or shapes of the faces.

Movi: a fast and cache-efficient full-text pangenome index

Authors

Mohsen Zakeri,Nathaniel K Brown,Omar Y Ahmed,Travis Gagie,Ben Langmead

Journal

bioRxiv

Published Date

2023/11/5

Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the “move structure” was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While Movi’s index is larger than the r-index, it scales at a smaller rate for pangenome references, as its size is exactly proportional to r, the number of runs in the BWT of the reference. Movi can compute sophisticated matching queries needed for classification, such as pseudo-matching lengths, at least ten times faster than the fastest available methods. Movi achieves this …

Building a Pangenome Alignment Index via Recursive Prefix-Free Parsing

Authors

Marco Oliva,Travis Gagie,Christina Boucher

Journal

bioRxiv

Published Date

2023/1/27

Motivation: Pangenomic alignment has emerged as an opportunity to reduce bias in biomedical research. Traditionally, short read aligners, such as Bowtie and BWA, were used to index a single reference genome, which was then used to find approximate alignments of reads to that genome. Unfortunately, these methods can only index a small number of genomes due to the linear-memory requirement of the algorithms used to construct the index. Although there are a couple of emerging pangenome aligners that can index a larger number of genomes, more algorithmic progress is needed to build an index for all available data. Results: Emerging pangenomic methods include VG, Giraffe, and Moni, where the first two methods build an index of a variation graph from the multiple alignment of the sequences, and Moni simply indexes all the sequences in a manner that takes the repetition of the sequences into account. Moni uses a preprocessing technique called prefix-free parsing to build a dictionary and parse from the input; these, in turn, are used to build the main run-length encoded BWT and suffix array of the input. This is accomplished in linear space in the size of the dictionary and parse. Therein lies the open problem that we tackle in this paper. Although the dictionary scales nicely (sub-linearly) with the size of the input, the parse becomes orders of magnitude larger than the dictionary. To scale the construction of Moni, we need to remove the parse from the construction of the RLBWT and suffix array. In this paper we accomplish this by applying prefix-free parsing recursively on the parse. Although conceptually simple, this leads to an …

Recursive Prefix-Free Parsing for Building Big BWTs

Authors

Marco Oliva,Travis Gagie,Christina Boucher

Published Date

2023/3/21

Prefix-free parsing is useful for a wide variety of purposes including building the BWT, constructing the suffix array, and supporting compressed suffix tree operations. This linear-time algorithm uses a rolling hash to break an input string into substrings, where the resulting set of unique substrings has the property that none of the substrings’ suffixes (of more than a certain length) is a proper prefix of any of the other substrings’ suffixes. Hence, the name prefix-free parsing. This set of unique substrings is referred to as the dictionary. The parse is the ordered list of dictionary strings that defines the input string. Prior empirical results demonstrated the size of the parse is more burdensome than the size of the dictionary for large, repetitive inputs. Hence, the question arises as to how the size of the parse can scale satisfactorily with the input. Here, we describe our algorithm, recursive prefix-free parsing, which accomplishes …
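
The parsing step described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the window length `w`, the modulus `p`, the Karp-Rabin parameters and all function names are assumptions chosen for the example, and the text is assumed not to contain the terminator `$`. A window is treated as a phrase boundary when its hash is 0 modulo `p`, and consecutive phrases overlap by `w` characters.

```python
def window_hashes(s, w, base=256, prime=1_000_000_007):
    """Karp-Rabin hashes of every length-w window of s, computed by rolling."""
    hashes = []
    h = 0
    top = pow(base, w - 1, prime)
    for i, ch in enumerate(s):
        if i >= w:
            h = (h - ord(s[i - w]) * top) % prime  # drop the character leaving the window
        h = (h * base + ord(ch)) % prime           # append the character entering it
        if i >= w - 1:
            hashes.append(h)                       # hashes[j] covers s[j:j + w]
    return hashes


def prefix_free_parse(text, w=4, p=5):
    """Return (dictionary, parse); parse holds ranks of phrases in the sorted dictionary."""
    s = text + "$" * w  # terminator window always ends the last phrase
    hashes = window_hashes(s, w)
    phrases = []
    start = 0
    for j, h in enumerate(hashes):
        is_trigger = (h % p == 0) or (j == len(hashes) - 1)
        if is_trigger and j > start:
            phrases.append(s[start:j + w])  # phrase ends with the trigger window
            start = j                       # next phrase starts with the same window
    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    return dictionary, [rank[phrase] for phrase in phrases]


def reconstruct(dictionary, parse, w):
    """Undo the parse: glue phrases back together, dropping the w-character overlaps."""
    s = dictionary[parse[0]]
    for r in parse[1:]:
        s += dictionary[r][w:]
    return s[:-w]  # strip the terminator


text = "GATTACAGATTACAGATTAGATTACA" * 3
dictionary, parse = prefix_free_parse(text, w=3, p=4)
print(len(dictionary), len(parse), reconstruct(dictionary, parse, 3) == text)
```

On repetitive input the same phrases recur, so the dictionary stays small while the parse grows with the input, which is the imbalance that recursive prefix-free parsing targets by parsing the parse again.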

Space-time Trade-offs for the LCP Array of Wheeler DFAs

Authors

Nicola Cotumaccio,Travis Gagie,Dominik Köppl,Nicola Prezza

Journal

String Processing and Information Retrieval: 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26–28, 2023, Proceedings

Published Date

2023/9/19

Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires O(n log n) bits, n being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the size of the alphabet is constant. In this paper, we propose a sampling technique that allows accessing an entry of the LCP array in logarithmic time by only storing a linear number of bits. We use our technique to provide a space-time tradeoff to compute matching statistics on a Wheeler DFA. In addition, we show that by augmenting the BOSS representation of a k-th order de Bruijn graph with a linear number of bits we can navigate the underlying variable-order de Bruijn graph in time logarithmic in k, thus improving a previous bound by Boucher et al. which was linear in k [DCC 2015].

Data Structures for SMEM-Finding in the PBWT

Authors

Paola Bonizzoni,Christina Boucher,Davide Cozzi,Travis Gagie,Dominik Köppl,Massimiliano Rossi

Published Date

2023/9/20

The positional Burrows–Wheeler Transform (PBWT) was presented as a means to find set-maximal exact matches (SMEMs) in haplotype data via the computation of the divergence array. Although run-length encoding the PBWT has been previously considered, storing the divergence array along with the PBWT in a compressed manner has not been as rigorously studied. We define two queries that can be used in combination to compute SMEMs, allowing us to define smaller data structures that support one or both of these queries. We combine these data structures, enabling the PBWT and the divergence array to be stored in a manner that allows for finding SMEMs. We estimate and compare the memory usage of these data structures, leading to one data structure that is most memory efficient. Lastly, we implement this data structure and compare its performance to prior methods using various datasets taken from …

Pangenomic Alignment: Strings plus Graphs

Authors

Travis Gagie

Journal

4th Belgrade Bioinformatics Conference

Published Date

2023

The use of only one or a few reference genomes for DNA alignment is known to bias research results and medical diagnoses, but aligning against many reference genomes has been problematic. If we represent such a pangenomic reference as a set of strings, then each seed we find in a DNA read may occur in many of the genomes, so even reporting all those occurrences can be slow, and extending and chaining seeds can be infeasible. On the other hand, if we represent it as a graph then, even apart from the significant technical challenges of indexing graphs, we may find many chimeric matches. The more of humanity’s genetic diversity we try to represent in the graph, the fuzzier it becomes, and the greater the probability of spurious results. Most research on pangenomic alignment uses either a string representation or a graph representation, but not both. In this talk we first describe how a tool called MONI indexes a pangenomic reference as a set of strings in small space such that later, for each maximal exact match in a given read, we can quickly find that match’s length, the position of one of its occurrences in the set of strings, and the lexicographic rank of the suffix starting with that occurrence. We then describe how a tool called MARIA will, when fully implemented, store a pangenomic reference as a graph in small space such that, given MONI’s output about a maximal exact match, we can quickly report all the non-chimeric occurrences of that match in the graph. Combining MONI and MARIA will give us the advantages of working with both strings and graphs: we index the set of reference genomes, the whole set of reference …

Computing matching statistics on Wheeler DFAs

Authors

Alessio Conte,Nicola Cotumaccio,Travis Gagie,Giovanni Manzini,Nicola Prezza,Marinella Sciortino

Published Date

2023/3/21

Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time and space efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree, notably the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.

Data Structures for SMEM-Finding in the PBWT

Authors

Paola Bonizzoni,Christina Boucher,Davide Cozzi,Travis Gagie,Dominik Köppl,Massimiliano Rossi

Journal

String Processing and Information Retrieval: 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26–28, 2023, Proceedings

Published Date

2023/9/19

The positional Burrows-Wheeler Transform (PBWT) was presented as a means to find set-maximal exact matches (SMEMs) in haplotype data via the computation of the divergence array. Although run-length encoding the PBWT has been previously considered, storing the divergence array along with the PBWT in a compressed manner has not been as rigorously studied. We define two queries that can be used in combination to compute SMEMs, allowing us to define smaller data structures that support one or both of these queries. We combine these data structures, enabling the PBWT and the divergence array to be stored in a manner that allows for finding SMEMs. We estimate and compare the memory usage of these data structures, leading to one data structure that is most memory efficient. Lastly, we implement this data structure and compare its performance to prior methods using various datasets taken from the 1000 Genomes Project data.

A simple grammar-based index for finding approximately longest common substrings

Authors

Travis Gagie,Sana Kashgouli,Gonzalo Navarro

Published Date

2023/9/20

We show how, given positive constants ε and δ, and an …

Sum-of-Local-Effects Data Structures for Separable Graphs

Authors

Xing Lyu,Travis Gagie,Meng He,Yakov Nekrich,Norbert Zeh

Published Date

2023/12/9

It is not difficult to think of applications that can be modelled as graph problems in which placing some facility or commodity at a vertex has some positive or negative effect on the values of all the vertices out to some distance, and we want to be able to calculate quickly the cumulative effect on any vertex’s value at any time or the list of the most beneficial or most detrimental effects on a vertex. In this paper we show how, given an edge-weighted graph with constant-size separators, we can support the following operations in time polylogarithmic in the number of vertices and the number of facilities placed on the vertices, where distances between vertices are measured with respect to edge weights: Add …

Augmented thresholds for moni

Authors

César Martínez-Guardiola,Nathaniel K Brown,Fernando Silva-Coira,Dominik Köppl,Travis Gagie,Susana Ladra

Published Date

2023/3/21

MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these queries, which significantly speeds up MONI in practice while only slightly increasing its size.

μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data

Authors

Davide Cozzi,Massimiliano Rossi,Simone Rubinacci,Travis Gagie,Dominik Köppl,Christina Boucher,Paola Bonizzoni

Journal

Bioinformatics

Published Date

2023/9/1

Motivation: The Positional Burrows–Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. Results: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and answering set-maximal exact match (SMEM) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and UK Biobank …

Dynamic Compact Planar Embeddings

Authors

Travis Gagie,Meng He,Michael St Denis

Published Date

2023/9/20

This paper presents a way to compactly represent dynamic connected planar embeddings, which may contain self loops and multi-edges, in 4m + o(m) bits, to support basic navigation in O(lg n) time and edge and vertex insertion and deletion in …

r-indexing without backward searching

Authors

Omar Ahmed,Andrej Baláž,Nathaniel K Brown,Lore Depuydt,Adrián Goga,Alessia Petescia,Mohsen Zakeri,Jan Fostier,Travis Gagie,Ben Langmead,Gonzalo Navarro,Nicola Prezza

Journal

arXiv preprint arXiv:2312.01359

Published Date

2023/12/3

Suppose we are given a text of length and a straight-line program for with rules. Let be the number of runs in the Burrows-Wheeler Transform of the reverse of . We can index in space such that, given a pattern and constant-time access to the Karp-Rabin hashes of the substrings of and the reverse of , we can find the maximal exact matches of with respect to correctly with high probability and using time for each edge we would descend in the suffix tree of while finding those matches.

Ruler wrapping

Authors

Travis Gagie,Mozhgan Saeidi,Allan Sapucaia

Journal

International Journal of Computational Geometry & Applications

Published Date

2023/3/19

In 1985 Hopcroft, Joseph and Whitesides showed it is NP-complete to decide whether a carpenter’s ruler with segments of given positive lengths can be folded into an interval of at most a given length, such that the folded hinges alternate between 180 degrees clockwise and 180 degrees counter-clockwise. At the open-problem session of 33rd Canadian Conference on Computational Geometry (CCCG ’21), O’Rourke proposed a natural variation of this problem called ruler wrapping, in which all folded hinges must be folded the same way. In this paper we show O’Rourke’s variation has a linear-time solution.

Another virtue of wavelet forests?

Authors

Christina Boucher,Travis Gagie,Aaron Hong,Yansong Li,Norbert Zeh

Journal

arXiv preprint arXiv:2308.07809

Published Date

2023/8/15

A wavelet forest for a text over an alphabet takes bits of space and supports access and rank on in time. Kärkkäinen and Puglisi (2011) implicitly introduced wavelet forests and showed that when is the Burrows-Wheeler Transform (BWT) of a string , then a wavelet forest for occupies space bounded in terms of higher-order empirical entropies of even when the forest is implemented with uncompressed bitvectors. In this paper we show experimentally that wavelet forests also have better access locality than wavelet trees and are thus interesting even when higher-order compression is not effective on , or when is not a BWT at all.

Space-time Trade-offs for the LCP Array of Wheeler DFAs

Authors

Nicola Cotumaccio,Travis Gagie,Dominik Köppl,Nicola Prezza

Published Date

2023/9/20

Recently, Conte et al. generalized the longest-common prefix (LCP) array from strings to Wheeler DFAs, and they showed that it can be used to efficiently determine matching statistics on a Wheeler DFA [DCC 2023]. However, storing the LCP array requires O(n log n) bits, n being the number of states, while the compact representation of Wheeler DFAs often requires much less space. In particular, the BOSS representation of a de Bruijn graph only requires a linear number of bits, if the size of the alphabet is constant. In this paper, we propose a sampling technique that allows accessing an entry of the LCP array in logarithmic time …

Graph compression for adjacency-matrix multiplication

Authors

Alexandre P Francisco,Travis Gagie,Dominik Köppl,Susana Ladra,Gonzalo Navarro

Journal

SN Computer Science

Published Date

2022/5

Computing the product of the (binary) adjacency matrix of a large graph with a real-valued vector is an important operation that lies at the heart of various graph analysis tasks, such as computing PageRank. In this paper, we show that some well-known webgraph and social graph compression formats are computation-friendly, in the sense that they allow boosting the computation. We focus on the compressed representations of (a) Boldi and Vigna and (b) Hernández and Navarro, and show that the product computation can be conducted in time proportional to the compressed graph size. Our experimental results show speedups of at least 2 on graphs that were compressed at least 5 times with respect to the original.

Spumoni 2: Improved pangenome classification using a compressed index of minimizer digests

Authors

Omar Ahmed,Massimiliano Rossi,Travis Gagie,Christina Boucher,Ben Langmead

Journal

bioRxiv

Published Date

2022/9/11

Genomics analyses often use a large sequence collection as a reference, like a pangenome or taxonomic database. We previously described SPUMONI, which performs binary classification of nanopore reads using pangenomic matching statistics. Here we describe SPUMONI 2, an improved version that is faster, more memory efficient, works effectively for both short and long reads, and can solve multi-class classification problems with the aid of a novel sampled document array structure. By incorporating minimizers, SPUMONI 2 reduces index size by a factor of 2 compared to SPUMONI, yielding an index more than 65 times smaller than minimap2’s for a mock community pangenome. SPUMONI 2 also achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency for short and long reads, including in an adaptive sampling scenario. We further demonstrate that SPUMONI 2 can detect contaminated contigs in genome assemblies, and can perform multi-class metagenomic read classification.

On representing the degree sequences of sublogarithmic-degree Wheeler graphs

Authors

Travis Gagie

Published Date

2022/11/1

We show how to store a searchable partial-sums data structure with constant query time for a static sequence S of n positive integers in o(log n / (log log n)^2), in bits for k ∈ o(log n / (log log n)^2). It follows that if a Wheeler graph on n vertices has maximum degree in …

MONI: a pangenomic index for finding maximal exact matches

Authors

Massimiliano Rossi,Marco Oliva,Ben Langmead,Travis Gagie,Christina Boucher

Journal

Journal of Computational Biology

Published Date

2022/2/1

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse …

Improving matrix-vector multiplication via lossless grammar-compressed matrices

Authors

Paolo Ferragina,Travis Gagie,Dominik Köppl,Giovanni Manzini,Gonzalo Navarro,Manuel Striani,Francesco Tosoni

Journal

arXiv preprint arXiv:2203.14540

Published Date

2022/3/28

As nowadays Machine Learning (ML) techniques are generating huge data collections, the problem of how to efficiently engineer their storage and operations is becoming of paramount importance. In this article we propose a new lossless compression scheme for real-valued matrices which achieves efficient performance in terms of compression ratio and time for linear-algebra operations. Experiments show that, as a compressor, our tool is clearly superior to gzip and it is usually within 20% of xz in terms of compression ratio. In addition, our compressed format supports matrix-vector multiplications in time and space proportional to the size of the compressed representation, unlike gzip and xz that require the full decompression of the compressed matrix. To our knowledge our lossless compressor is the first one achieving time and space complexities which match the theoretical limit expressed by the -th order statistical entropy of the input. To achieve further time/space reductions, we propose column-reordering algorithms hinging on a novel column-similarity score. Our experiments on various data sets of ML matrices show that, with a modest preprocessing time, our column reordering can yield a further reduction of up to 16% in the peak memory usage during matrix-vector multiplication. Finally, we compare our proposal against the state-of-the-art Compressed Linear Algebra (CLA) approach showing that ours always runs at least twice as fast (in a multi-thread setting) and achieves better compressed space occupancy for most of the tested data sets. This experimentally confirms the provably effective theoretical bounds we show for our …

Teaching the Burrows-Wheeler Transform via the Positional Burrows-Wheeler Transform

Authors

Travis Gagie,Giovanni Manzini,Marinella Sciortino

Journal

arXiv preprint arXiv:2208.09840

Published Date

2022/8/21

The Burrows-Wheeler Transform (BWT) is often taught in undergraduate courses on algorithmic bioinformatics, because it underlies the FM-index and thus important tools such as Bowtie and BWA. Its admirers consider the BWT a thing of beauty but, despite thousands of pages being written about it over nearly thirty years, to undergraduates seeing it for the first time it still often seems like magic. Some who persevere are later shown the Positional BWT (PBWT), which was published twenty years after the BWT. In this paper we argue that the PBWT should be taught before the BWT. We first use the PBWT's close relation to a right-to-left radix sort to explain how to use it as a fast and space-efficient index for positional search on a set of strings (that is, given a pattern and a position, quickly list the strings containing that pattern starting in that position). We then observe that prefix search (listing all the strings that start with the pattern) is an easy special case of positional search, and that prefix search on the suffixes of a single string is equivalent to substring search in that string (listing all the starting positions of occurrences of the pattern in the string). Storing naïvely a PBWT of the suffixes of a string is space-inefficient but, in even reasonably small examples, most of its columns are nearly the same. It is not difficult to show that if we store a PBWT of the cyclic shifts of the string, instead of its suffixes, then all the columns are exactly the same -- and equal to the BWT of the string. Thus we can teach the BWT and the FM-index via the PBWT.
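As a rough illustration of the object this abstract ends with, the BWT of a string can be read off as the last column of its sorted cyclic shifts. The sketch below is a textbook quadratic-space construction, not the PBWT-based presentation the paper argues for; the function name and the `$` sentinel are assumptions made for the example.

```python
def bwt_via_cyclic_shifts(s: str, sentinel: str = "$") -> str:
    """Return the BWT of s + sentinel, read as the last column of the
    lexicographically sorted cyclic shifts (illustrative, not space-efficient)."""
    if sentinel in s:
        raise ValueError("sentinel must not occur in the input")
    t = s + sentinel
    # Build every cyclic shift of t and sort them lexicographically.
    shifts = sorted(t[i:] + t[:i] for i in range(len(t)))
    # The BWT is the last character of each sorted shift.
    return "".join(row[-1] for row in shifts)


print(bwt_via_cyclic_shifts("banana"))  # prints annb$aa
```

The sentinel is assumed to sort before every other character, which holds for `$` against ASCII letters.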

KATKA: A KRAKEN-Like Tool with k Given at Query Time

Authors

Travis Gagie,Sana Kashgouli,Ben Langmead

Published Date

2022/11/1

We describe a new tool, KATKA, that stores a phylogenetic tree T such that later, given a pattern P[1..m] and an integer k, it can quickly return the root of the smallest subtree of T containing all the genomes in which the k-mer occurs, for . This is similar to KRAKEN’s functionality but with k given at query time instead of at construction time.

Towards Deep and Interpretable Rule Learning

Authors

Johannes Fürnkranz

Published Date

2022

Slides from the CAIML Seminar, TU Wien: "Towards Deep and Interpretable Rule Learning", Johannes Fürnkranz, Johannes Kepler University, Linz, Institute for Application-Oriented Knowledge Processing, Computational Data Analytics Group (juffi@faw.jku.at). Joint work with Florian Beck, Van Quoc Phuong Hyunh, Tomas Kliegr et al. AI and (Lack of) Interpretability: many AI systems can produce good performance but cannot explain their decisions (→ “black-box models”). Example: when Kasparov lost a crucial game …

CSTs for Terabyte-Sized Data

Authors

Marco Oliva,Davide Cenzato,Massimiliano Rossi,Zsuzsanna Lipták,Travis Gagie,Christina Boucher

Published Date

2022/3/22

Generating pangenomic datasets is becoming increasingly common but there are still few tools able to handle them and even fewer accessible to non-specialists. Building compressed suffix trees (CSTs) for pangenomic datasets is still a major challenge but could be enormously beneficial to the community. In this paper, we present a method, which we refer to as Repfp-cst, for building CSTs in a manner that is scalable. To accomplish this, we show how to build a CST directly from VCF files without decompressing them, and to prune from the prefix-free parse (PFP) phrase boundaries whose removal reduces the total size of the dictionary and the parse. We show that these improvements reduce the time and space required for the construction of the CST, and the memory footprint of the finished CST, enabling us to build a CST for a terabyte of DNA for the first time in the literature.

MONI-k: An index for efficient pangenome-to-pangenome comparison

Authors

Travis Gagie

Journal

bioRxiv

Published Date

2022/8/11

Maximal exact matches (MEMs) are widely used in bioinformatics, originally for genome-to-genome comparison but especially for DNA alignment ever since Li (2013) presented BWA-MEM. Building on work by Bannai, Gagie and I (2018) and again targeting alignment, Rossi et al. (2022) recently built an index called MONI that is based on the run-length compressed Burrows-Wheeler Transform and can find MEMs efficiently with respect to pangenomes. In this paper we define k-MEMs to be maximal substrings of a pattern that each occur at least k times in a text (so a MEM is a 1-MEM) and briefly explain why computing k-MEMs could be useful for pangenome-to-pangenome comparison. We then show that, when k is given at construction time, MONI can easily be extended to find k-MEMs efficiently as well.
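To make the k-MEM definition concrete, here is a brute-force sketch that follows the definition directly. It is not how MONI or MONI-k work (they rely on the run-length compressed BWT); the function names are hypothetical, and the running time is far from what an index would achieve.

```python
def count_occurrences(text: str, sub: str) -> int:
    """Count (possibly overlapping) occurrences of a non-empty sub in text."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def k_mems(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return (start, length) pairs of the k-MEMs of pattern with respect to text:
    substrings of pattern occurring at least k times in text that cannot be
    extended to the left or right without dropping below k occurrences."""
    m = len(pattern)
    result = []
    prev_end = 0  # furthest end reached by matches starting at earlier positions
    for i in range(m):
        # Longest substring starting at i with at least k occurrences
        # (right-maximal by construction).
        j = i
        while j < m and count_occurrences(text, pattern[i:j + 1]) >= k:
            j += 1
        # It is left-maximal only if it ends beyond every earlier match.
        if j > i and j > prev_end:
            result.append((i, j - i))
        prev_end = max(prev_end, j)
    return result


print(k_mems("cabra", "abracadabra", 1))  # [(0, 2), (1, 4)]
print(k_mems("cabra", "abracadabra", 2))  # [(1, 4)]
```

With k = 1 this reports ordinary MEMs; raising k to 2 drops "ca", which occurs only once in the text.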

Rectangular Ruler Wrapping

Authors

Xing Lyu,Travis Gagie,Meng He

Journal

arXiv e-prints

Published Date

2022/10

We define Rectangular Ruler Wrapping as a natural variant of the Ruler Wrapping problem proposed by O'Rourke at CCCG'21, and give a simple, online and quadratic-time algorithm for it, under the simplifying assumption that the last segment must extend strictly beyond every other in the relevant direction.

Efficient and compact representations of some non-canonical prefix-free codes

Authors

Antonio Fariña,Travis Gagie,Szymon Grabowski,Giovanni Manzini,Gonzalo Navarro,Alberto Ordóñez

Journal

Theoretical Computer Science

Published Date

2022/3/12

For many kinds of prefix-free codes there are efficient and compact alternatives to the traditional tree-based representation. Since these put the codes into canonical form, however, they can only be used when we can choose the order in which codewords are assigned to symbols. In this paper we first show how, given a probability distribution over an alphabet of σ symbols, we can store an optimal alphabetic prefix-free code in O(σ lg L) bits such that we can encode and decode any codeword of length ℓ in O(min(ℓ, lg L)) time, where L is the maximum codeword length. With O(2^{L^ϵ}) further bits, for any constant ϵ > 0, we can encode and decode in O(lg ℓ) time. We then show how to store a nearly optimal alphabetic prefix-free code in o(σ) bits such that we can encode and decode in constant time. We also consider a kind of optimal prefix-free code introduced recently where the codewords' lengths are non …

Syotti: scalable bait design for DNA enrichment

Authors

Jarno N Alanko,Ilya B Slizovskiy,Daniel Lokshtanov,Travis Gagie,Noelle R Noyes,Christina Boucher

Journal

Bioinformatics

Published Date

2022/7/1

Motivation: Bait enrichment is a protocol that is becoming increasingly ubiquitous as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes (‘baits’) are designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait-enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples. Results: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time …

Compressed data structures for population-scale Positional Burrows–Wheeler Transforms

Authors

Paola Bonizzoni,Christina Boucher,Davide Cozzi,Travis Gagie,Sana Kashgouli,Dominik Köppl,Massimiliano Rossi

Journal

bioRxiv

Published Date

2022/9/19

The positional Burrows–Wheeler Transform (PBWT) was presented in 2014 by Durbin as a means to find all maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This time complexity of finding maximal haplotype matches using the PBWT is a significant improvement over the naïve pattern-matching algorithm that requires O(h^2 w) time. Compared to the more famous Burrows-Wheeler Transform (BWT), however, relatively little attention has been paid to the PBWT. This has resulted in less space-efficient data structures for building and storing the PBWT. Given the increasing size of available haplotype datasets, and the applicability of the PBWT to pangenomics, the time is ripe for identifying efficient data structures that can be constructed for large datasets. Here, we present a comprehensive study of the memory footprint of data structures supporting maximal haplotype matching in conjunction with the PBWT. In particular, we present several data structure components that act as building blocks for constructing six different data structures that store the PBWT in a manner that supports efficiently finding the maximal haplotype matches. We estimate the memory usage of the data structures by bounding the space usage with respect to the input size. In light of this experimental analysis, we implement the solutions that are deemed to be superior with respect to the memory usage and show the performance on haplotype datasets taken from the 1000 Genomes Project data.

A fast and simple -space index for finding approximately longest common substrings

Authors

Nick Fagan,Jorge Hermo González,Travis Gagie

Journal

arXiv preprint arXiv:2211.13434

Published Date

2022/11/24

We describe how, given a text and a positive constant , we can build a simple -space index, where is the number of phrases in the LZ77 parse of , such that later, given a pattern , in time and with high probability we can find a substring of that occurs in and whose length is at least a -fraction of the length of a longest common substring of and .

MONI can find k-MEMs

Authors

Igor Tatarnikov,Ardavan Shahrabi Farahani,Sana Kashgouli,Travis Gagie

Journal

arXiv preprint arXiv:2202.05085

Published Date

2022/2/10

Suppose we are asked to index a text such that, given a pattern , we can quickly report the maximal substrings of that each occur in at least times. We first show how we can add bits to Rossi et al.'s recent MONI index, where is the number of runs in the Burrows-Wheeler Transform of , such that it supports such queries in time. We then show how, if we are given at construction time, we can reduce the query time to .

Preface to Special Issue for DCC 2020

Authors

Travis Gagie

Published Date

2022/5

Preface to Special Issue for DCC 2020. Editorial. Author: Travis Gagie, Halifax, Canada. Information and Computation, Volume 285, Issue PB, May 2022. https://doi.org/10.1016/j.ic.2022.104880. Published: 01 May 2022.

MARIA: Multiple-alignment -index with aggregation

Authors

Adrián Goga,Andrej Baláž,Alessia Petescia,Travis Gagie

Journal

arXiv preprint arXiv:2209.09218

Published Date

2022/9/19

There now exist compact indexes that can efficiently list all the occurrences of a pattern in a dataset consisting of thousands of genomes, or even all the occurrences of all the pattern's maximal exact matches (MEMs) with respect to the dataset. Unless we are lucky and the pattern is specific to only a few genomes, however, we could be swamped by hundreds of matches -- or even hundreds per MEM -- only to discover that most or all of the matches are to substrings that occupy the same few columns in a multiple alignment. To address this issue, in this paper we present a simple and compact data index MARIA that stores a multiple alignment such that, given the position of one match of a pattern (or a MEM or other substring of a pattern) and its length, we can quickly list all the distinct columns of the multiple alignment where matches start.

Space-efficient RLZ-to-LZ77 conversion

Authors

Travis Gagie

Journal

arXiv preprint arXiv:2211.13254

Published Date

2022/11/23

Consider a text prefixed by a reference sequence . We show how, given and the -phrase relative Lempel-Ziv parse of with respect to , we can build the LZ77 parse of in time and total space.

Finding maximal exact matches using the r-index

Authors

Massimiliano Rossi,Marco Oliva,Paola Bonizzoni,Ben Langmead,Travis Gagie,Christina Boucher

Journal

Journal of Computational Biology

Published Date

2022/2/1

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in space that supports efficient MEM finding, where r is the number of runs in the Burrows–Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.

PFP Compressed Suffix Trees

Authors

Christina Boucher,Ondřej Cvacho,Travis Gagie,Jan Holub,Giovanni Manzini,Gonzalo Navarro,Massimiliano Rossi

Published Date

2021

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S is the concatenation of many genomes. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to support full suffix tree functionality, still built and fitting within O(|PFP(S)|) space. This entails the efficient computation of various primitives to simulate the suffix tree: computing a longest common extension (LCE) of two positions in S; reading any cell of its …

Efficiently Merging r-indexes

Authors

Marco Oliva,Massimiliano Rossi,Jouni Sirén,Giovanni Manzini,Tamer Kahveci,Travis Gagie,Christina Boucher

Published Date

2021/3/23

Large sequencing projects, such as GenomeTrakr and MetaSub, are updated frequently (sometimes daily, in the case of GenomeTrakr) with new data. Therefore, it is imperative that any data structure indexing such data supports efficient updates. Toward this goal, Bannai et al. (TCS, 2020) proposed a data structure named dynamic r-index which is suitable for large genome collections and supports incremental construction; however, it is still not powerful enough to support substantial updates. Here, we develop a novel algorithm for updating the r-index, which we refer to as RIMERGE. Fundamental to our algorithm is the combination of the basics of the dynamic r-index with a known algorithm for merging Burrows-Wheeler Transforms (BWTs). As a result, RIMERGE is capable of performing batch updates in a manner that exploits parallelism while keeping the memory overhead small. We compare our method to the …

An index for moving objects with constant-time access to their compressed trajectories

Authors

Nieves R Brisaboa,Travis Gagie,Adrián Gómez-Brandón,Gonzalo Navarro,José R Paramá

Journal

International Journal of Geographical Information Science

Published Date

2021/7/3

As the number of vehicles and devices equipped with GPS technology has grown explosively, an urgent need has arisen for time- and space-efficient data structures to represent their trajectories. The most commonly desired queries are the following: queries about an object’s trajectory, range queries, and nearest neighbor queries. In this paper, we consider that the objects can move freely and we present a new compressed data structure for storing their trajectories, based on a combination of logs and snapshots, with the logs storing sequences of the objects’ relative movements and the snapshots storing their absolute positions sampled at regular time intervals. We call our data structure ContaCT because it provides Constant-time access to Compressed Trajectories. Its logs are based on a compact partial-sums data structure that returns cumulative displacement in constant time, and allows us to compute in …
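The logs-plus-snapshots idea can be sketched as follows: a snapshot gives an absolute position at a sampled time, and the difference of two prefix sums over the relative movements gives the displacement since that snapshot. This is only a toy model under stated assumptions: the class name, the 2D integer moves and the `snapshot_every` parameter are hypothetical, and plain Python lists stand in for the compact partial-sums structure that ContaCT actually uses.

```python
from itertools import accumulate


class TrajectoryLog:
    """Toy log-and-snapshot store for one object's 2D trajectory."""

    def __init__(self, start, moves, snapshot_every=4):
        # Prefix sums of the relative movements (the "log").
        self.prefix_x = [0] + list(accumulate(dx for dx, _ in moves))
        self.prefix_y = [0] + list(accumulate(dy for _, dy in moves))
        self.snapshot_every = snapshot_every
        # Absolute positions sampled at regular time intervals (the "snapshots").
        self.snapshots = {
            t: (start[0] + self.prefix_x[t], start[1] + self.prefix_y[t])
            for t in range(0, len(moves) + 1, snapshot_every)
        }

    def position(self, t):
        """Position at time t: nearest earlier snapshot plus the cumulative
        displacement since it, obtained from two prefix-sum lookups."""
        s = (t // self.snapshot_every) * self.snapshot_every
        sx, sy = self.snapshots[s]
        return (sx + self.prefix_x[t] - self.prefix_x[s],
                sy + self.prefix_y[t] - self.prefix_y[s])


log = TrajectoryLog((10, 20), [(1, 0), (1, 1), (0, -1), (2, 0), (-1, 0)], snapshot_every=2)
print(log.position(3))  # (12, 20)
print(log.position(5))  # (13, 20)
```

Each query touches one snapshot and a constant number of prefix-sum entries, which is the access pattern the abstract describes; the space savings of the real structure are not modelled here.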

PHONI: Streamed matching statistics with multi-genome references

Authors

Christina Boucher,Travis Gagie,I Tomohiro,Dominik Köppl,Ben Langmead,Giovanni Manzini,Gonzalo Navarro,Alejandro Pacheco,Massimiliano Rossi

Published Date

2021/3/23

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database. Our code is available at https://github.com/koeppl/phoni.
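For readers unfamiliar with the term, the matching statistics of a pattern with respect to a text record, for each pattern position, the length of the longest substring starting there that also occurs in the text. The sketch below computes them by brute force straight from that definition; it is not PHONI's streaming algorithm, and the function name is an assumption made for the example.

```python
def matching_statistics(pattern: str, text: str) -> list[int]:
    """For each position i of pattern, return the length of the longest
    prefix of pattern[i:] that occurs somewhere in text (brute force)."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


print(matching_statistics("cabra", "abracadabra"))  # [2, 4, 3, 2, 1]
```

A pattern that stays compressible relative to the text keeps long matching statistics, while a run of short values signals the kind of divergence the abstract mentions.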

Pan-genomic matching statistics for targeted nanopore sequencing

Authors

Omar Ahmed,Massimiliano Rossi,Sam Kovaka,Michael C Schatz,Travis Gagie,Christina Boucher,Ben Langmead

Journal

Iscience

Published Date

2021/6/25

Nanopore sequencing is an increasingly powerful tool for genomics. Recently, computational advances have allowed nanopores to sequence in a targeted fashion; as the sequencer emits data, software can analyze the data in real time and signal the sequencer to eject "nontarget" DNA molecules. We present a novel method called SPUMONI, which enables rapid and accurate targeted sequencing using efficient pan-genome indexes. SPUMONI uses a compressed index to rapidly generate exact or approximate matching statistics in a streaming fashion. When used to target a specific strain in a mock community, SPUMONI has accuracy similar to that of minimap2 when both are run against an index containing many strains per species. However, SPUMONI is 12 times faster than minimap2. SPUMONI's index and peak memory footprint are also 16 to 4 times smaller than those of minimap2, respectively. This could enable …

-indexing Wheeler graphs

Authors

Travis Gagie

Journal

arXiv preprint arXiv:2101.12341

Published Date

2021/1/29

Let be a Wheeler graph and be the number of runs in a Burrows-Wheeler Transform of , and suppose can be decomposed into edge-disjoint directed paths whose internal vertices each have in- and out-degree exactly 1. We show how to store in space such that later, given a pattern , in time we can count the vertices of reachable by directed paths labelled , and then report those vertices in time per vertex.

Succinct Euler-Tour Trees

Authors

Travis Gagie,Sebastian Wild

Journal

arXiv preprint arXiv:2105.04965

Published Date

2021/5/11

We show how a collection of Euler-tour trees for a forest on vertices can be stored in bits such that simple queries take constant time, more complex queries take logarithmic time and updates take polylogarithmic amortized time.

RLBWT tricks

Authors

Nathaniel K Brown,Travis Gagie,Massimiliano Rossi

Journal

arXiv preprint arXiv:2112.04271

Published Date

2021/12/8

Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation , it stores an -space table -- where is the number of positions where either or -- that enables the computation of successive values of by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing is constant while maintaining -space. In this paper we refine Nishimoto and Tabei's approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing …

Lecture Notes for 3110: Design and Analysis of Algorithms

Authors

Travis Gagie

Published Date

2021

Lecture notes for the course CSCI 3110 ("Design and Analysis of Algorithms") offered at Dalhousie during the summer term of 2021.

Block trees

Authors

Djamal Belazzougui,Manuel Cáceres,Travis Gagie,Paweł Gawrychowski,Juha Kärkkäinen,Gonzalo Navarro,Alberto Ordóñez,Simon J Puglisi,Yasuo Tabei

Journal

Journal of Computer and System Sciences

Published Date

2021/5/1

Abstract Let string S [1.. n] be parsed into z phrases by the Lempel-Ziv algorithm. The corresponding compression algorithm encodes S in O (z) space, but it does not support random access to S. We introduce a data structure, the block tree, that represents S in O (z log⁡(n/z)) space and extracts any symbol of S in time O (log⁡(n/z)), among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of S. Further, block trees can be built in linear time and in a scalable manner. Our experiments show that block trees offer relevant space-time tradeoffs compared to other compressed string representations for highly repetitive strings.

Simple Worst-Case Optimal Adaptive Prefix-Free Coding

Authors

Travis Gagie

Journal

arXiv preprint arXiv:2109.02997

Published Date

2021/9/7

Gagie and Nekrich (2009) gave an algorithm for adaptive prefix-free coding that, given a string over the alphabet with , encodes in at most bits, where is the empirical entropy of , such that encoding and decoding take time. They also proved their bound on the encoding length is optimal, even when the empirical entropy is high. Their algorithm is impractical, however, because it uses complicated data structures. In this paper we give an algorithm with the same bounds, except that we require , that uses no data structures more complicated than a lookup table. Moreover, when Gagie and Nekrich's algorithm is used for optimal adaptive alphabetic coding it takes time for decoding, but ours still takes time.

Buffering updates enables efficient dynamic de Bruijn graphs

Authors

Jarno Alanko,Bahar Alipanahi,Jonathen Settle,Christina Boucher,Travis Gagie

Journal

Computational and structural biotechnology journal

Published Date

2021/1/1

Motivation: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space and time efficient manner. Results: With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al., 2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a) relies on dynamic bit …

A fast and small subsampled r-index

Authors

Dustin Cobas,Travis Gagie,Gonzalo Navarro

Journal

arXiv preprint arXiv:2103.15329

Published Date

2021/3/29

The -index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, where is the number of runs in the Burrows-Wheeler Transform of the text, is however larger than Lempel-Ziv and grammar-based indexes, and makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the -index, a variant that limits the space to for a text of length and a given parameter , at the expense of multiplying by the time per occurrence reported. The -index is obtained by carefully subsampling the text positions indexed by the -index, in a way that we prove is still able to support pattern matching with guaranteed performance. Our experiments demonstrate that the -index sharply outperforms virtually every other compressed index on repetitive texts, both in time and space, even matching the performance of the -index while using 1.5--3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the -index, using about half the space, but they are an order of magnitude slower.

MONI: A pangenomics index for finding MEMs

Authors

Massimiliano Rossi,Marco Oliva,Ben Langmead,Travis Gagie,Christina Boucher

Journal

bioRxiv

Published Date

2021/7/7

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds, and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously and in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large sequence collections of highly repetitive sequences. Compared to other read aligners – PuffAligner, Bowtie2, BWA-MEM, and CHIC – MONI used 2–11 times less memory and was 2–32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references. Availability: MONI is publicly available at https://github.com/maxrossi91/moni.
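
As a reminder of what a maximal exact match is, here is a brute-force sketch that reports the MEMs of a pattern with respect to a text by direct substring tests. The function name `maximal_exact_matches` is hypothetical, and this is only a definition-level illustration: it does not use the r-index, thresholds, or PFP, and it is far too slow for genomic data.

```python
def maximal_exact_matches(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return (start, length) pairs for the MEMs of pattern with respect to text."""
    m = len(pattern)
    # match_len[i] = length of the longest prefix of pattern[i:] occurring in text.
    match_len = []
    for i in range(m):
        length = 0
        while i + length < m and pattern[i:i + length + 1] in text:
            length += 1
        match_len.append(length)

    mems = []
    for i, length in enumerate(match_len):
        # pattern[i:i+length] is right-maximal by construction. It is also
        # left-maximal when extending it one character to the left gives a
        # string that does not occur in text, i.e. match_len[i-1] <= length.
        if length > 0 and (i == 0 or match_len[i - 1] <= length):
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    print(maximal_exact_matches("nanab", "banana"))
```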

Fast and compact planar embeddings

Authors

Leo Ferres,José Fuentes-Sepúlveda,Travis Gagie,Meng He,Gonzalo Navarro

Journal

Computational Geometry

Published Date

2020/8/1

There are many representations of planar graphs, but few are as elegant as Turán's (1984): it is simple and practical, uses only 4 bits per edge, can handle self-loops and multi-edges, and can store any specified embedding. Its main disadvantage has been that “it does not allow efficient searching” (Jacobson, 1989). In this paper we show how to add a sublinear number of bits to Turán's representation such that it supports fast navigation while retaining simplicity. As a consequence of the inherited simplicity, we offer the first efficient parallel construction of a compact encoding of a planar graph embedding. Our experimental results show that the resulting representation uses about 6 bits per edge in practice, supports basic navigation operations within a few microseconds, and can be built sequentially at a rate below 1 microsecond per edge, featuring a linear speedup with a parallel efficiency around 50% for large …

Decompressing Lempel-Ziv compressed text

Authors

Philip Bille,Mikko Berggren Ettienne,Travis Gagie,Inge Li Gørtz,Nicola Prezza

Published Date

2020/3/24

We consider the problem of decompressing the Lempel-Ziv 77 representation of a string S of length n using a working space as close as possible to the size z of the input. The folklore solution for the problem runs in O(n) time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size O(z log(n/z)) and then stream S in linear time. In this paper, we show that O(n) time and O(z) working space can be achieved for constant-size alphabets. On general alphabets of size σ, we describe (i) a trade-off achieving O(n log^δ σ) time and O(z log^(1-δ) σ) space for any 0 ≤ δ ≤ 1, and (ii) a solution achieving O(n) time and O(z log log(n/z)) space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of S with little overheads on top of the linear running time and …
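
The first folklore solution mentioned above can be sketched as follows: decode phrase by phrase into a buffer that holds the whole decompressed text, copying earlier characters by random access. The phrase encoding used here (a literal character, or a `(source, length)` pair with a 0-based source position) and the name `lz77_decompress` are assumptions made for illustration; LZ77 variants differ in how phrases are represented.

```python
def lz77_decompress(phrases) -> str:
    """Decode an LZ77-style parse in O(n) time using O(n) working space."""
    out = []  # the whole decompressed text, needed for random access
    for phrase in phrases:
        if isinstance(phrase, str):
            # Literal phrase: a single new character.
            out.append(phrase)
        else:
            # Copy phrase: `length` characters starting at position `source`.
            source, length = phrase
            # Copy one character at a time so that a phrase may overlap
            # the text it is producing (self-referential copies).
            for k in range(length):
                out.append(out[source + k])
    return "".join(out)


if __name__ == "__main__":
    print(lz77_decompress(["a", "b", (0, 4), "c"]))
```

The buffer `out` is exactly the random-access requirement that the abstract's space-efficient algorithms avoid.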

Compressed dynamic range majority and minority data structures

Authors

Travis Gagie,Meng He,Gonzalo Navarro

Journal

Algorithmica

Published Date

2020/7

In the range τ-majority query problem, we are given a sequence S[1..n] and a fixed threshold τ ∈ (0, 1), and are asked to preprocess S such that, given a query range [i..j], we can efficiently report the symbols that occur more than τ(j − i + 1) times in S[i..j], which are called the range τ-majorities. In this article we describe the first compressed dynamic data structure for range τ-majority queries. It represents S in compressed space, nH_k + o(n lg σ) bits for any k = o(lg_σ n), where σ is the alphabet size and H_k is the kth order empirical entropy of S, and answers queries in O(lg n / (τ lg lg n)) time while supporting insertions and deletions in S in O(lg n / τ) amortized time. We then show how to modify our data structure to receive some β ≥ τ at query time and report the range β-majorities in O(lg n / (β lg lg n)) time, without increasing the asymptotic space or update-time bounds. The best previous dynamic solution has the same query and update times as ours, but it occupies O(n) words and cannot take …
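
The query itself is easy to state in code. The sketch below answers a range τ-majority query by scanning the range and counting, which takes time linear in the range length and involves no preprocessing; it only illustrates the definition (symbols occurring more than τ times the range length), not the compressed dynamic structure described above. The function name and the 0-based inclusive indexing are assumptions.

```python
from collections import Counter


def range_tau_majorities(seq, i: int, j: int, tau: float) -> list:
    """Symbols occurring more than tau * (j - i + 1) times in seq[i..j] (inclusive)."""
    window = seq[i:j + 1]
    threshold = tau * len(window)
    counts = Counter(window)
    # Sorted only to make the output order deterministic.
    return sorted(symbol for symbol, count in counts.items() if count > threshold)


if __name__ == "__main__":
    print(range_tau_majorities("abracadabra", 0, 10, 0.25))
```

At most 1/τ symbols can exceed the threshold, so the output size is bounded independently of the range length.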

More time-space tradeoffs for finding a shortest unique substring

Authors

Hideo Bannai,Travis Gagie,Gary Hoppenworth,Simon J Puglisi,Luís MS Russo

Journal

Algorithms

Published Date

2020/9/18

We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new time-space tradeoffs for this problem and the generalization of finding k-mismatch SUSs. Our new results include the first algorithm for finding a k-mismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in O(n(L/m) log L) time using random access to T, or in O(n(L/m) log^2(L) log log σ) time using O((L/m) log^2 L) sequential passes over T. We then describe how, for constant k, with high probability, we can find a k-mismatch SUS in O(n^(1+ϵ) L/m) time using O(n^ϵ L/m) sequential passes over T, again using only m words of workspace. Finally, we also describe a deterministic algorithm that takes O(nτ log σ log n) time to find an SUS using O(n/τ) words of workspace, where τ is a parameter.
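
For readers unfamiliar with the problem, a shortest unique substring is a shortest substring that occurs exactly once in the text. The brute-force sketch below tries lengths in increasing order and counts every substring of that length; it uses far more time and space than the tradeoffs in the abstract and is meant only to pin down the definition. The function name is hypothetical.

```python
from collections import Counter


def shortest_unique_substring(text: str):
    """Return (start, substring) for a leftmost shortest unique substring, or None."""
    n = len(text)
    for length in range(1, n + 1):
        # Count all (possibly overlapping) occurrences of each substring of this length.
        counts = Counter(text[i:i + length] for i in range(n - length + 1))
        for i in range(n - length + 1):
            if counts[text[i:i + length]] == 1:
                return i, text[i:i + length]
    return None  # only reached for the empty text


if __name__ == "__main__":
    print(shortest_unique_substring("abab"))
```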

Fully functional suffix trees and optimal text searching in BWT-runs bounded space

Authors

Travis Gagie,Gonzalo Navarro,Nicola Prezza

Journal

Journal of the ACM (JACM)

Published Date

2020/1/15

Indexing highly repetitive texts—such as genomic databases, software repositories and versioned text collections—has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the …
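
Counting with an FM-index works by backward search over the BWT. The sketch below shows that search on a plain, uncompressed BWT string with rank computed by naive scanning; it is an assumption-laden illustration of the counting step only, not the Run-Length FM-index or the O(r)-space locating structure from the article. The helper names `bwt` and `count_occurrences` are hypothetical.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (illustration only)."""
    assert sentinel not in text
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern using backward search."""
    # first[c] = number of characters in the text that are smaller than c.
    first, total = {}, 0
    for c in sorted(set(bwt_str)):
        first[c] = total
        total += bwt_str.count(c)

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; real indexes answer this quickly
        # with a rank data structure instead of scanning.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current suffix array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(c, lo)
        hi = first[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    print(count_occurrences(bwt("banana"), "ana"))
```

Each pattern character costs two rank queries, which is why counting time is proportional to the pattern length m times the cost of a rank query.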

PFP Data Structures

Authors

Christina Boucher,Ondřej Cvacho,Travis Gagie,Jan Holub,Giovanni Manzini,Gonzalo Navarro,Massimiliano Rossi

Journal

arXiv preprint arXiv:2006.11687

Published Date

2020/6/21

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string S, it produces a dictionary D and a parse P of overlapping phrases such that BWT(S) can be computed from D and P in time and workspace bounded in terms of their combined size |PFP(S)|. In practice D and P are significantly smaller than S and computing BWT(S) from them is more efficient than computing it from S directly, at least when S consists of genomes from individuals of the same species. In this paper, we consider PFP(S) as a data structure and show how it can be augmented to support the following queries quickly, still in O(|PFP(S)|) space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for variants of human chromosome 19, initially occupying roughly 56 GB.

Practical random access to SLP-compressed texts

Authors

Travis Gagie,Tomohiro I,Giovanni Manzini,Gonzalo Navarro,Hiroshi Sakamoto,Louisa Seelbach Benkner,Yoshimasa Takabatake

Published Date

2020/9/17

Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as genomic databases. In a recent paper (SPIRE 2019) we showed how simple pre-processing can dramatically improve those trade-offs, and in this paper we turn our attention to one of the features that make grammar-based compression so attractive: the possibility of supporting fast random access. This is an essential primitive in many algorithms that process grammar-compressed texts without decompressing them and so many theoretical bounds have been published about it, but experimentation has lagged behind. We give a new encoding of grammars that is about as small as the practical state of the art (Maruyama et al., SPIRE 2013) but with significantly faster queries.

Refining the r-index

Authors

Hideo Bannai,Travis Gagie,I Tomohiro

Journal

Theoretical Computer Science

Published Date

2020/4/6

Gagie, Navarro and Prezza's r-index (SODA, 2018) promises to speed up DNA alignment and variation calling by allowing us to index entire genomic databases, provided certain obstacles can be overcome. In this paper we first strengthen and simplify Policriti and Prezza's Toehold Lemma (DCC '16; Algorithmica, 2017), which inspired the r-index and plays an important role in its implementation. We then show how to update the r-index efficiently after adding a new genome to the database, which is likely to be vital in practice. As a by-product of this result, we obtain an online version of Policriti and Prezza's algorithm for constructing the LZ77 parse from a run-length compressed Burrows-Wheeler Transform. Our experiments demonstrate the practicality of all three of these results. Finally, we show how to augment the r-index such that, given a new genome and fast random access to the database, we can quickly …

Tree path majority data structures

Authors

Travis Gagie,Meng He,Gonzalo Navarro,Carlos Ochoa

Journal

Theoretical Computer Science

Published Date

2020/9/12

We present the first solution to finding τ-majorities on tree paths. Given a tree of n nodes, each with a label from [1..σ], and a fixed threshold 0 < τ < 1, such a query gives two nodes u and v and asks for all the labels that appear more than τ·|P_uv| times in the path P_uv from u to v, where |P_uv| denotes the number of nodes in P_uv. Note that the answer to any query is of size up to 1/τ. On a w-bit RAM, we obtain a linear-space data structure with O((1/τ) lg lg_w σ) query time, which is worst-case optimal for polylogarithmic-sized alphabets. We also describe two succinct-space solutions with query time O((1/τ) lg* n lg lg_w σ). One uses 2nH + 4n + o(n)(H + 1) bits, where H ≤ lg σ is the entropy of the label distribution; the other uses nH + O(n) + o(nH) bits. By using just o(n lg σ) extra bits, our succinct structures allow τ to be specified at query time. We obtain analogous results to find a τ-minority, that is, an …

Matching Reads to Many Genomes with the r-Index

Authors

Taher Mun,Alan Kuhnle,Christina Boucher,Travis Gagie,Ben Langmead,Giovanni Manzini

Journal

Journal of Computational Biology

Published Date

2020/4/1

The r-index is a tool for compressed indexing of genomic databases for exact pattern matching, which can be used to completely align reads that perfectly match some part of a genome in the database or to find seeds for reads that do not. This article shows how to download and install the programs ri-buildfasta and ri-align; how to call ri-buildfasta on a FASTA file to build an r-index for that file; and how to query that index with ri-align.

Computation over compressed data

Authors

Travis Gagie,Gonzalo Navarro

Published Date

2020/8/1

Computing with compressed data is in the middle of some important anniversaries: 2019 marked thirty years since Jacobson's seminal FOCS paper on compact data structures for trees and planar graphs, and twenty-five since the publication of the Burrows-Wheeler Transform; 2020 marks twenty-five years since Farach and Thorup's STOC paper on compressed pattern matching, and twenty since Grossi and Vitter's and Ferragina and Manzini's STOC and FOCS papers on compressed indexing. Our field has come a long way in just a couple of decades: we started off worrying about managing gigabytes and now we want to tame terabytes! Our growing community is still sufficiently close-knit for many of us to be on a first-name basis but already we have a hefty textbook, regular special sessions, and now a third special journal issue. This issue contains extended versions of five papers selected from those …

Efficient construction of a complete index for pan-genomics read alignment

Authors

Alan Kuhnle,Taher Mun,Christina Boucher,Travis Gagie,Ben Langmead,Giovanni Manzini

Journal

Journal of Computational Biology

Published Date

2020/4/1

Short-read aligners predominantly use the FM-index, which is easily able to index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows–Wheeler Transform (BWT) of the string that will allow us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases, by run-length compressing the BWT, but until recently there was no means known to keep the SA sample small without greatly slowing down access to the SA. Now that Gagie et al. (SODA 2018) have defined an SA sample that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic …

Travis Gagie FAQs

What is Travis Gagie's h-index at Dalhousie University?

The h-index of Travis Gagie has been 21 since 2020 and 31 in total.

What are Travis Gagie's top articles?

The articles with the titles of

Faster MEM-finding in space

Stronger compact representations of object trajectories

Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests

Pfp-fm: an accelerated FM-index

Wheeler Maps

Space-efficient conversions from SLPs

Faster Maximal Exact Matches with Lazy LCP Evaluation

Faster compressed quadtrees

...

are the top articles of Travis Gagie at Dalhousie University.

What are Travis Gagie's research interests?

The research interests of Travis Gagie are: data structures, data compression

What is Travis Gagie's total number of citations?

Travis Gagie has 3,353 citations in total.

Who are the co-authors of Travis Gagie?

The co-authors of Travis Gagie are Gonzalo Navarro, Ben Langmead, Simon J. Puglisi, Veli Mäkinen, Paweł Gawrychowski, Christina Boucher.

    Co-Authors

    H-index: 79
    Gonzalo Navarro
    Universidad de Chile

    H-index: 39
    Ben Langmead
    Johns Hopkins University

    H-index: 39
    Simon J. Puglisi
    Helsingin yliopisto

    H-index: 39
    Veli Mäkinen
    Helsingin yliopisto

    H-index: 29
    Paweł Gawrychowski
    Uniwersytet Wroclawski

    H-index: 27
    Christina Boucher
    University of Florida
