We provide here a non-exhaustive list of implementations of the algorithms and data structures described in the book. We mainly list prototypes that are useful for pedagogical purposes. Sometimes we also include high-performance, mature codes that are ready to be used in genome-scale applications.
If you or your students implement some algorithms from the book, please let us know via email and we will link your code from this page. We do our best to assert that it is shared in good trust, but we do not fully check its efficiency or correctness.
Part I – Preliminaries
3. Data structures
3.1 Dynamic range minimum queries
- File rbt_max.cpp inside the NN50-calculator implements fully dynamic range maximum queries using a red-black tree.
- The libmaus2 library implements static range minimum queries using lookup tables, dynamic programming and tree encodings.
3.2 Bitvector rank and select operations
- See the rank_support* and select_support* files in the SDSL library.
- See the rank folder in the libmaus2 library.
3.3 Wavelet tree
- See the wt* files in the SDSL library.
- The libmaus2 library implements static and dynamic wavelet trees, as well as external-memory and parallel construction algorithms.
- The libcds2 library implements the wavelet tree and the wavelet matrix.
4. Graphs
4.1 Directed acyclic graphs (DAGs)
4.1.1 Topological ordering
- See the topologicalSort function in the LEMON library.
4.2 Arbitrary directed graphs
4.2.1 Eulerian paths
- See the connectivity functions in the LEMON library.
4.2.2 Shortest paths and the Bellman-Ford method
- See the implementation of the Bellman-Ford algorithm in the LEMON library.
5. Network flows
5.2 Minimum-cost flows and circulations
- See the minimum-cost flow algorithms in the LEMON library.
- Other algorithms and implementations have been ported under a standardized interface in the MCFClass project.
- IBFS is a solver for the maximum flow problem. This, and other maximum flow solvers, were empirically evaluated on computer vision problems.
5.2.2 A pseudo-polynomial algorithm
- See the minimum-mean cost cycle algorithms implemented in the LEMON library.
5.3 Bipartite matching problems
5.3.1 Perfect matching
- The LEMON library provides an efficient implementation of Edmond's maximum weight perfect matching algorithm (not based on minimum-cost flows), and a number of other matching algorithms.
5.4 Covering problems
5.4.2 Minimum path cover in a DAG
- The reduction to a minimum-cost problem presented in the book is part of the MPC-Solver.
Part II – Fundamentals of Biological Sequence Analysis
6. Alignments
- The libmaus2 library implements a number of algorithms for local, global and sparse sequence alignment, overlap alignment, and LCS computation.
6.1 Edit distance
6.1.1 Edit distance computation
- The example in the book was generated with the script dp2tikz.py.
* 6.1.3 Myers bitparallel algorithm
- See EditDistance.h in this read alignment implementation.
6.2 Longest common subsequence
6.2.1 Sparse dynamic programming
- The algorithm can be found in this prototype targeted to a music retrieval application.
6.3 Approximate string matching
6.4 Biological sequence alignment
6.4.1 Global alignment
- See the pairwise2 module in Biopython.
6.4.2 Local alignment
- See the pairwise2 module in Biopython.
6.4.4 Affine gap scores
- See the pairwise2 module in Biopython.
6.5 Gene alignment
- Some routines are implemented in the codonalignment module in Biopython, but not quite as described in the book.
6.6 Multiple alignment
6.6.2 Dynamic programming
- The MSA package for multiple sequence alignment implements a speed-up for optimal multiple alignment by Carillo and Lipman.
6.6.4 Progressive multiple alignment
6.6.5 DAG alignment
- PAGAN implements DAG-based progressive alignment.
6.6.6 Jumping alignment
- See the JAli project.
7. Hidden Markov models (HMMs)
7.2 The Viterbi algorithm
- See the HMM package in Biopython.
7.3 The forward and backward algorithms
- See the HMM package in Biopython.
7.4 Estimating HMM parameters
- See the HMM package in Biopython.
Part III – Genome-Scale Index Structures
8. Classical indexes
8.1 k-mer index
- The libmaus2 library implements a gamma encoder and decoder, and some applications of gamma-coding like a run-length encoder.
8.2 Suffix array
8.2.1 Suffix and string sorting
- The sais library provides a number of space- and time-efficient implementations.
8.3 Suffix tree
8.3.2 Construction of the suffix tree
- The libmaus2 library implements algorithms for building the LCP array from the suffix array, and from the BWT represented as a wavelet tree.
8.4 Applications of the suffix tree
8.4.1 Maximal repeats
- See REPuter, gt-repfind in GenomeTools, and the MaxRepeats iterator in SEQAN.
8.4.2 Maximal unique matches
- See MUMmer and the MUMs iterator in SEQAN.
8.4.3 Document counting
- The algorithm described in the book is implemented with compressed data structures in this frequent string mining package.
9. Burrows-Wheeler indexes
9.2 BWT index
9.2.3 Succinct suffix array
- Many variants of succinct suffix arrays, FM-indexes, and compressed suffix arrays are implemented in the Pizza&Chili corpus.
- A more generic interface can be found inside the SDSL library.
- The libmaus2 library implements the FM-index, a bidirectional index, and a succinct suffix array.
9.3 Space-efficient construction of the BWT
- See the csalib library for a space-efficient implementation in C of a different algorithm from what we describe.
- See the construct_bwt package for an even more space-efficient implementation in C++.
9.4 Bidirectional BWT index
- See the BD_BWT_index for an implementation of the bidirectional BWT index.
- Section 9.4 and 9.4.1 describe general iterators of the internal nodes of a suffix tree. See VSTree Iterator for an example interface of such an iterator.
9.5 BWT index for labeled trees
- See XBWT.
9.6 BWT index for labeled DAGs
- See GCSA.
9.7 BWT indexes for de Bruijn graphs
9.7.1 Frequency-oblivious representation
- An inefficient, pedagogical implementation in Python is available at debby.py.
- See dbgfm for an implementation of a different data structure based on the Burrows-Wheeler transform.
Part IV – Genome-Scale Algorithms
10. Read alignment
10.1 Pattern partitioning
- The extension of the pigeonhole principle considered in Exercise 10.2 is implemented in ERNE.
10.2 Dynamic programming along suffix tree paths
- Several prototype implementations of pattern partitioning, combined with the bitparallel computation along suffix tree paths, can be found in this package.
10.3 Backtracking on BWT indexes
10.3.1 Prefix pruning
- An almost verbatim implementation can be found in readaligner.
- The original implementation of prefix pruning, together with some optimizations for DNA alphabet and seed-heuristics, can be found in BWA.
- The idea of pruning by hashing combined with the BWT index, described in Insight 10.3, is implemented in ERNE.
10.3.2 Case analysis pruning with the bidirectional BWT index
- An almost verbatim implementation can be found in readaligner.
- The original implementation of case analysis pruning, together with some optimizations for DNA alphabet and seed-heuristics, can be found in Bowtie.
- The original implementation of the enhanced case analysis pruning exploiting the bidirectional BWT index, together with some optimizations for DNA alphabet and seed-heuristics, can be found in SOAP2.
10.4 Suffix filtering for approximate overlaps
- An implementation can be found in Overlap tool.
10.6 Split alignment of reads
- One of the book Insights is implemented in TopHat.
10.7 Alignment of reads to a pan-genome
10.7.1 Indexing a set of individual genomes
- An implementation of the hybrid index is available at hybrid.
10.7.2 Indexing a reference genome and a set of variations
- See GCSA.
11. Genome analysis and comparison
11.1 Space-efficient genome analysis
11.1.1 Maximal repeats
- For a pedagogical, inefficient implementation, see the BW4SA library, files maximal_repeats.h and maximal_repeats.c.
11.1.2 Maximal unique matches
- The algorithm for computing MUMs between two sequences is incorporated into the SDSL library: see maximal_unique_matches.cpp.
- For a pedagogical implementation, see also the BW4SA library, files mum.h and mum.c.
11.1.3 Maximal exact matches
- For a pedagogical implementation see the BW4SA library, files mems.h and mems.c.
- Alternative implementations are provided in backwardMEM and at Princeton University.
- Packages Unwords and MAW implement minimal absent words using alternative approaches, including suffix arrays.
- Ultraconserved elements can be seen as a biological generalization of maximal exact matches. A number of tools are available to detect them.
11.2 Comparing genomes without alignment
11.2.1 Substring and k-mer kernels
- For pedagogical implementations, see the BW4SA library.
- CVTree (source code) is a webserver implementation of k-mer kernels that is not designed for space-efficiency.
- The data-driven approach for choosing k detailed in an insight is also implemented in the FFP package.
- Some examples of k-mer extractors are: Jellyfish, DSK, KMC2, BF Counter, Tallymer.
* 11.2.2 Substring kernels with Markovian correction
- A proof-of-concept implementation based on truncated suffix trees is available in the Composerv package.
11.2.3 Substring kernels and matching statistics
- The SASK package implements substring kernels using matching statistics and suffix arrays.
- The backwardSK package implements substring kernels using matching statistics, the BWT, a balanced parentheses data structure, and the LCP array.
- kmacs implements an inexact variant of the average common substring approach described in the insight, but using a simple heuristic.
11.2.4 Mismatch kernels
- A variant of the mismatch kernel described in this section is implemented by the SEQAM lab.
- A collection of other inexact kernels is provided by Christina Leslie's lab. Such implementations are not designed for space-efficiency.
11.2.5 Compression distance
-
The CompLearn suite implements the Normalized Compression Distance and uses it for clustering.
12. Genome compression
12.2 Bit-optimal Lempel-Ziv compression
- An implementation of Relative Lempel-Ziv is implemented in the RLZ package.
- An alternative approach is implemented in the GDC2 package.
13. Fragment assembly
13.2 Contig assembly
- GATB (Genome Assembly & Analysis Tool Box) implements de Bruijn graphs based on a Bloom filter, and supports reverse complements.
- This script is based on GATB and reports all unitigs in a de Bruijn graph.
13.2.1 Read error correction
- The algorithm given in the book is largely simplified from what actual tools use. Perhaps the most similar approach is implemented in LoRDEC.
13.2.3 Irreducible overlap graphs
- The algorithm described in the Insight is implemented in SGA.
13.3 Scaffolding
- A similar scaffolding algorithm as the described at the end of the section is implemented by ScaffMatch, and described in I. Mandric, A. Zelikovsky, ScaffMatch: Scaffolding Algorithm Based on Maximum Weight Matching, RECOMB 2015: 222-223.
13.4 Gap filling
- The algorithm we describe is implemented in Gap2Seq.
Part V – Applications
14. Genomics
14.1 Variation calling
14.1.1 Calling small variants
14.1.2 Calling large variants
- Some popular tools distantly following the ideas of this section are Pindel, VariationHunter, and CLEVER.
14.2 Variation calling over pan-genomes
14.2.2 Alignments on the labeled DAG of a population
- See GCSA.
14.2.3 Evaluation of variation calling results
- See DAlign.
14.3 Haplotype assembly and phasing
- The algorithm described in the book was implemented by WhatsHap.
15. Transcriptomics
15.1 Estimating the expression of annotated transcripts
- See the Least Squares solvers implemented in MATLAB (in particular, functions lsqnonneg or lsqlin for the problem as considered in the book).
15.2 Transcript assembly
15.2.1 Short reads
- A minimum path cover approach based on read overlaps that resembles the one described in this section is implemented in Cufflinks.
15.2.2 Long reads
- See Traphlor.
- An alternative approach for exploiting long reads or partially-assembled transcripts, and also based on a minimum path cover formulation, is implemented in BRANCH.
15.2.3 Paired-end reads
- See CLASS for an alternative approach that takes into account paired-end read information.
15.3 Simultaneous assembly and expression estimation
- The minimum-cost flow solutions for Problem 15.6, and the algorithm asked for in Exercise 15.12 are implemented in Traph.
- Traph also implements Problem 15.6 using minimum-cost flows and the reduction from Insight 5.2 from convex costs to linear costs.
- The addition of a regularization term to the objective function mentioned in Insight 15.2, also solved with a minimum-cost flow problem, is implemented in flipflop.
15.4 Transcript alignment with co-linear chaining
- An implementation is included in our assembly validator package.
16. Metagenomics
16.1 Species estimation
16.1.1 Single-read methods
- The lowest common ancestor method for classifying reads is implemented in the MEGAN package.
16.1.2 Multi-read and coverage-sensitive methods
- Variants of the lowest common ancestor method, applied to contigs and read clusters, are implemented in MetaCluster 4.
- The set-cover heuristic for read classification is implemented in the MTR package.
- Taxonomic markers for read classification are implemented in MetaPhyler and in
MetaPhlAn. MetaRef is a database of markers, cores and crowns.
16.1.2 Multi-read and coverage-sensitive methods
- A procatical implementation of Problem 16.2 is by the tool MetaFlow.
16.2 Read clustering
- The pipeline for read clustering described in this section is originally implemented in MetaCluster 5. This implementation includes disjoint-set forests (Insight 16.1) and k-means clustering (Insight 16.2), but it is not designed for space-efficiency.
- The bwtCluster package implements the pipeline space-efficiently.
- An alternative clustering criterion based on rare k-mers is implemented in TOSS.
- CONCOCT is a tool for the simultaneous clustering of contigs from multiple metagenomic samples.
16.3 Comparing metagenomic samples
- Meta-Storms compares metagenomic samples using their estimated taxonomic composition.
16.3.1 Sequence-based methods
- d2-tools implements a number of dissimilarity measures on k-mers which have been applied to metagenomes and metatranscriptomes.
16.3.2 Read-based methods
- commet compares metagenomic samples using a read similarity criterion.
16.3.3 Multi-sample methods
- The following string mining implementation is space-efficient and it can be used for comparing metagenomic samples at the sequence level.
- The Distributed String Mining Framework implements the previous approach in a distributed architecture.
8.1 k-mer index
- The libmaus2 library implements a gamma encoder and decoder, and some applications of gamma-coding like a run-length encoder.
8.2 Suffix array
8.2.1 Suffix and string sorting
- The sais library provides a number of space- and time-efficient implementations.
8.3 Suffix tree
8.3.2 Construction of the suffix tree
- The libmaus2 library implements algorithms for building the LCP array from the suffix array, and from the BWT represented as a wavelet tree.
8.4 Applications of the suffix tree
8.4.1 Maximal repeats
- See REPuter, gt-repfind in GenomeTools, and the MaxRepeats iterator in SEQAN.
8.4.2 Maximal unique matches
- See MUMmer and the MUMs iterator in SEQAN.
8.4.3 Document counting
- The algorithm described in the book is implemented with compressed data structures in this frequent string mining package.
9. Burrows-Wheeler indexes
9.2 BWT index
9.2.3 Succinct suffix array
- Many variants of succinct suffix arrays, FM-indexes, and compressed suffix arrays are implemented in the Pizza&Chili corpus.
- A more generic interface can be found inside the SDSL library.
- The libmaus2 library implements the FM-index, a bidirectional index, and a succinct suffix array.
9.3 Space-efficient construction of the BWT
- See the csalib library for a space-efficient implementation in C of a different algorithm from what we describe.
- See the construct_bwt package for an even more space-efficient implementation in C++.
9.4 Bidirectional BWT index
- See the BD_BWT_index for an implementation of the bidirectional BWT index.
- Section 9.4 and 9.4.1 describe general iterators of the internal nodes of a suffix tree. See VSTree Iterator for an example interface of such an iterator.
9.5 BWT index for labeled trees
- See XBWT.
9.6 BWT index for labeled DAGs
- See GCSA.
9.7 BWT indexes for de Bruijn graphs
9.7.1 Frequency-oblivious representation
- An inefficient, pedagogical implementation in Python is available at debby.py.
- See dbgfm for an implementation of a different data structure based on the Burrows-Wheeler transform.
Part IV – Genome-Scale Algorithms
10. Read alignment
10.1 Pattern partitioning
- The extension of the pigeonhole principle considered in Exercise 10.2 is implemented in ERNE.
10.2 Dynamic programming along suffix tree paths
- Several prototype implementations of pattern partitioning, combined with the bitparallel computation along suffix tree paths, can be found in this package.
10.3 Backtracking on BWT indexes
10.3.1 Prefix pruning
- An almost verbatim implementation can be found in readaligner.
- The original implementation of prefix pruning, together with some optimizations for DNA alphabet and seed-heuristics, can be found in BWA.
- The idea of pruning by hashing combined with the BWT index, described in Insight 10.3, is implemented in ERNE.
10.3.2 Case analysis pruning with the bidirectional BWT index
- An almost verbatim implementation can be found in readaligner.
- The original implementation of case analysis pruning, together with some optimizations for DNA alphabet and seed-heuristics, can be found in Bowtie.
- The original implementation of the enhanced case analysis pruning exploiting the bidirectional BWT index, together with some optimizations for DNA alphabet and seed-heuristics, can be found in SOAP2.
10.4 Suffix filtering for approximate overlaps
- An implementation can be found in Overlap tool.
10.6 Split alignment of reads
- One of the book Insights is implemented in TopHat.
10.7 Alignment of reads to a pan-genome
10.7.1 Indexing a set of individual genomes
- An implementation of the hybrid index is available at hybrid.
10.7.2 Indexing a reference genome and a set of variations
- See GCSA.
11. Genome analysis and comparison
11.1 Space-efficient genome analysis
11.1.1 Maximal repeats
- For a pedagogical, inefficient implementation, see the BW4SA library, files maximal_repeats.h and maximal_repeats.c.
11.1.2 Maximal unique matches
- The algorithm for computing MUMs between two sequences is incorporated into the SDSL library: see maximal_unique_matches.cpp.
- For a pedagogical implementation, see also the BW4SA library, files mum.h and mum.c.
11.1.3 Maximal exact matches
- For a pedagogical implementation see the BW4SA library, files mems.h and mems.c.
- Alternative implementations are provided in backwardMEM and at Princeton University.
- Packages Unwords and MAW implement minimal absent words using alternative approaches, including suffix arrays.
- Ultraconserved elements can be seen as a biological generalization of maximal exact matches. A number of tools are available to detect them.
11.2 Comparing genomes without alignment
11.2.1 Substring and k-mer kernels
- For pedagogical implementations, see the BW4SA library.
- CVTree (source code) is a webserver implementation of k-mer kernels that is not designed for space-efficiency.
- The data-driven approach for choosing k detailed in an insight is also implemented in the FFP package.
- Some examples of k-mer extractors are: Jellyfish, DSK, KMC2, BF Counter, Tallymer.
* 11.2.2 Substring kernels with Markovian correction
- A proof-of-concept implementation based on truncated suffix trees is available in the Composerv package.
11.2.3 Substring kernels and matching statistics
- The SASK package implements substring kernels using matching statistics and suffix arrays.
- The backwardSK package implements substring kernels using matching statistics, the BWT, a balanced parentheses data structure, and the LCP array.
- kmacs implements an inexact variant of the average common substring approach described in the insight, but using a simple heuristic.
11.2.4 Mismatch kernels
- A variant of the mismatch kernel described in this section is implemented by the SEQAM lab.
- A collection of other inexact kernels is provided by Christina Leslie's lab. Such implementations are not designed for space-efficiency.
11.2.5 Compression distance
- The CompLearn suite implements the Normalized Compression Distance and uses it for clustering.
12. Genome compression
12.2 Bit-optimal Lempel-Ziv compression
- An implementation of Relative Lempel-Ziv is implemented in the RLZ package.
- An alternative approach is implemented in the GDC2 package.
13. Fragment assembly
13.2 Contig assembly
- GATB (Genome Assembly & Analysis Tool Box) implements de Bruijn graphs based on a Bloom filter, and supports reverse complements.
- This script is based on GATB and reports all unitigs in a de Bruijn graph.
13.2.1 Read error correction
- The algorithm given in the book is largely simplified from what actual tools use. Perhaps the most similar approach is implemented in LoRDEC.
13.2.3 Irreducible overlap graphs
- The algorithm described in the Insight is implemented in SGA.
13.3 Scaffolding
- A similar scaffolding algorithm as the described at the end of the section is implemented by ScaffMatch, and described in I. Mandric, A. Zelikovsky, ScaffMatch: Scaffolding Algorithm Based on Maximum Weight Matching, RECOMB 2015: 222-223.
13.4 Gap filling
- The algorithm we describe is implemented in Gap2Seq.
Part V – Applications
14. Genomics
14.1 Variation calling
14.1.1 Calling small variants
14.1.2 Calling large variants
- Some popular tools distantly following the ideas of this section are Pindel, VariationHunter, and CLEVER.
14.2 Variation calling over pan-genomes
14.2.2 Alignments on the labeled DAG of a population
- See GCSA.
14.2.3 Evaluation of variation calling results
- See DAlign.
14.3 Haplotype assembly and phasing
- The algorithm described in the book was implemented by WhatsHap.
15. Transcriptomics
15.1 Estimating the expression of annotated transcripts
- See the Least Squares solvers implemented in MATLAB (in particular, functions lsqnonneg or lsqlin for the problem as considered in the book).
15.2 Transcript assembly
15.2.1 Short reads
- A minimum path cover approach based on read overlaps that resembles the one described in this section is implemented in Cufflinks.
15.2.2 Long reads
- See Traphlor.
- An alternative approach for exploiting long reads or partially-assembled transcripts, and also based on a minimum path cover formulation, is implemented in BRANCH.
15.2.3 Paired-end reads
- See CLASS for an alternative approach that takes into account paired-end read information.
15.3 Simultaneous assembly and expression estimation
- The minimum-cost flow solutions for Problem 15.6, and the algorithm asked for in Exercise 15.12 are implemented in Traph.
- Traph also implements Problem 15.6 using minimum-cost flows and the reduction from Insight 5.2 from convex costs to linear costs.
- The addition of a regularization term to the objective function mentioned in Insight 15.2, also solved with a minimum-cost flow problem, is implemented in flipflop.
15.4 Transcript alignment with co-linear chaining
- An implementation is included in our assembly validator package.
16. Metagenomics
16.1 Species estimation
16.1.1 Single-read methods
- The lowest common ancestor method for classifying reads is implemented in the MEGAN package.
16.1.2 Multi-read and coverage-sensitive methods
- Variants of the lowest common ancestor method, applied to contigs and read clusters, are implemented in MetaCluster 4.
- The set-cover heuristic for read classification is implemented in the MTR package.
- Taxonomic markers for read classification are implemented in MetaPhyler and in MetaPhlAn. MetaRef is a database of markers, cores and crowns.
16.1.2 Multi-read and coverage-sensitive methods
- A procatical implementation of Problem 16.2 is by the tool MetaFlow.
16.2 Read clustering
- The pipeline for read clustering described in this section is originally implemented in MetaCluster 5. This implementation includes disjoint-set forests (Insight 16.1) and k-means clustering (Insight 16.2), but it is not designed for space-efficiency.
- The bwtCluster package implements the pipeline space-efficiently.
- An alternative clustering criterion based on rare k-mers is implemented in TOSS.
- CONCOCT is a tool for the simultaneous clustering of contigs from multiple metagenomic samples.
16.3 Comparing metagenomic samples
- Meta-Storms compares metagenomic samples using their estimated taxonomic composition.
16.3.1 Sequence-based methods
- d2-tools implements a number of dissimilarity measures on k-mers which have been applied to metagenomes and metatranscriptomes.
16.3.2 Read-based methods
- commet compares metagenomic samples using a read similarity criterion.
16.3.3 Multi-sample methods
- The following string mining implementation is space-efficient and it can be used for comparing metagenomic samples at the sequence level.
- The Distributed String Mining Framework implements the previous approach in a distributed architecture.