We provide here a non-exhaustive list of implementations of the algorithms and data structures described in the book. We mainly list prototypes that are useful for pedagogical purposes. Where available, we also include high-performance, mature implementations that are ready to be used in genome-scale applications.

If you or your students implement some algorithms from the book, please let us know via email and we will link to your code from this page. We do our best to ensure that linked code is shared in good faith, but we do not fully check its efficiency or correctness.

Part I – Preliminaries

3. Data structures

3.1 Dynamic range minimum queries

File rbt_max.cpp inside the NN50-calculator implements fully dynamic range maximum queries using a red-black tree.
The libmaus2 library implements static range minimum queries using lookup tables, dynamic programming and tree encodings.
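As a rough illustration of the static case, the sketch below precomputes the minima of all power-of-two windows by dynamic programming (a sparse table) and answers each query from two overlapping windows. It is not taken from libmaus2; the names build_sparse_table and range_min are hypothetical.

```python
def build_sparse_table(values):
    """Return table where table[j][i] is the minimum of values[i : i + 2**j]."""
    n = len(values)
    table = [list(values)]
    j = 1
    while (1 << j) <= n:
        prev = table[j - 1]
        half = 1 << (j - 1)
        # A window of length 2**j is the union of two windows of length 2**(j-1).
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - (1 << j) + 1)])
        j += 1
    return table


def range_min(table, lo, hi):
    """Minimum of values[lo..hi] (both inclusive) in constant time."""
    j = (hi - lo + 1).bit_length() - 1
    # Two possibly overlapping windows of length 2**j cover [lo, hi].
    return min(table[j][lo], table[j][hi - (1 << j) + 1])
```

The table takes O(n log n) words, which is the price paid for constant-time queries; the tree encodings mentioned above aim at reducing this space.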

3.2 Bitvector rank and select operations

See the rank_support* and select_support* files in the SDSL library.
See the rank folder in the libmaus2 library.

3.3 Wavelet tree

See the wt* files in the SDSL library.
The libmaus2 library implements static and dynamic wavelet trees, as well as external-memory and parallel construction algorithms.
The libcds2 library implements the wavelet tree and the wavelet matrix.

4. Graphs

4.1 Directed acyclic graphs (DAGs)

4.1.1 Topological ordering

See the topologicalSort function in the LEMON library.
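For readers who want a self-contained prototype, a minimal sketch of one standard way to compute a topological ordering (repeatedly removing sources) follows. It is independent of LEMON; the function name topological_order and the input format (nodes numbered from 0, arcs as pairs) are assumptions made for illustration.

```python
from collections import deque


def topological_order(num_nodes, arcs):
    """Return the nodes 0..num_nodes-1 in a topological order of the DAG."""
    out_neighbors = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for u, v in arcs:
        out_neighbors[u].append(v)
        in_degree[v] += 1
    # Start from the sources, i.e. nodes without incoming arcs.
    queue = deque(v for v in range(num_nodes) if in_degree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in out_neighbors[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)
    if len(order) != num_nodes:
        raise ValueError("the graph contains a directed cycle")
    return order
```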

4.2 Arbitrary directed graphs

4.2.1 Eulerian paths

See the connectivity functions in the LEMON library.

4.2.2 Shortest paths and the Bellman-Ford method

See the implementation of the Bellman-Ford algorithm in the LEMON library.
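A minimal pedagogical sketch of the Bellman-Ford method is given below. It is not the LEMON interface; the function name bellman_ford and the representation of arcs as (tail, head, cost) triples are assumptions.

```python
def bellman_ford(num_nodes, arcs, source):
    """Shortest-path costs from source; arcs is a list of (u, v, cost)."""
    infinity = float("inf")
    dist = [infinity] * num_nodes
    dist[source] = 0
    # At most num_nodes - 1 rounds of relaxing every arc are needed.
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, cost in arcs:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
                changed = True
        if not changed:
            break
    # Any arc that can still be relaxed lies on or after a negative cycle.
    for u, v, cost in arcs:
        if dist[u] + cost < dist[v]:
            raise ValueError("negative-cost cycle reachable from the source")
    return dist
```

Nodes that cannot be reached from the source keep the cost float("inf").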

5. Network flows

5.2 Minimum-cost flows and circulations

See the minimum-cost flow algorithms in the LEMON library.
Other algorithms and implementations have been ported under a standardized interface in the MCFClass project.
IBFS is a solver for the maximum flow problem. It and other maximum flow solvers have been empirically evaluated on computer vision problems.

5.2.2 A pseudo-polynomial algorithm

See the minimum-mean cost cycle algorithms implemented in the LEMON library.

5.3 Bipartite matching problems

5.3.1 Perfect matching

The LEMON library provides an efficient implementation of Edmonds' maximum weight perfect matching algorithm (not based on minimum-cost flows), and a number of other matching algorithms.

5.4 Covering problems

5.4.2 Minimum path cover in a DAG

The reduction to a minimum-cost flow problem presented in the book is part of the MPC-Solver.

Part II – Fundamentals of Biological Sequence Analysis

6. Alignments

The libmaus2 library implements a number of algorithms for local, global and sparse sequence alignment, overlap alignment, and LCS computation.

6.1 Edit distance

6.1.1 Edit distance computation

The example in the book was generated with the script dp2tikz.py.
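For quick experiments, unit-cost edit distance can be computed with the standard dynamic programming recurrence, keeping only two rows of the matrix. The sketch below is independent of dp2tikz.py, and the function name edit_distance is an assumption.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between strings a and b."""
    # prev[j] is the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + (char_a != char_b)))  # (mis)match
        prev = curr
    return prev[-1]
```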

* 6.1.3 Myers bitparallel algorithm

See EditDistance.h in this read alignment implementation.
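To give a flavour of the bitparallel idea, the sketch below encodes the vertical differences of a whole dynamic programming column in two bitvectors and updates them with a constant number of word operations per text character. It computes the global edit distance and follows the widely used reformulation by Hyyrö, which may differ in details from the presentation in the book and from EditDistance.h. The function name is an assumption, and Python integers stand in for machine words.

```python
def myers_edit_distance(pattern, text):
    """Unit-cost edit distance via bitparallel column updates."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    # peq[c] has bit i set if and only if pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv = mask, 0   # vertical deltas: +1 bits and -1 bits
    score = m          # value in the last row of the current column
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Row 0 always grows by one (global distance), hence the "| 1".
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score
```

In a language with fixed-width words, the pattern would be limited to the word size or split into blocks; the explicit masking above only compensates for Python's unbounded integers.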

6.2 Longest common subsequence

6.2.1 Sparse dynamic programming

The algorithm can be found in this prototype targeted to a music retrieval application.

6.3 Approximate string matching

6.4 Biological sequence alignment

6.5 Gene alignment

Some routines are implemented in the codonalignment module in Biopython, but not quite as described in the book.

6.6 Multiple alignment

6.6.2 Dynamic programming

The MSA package for multiple sequence alignment implements a speed-up for optimal multiple alignment by Carrillo and Lipman.

6.6.4 Progressive multiple alignment

See the Align package in Biopython.

6.6.5 DAG alignment

PAGAN implements DAG-based progressive alignment.

6.6.6 Jumping alignment

See the JAli project.

7. Hidden Markov models (HMMs)

7.2 The Viterbi algorithm

See the HMM package in Biopython.
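For a dependency-free prototype, the Viterbi recurrence can be sketched as follows. This is not the Biopython interface; the function name viterbi and the dictionary-based model representation are assumptions, and plain probabilities are used instead of logarithms, so the sketch is only suitable for short sequences.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its joint probability)."""
    # Each column maps a state to (best probability, predecessor state).
    columns = [{s: (start_p[s] * emit_p[s][observations[0]], None)
                for s in states}]
    for symbol in observations[1:]:
        last = columns[-1]
        column = {}
        for s in states:
            prob, prev = max((last[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (prob * emit_p[s][symbol], prev)
        columns.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    best_prob = columns[-1][state][0]
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path, best_prob
```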

7.3 The forward and backward algorithms

See the HMM package in Biopython.

7.4 Estimating HMM parameters

See the HMM package in Biopython.

Part III – Genome-Scale Index Structures

8. Classical indexes

8.1 k-mer index

The libmaus2 library implements a gamma encoder and decoder, and some applications of gamma-coding like a run-length encoder.
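As a small illustration, and assuming that gamma-coding refers to the Elias gamma code, a positive integer x is written as its binary representation preceded by one zero for each bit after the leading one. The sketch below works on strings of '0' and '1' for readability rather than on packed bits; it is not the libmaus2 interface and the function names are assumptions.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer, as a string of bits."""
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of well-formed gamma codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":   # unary part: number of bits after the leading 1
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values
```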

8.2 Suffix array

8.2.1 Suffix and string sorting

The sais library provides a number of space- and time-efficient implementations.

8.3 Suffix tree

8.3.2 Construction of the suffix tree

The libmaus2 library implements algorithms for building the LCP array from the suffix array, and from the BWT represented as a wavelet tree.
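One classical way of building the LCP array from the suffix array is the linear-time algorithm by Kasai et al., sketched below. We do not claim that this is the variant used in libmaus2. The naive suffix_array helper is included only to make the example self-contained, and both function names are assumptions.

```python
def suffix_array(text):
    """Naive suffix array construction, for illustration only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[r] is the longest common prefix length of suffixes sa[r-1] and sa[r]."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    # Visit suffixes in text order: the LCP value drops by at most one
    # when moving from suffix i to suffix i + 1.
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp
```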

8.4 Applications of the suffix tree

8.4.1 Maximal repeats

See REPuter, gt-repfind in GenomeTools, and the MaxRepeats iterator in SEQAN.

8.4.2 Maximal unique matches

See MUMmer and the MUMs iterator in SEQAN.

8.4.3 Document counting

The algorithm described in the book is implemented with compressed data structures in this frequent string mining package.

9. Burrows-Wheeler indexes

9.2 BWT index

9.2.3 Succinct suffix array

Many variants of succinct suffix arrays, FM-indexes, and compressed suffix arrays are implemented in the Pizza&Chili corpus.
A more generic interface can be found inside the SDSL library.
The libmaus2 library implements the FM-index, a bidirectional index, and a succinct suffix array.

9.3 Space-efficient construction of the BWT

See the csalib library for a space-efficient implementation in C of a different algorithm from what we describe.
See the construct_bwt package for an even more space-efficient implementation in C++.

9.4 Bidirectional BWT index

See the BD_BWT_index for an implementation of the bidirectional BWT index.
Sections 9.4 and 9.4.1 describe general iterators over the internal nodes of a suffix tree. See VSTree Iterator for an example interface of such an iterator.

9.5 BWT index for labeled trees

See XBWT.

9.6 BWT index for labeled DAGs

See GCSA.

9.7 BWT indexes for de Bruijn graphs

9.7.1 Frequency-oblivious representation

An inefficient, pedagogical implementation in Python is available at debby.py.
See dbgfm for an implementation of a different data structure based on the Burrows-Wheeler transform.

Part IV – Genome-Scale Algorithms

10. Read alignment

10.1 Pattern partitioning

The extension of the pigeonhole principle considered in Exercise 10.2 is implemented in ERNE.

10.2 Dynamic programming along suffix tree paths

Several prototype implementations of pattern partitioning, combined with the bitparallel computation along suffix tree paths, can be found in this package.

10.3 Backtracking on BWT indexes

10.3.1 Prefix pruning

An almost verbatim implementation can be found in readaligner.
The original implementation of prefix pruning, together with some optimizations for the DNA alphabet and seed heuristics, can be found in BWA.
The idea of pruning by hashing combined with the BWT index, described in Insight 10.3, is implemented in ERNE.

10.3.2 Case analysis pruning with the bidirectional BWT index

An almost verbatim implementation can be found in readaligner.
The original implementation of case analysis pruning, together with some optimizations for the DNA alphabet and seed heuristics, can be found in Bowtie.
The original implementation of the enhanced case analysis pruning exploiting the bidirectional BWT index, together with some optimizations for the DNA alphabet and seed heuristics, can be found in SOAP2.

10.4 Suffix filtering for approximate overlaps

An implementation can be found in the Overlap tool.

10.6 Split alignment of reads

One of the book's Insights is implemented in TopHat.

10.7 Alignment of reads to a pan-genome

10.7.1 Indexing a set of individual genomes

An implementation of the hybrid index is available at hybrid.

10.7.2 Indexing a reference genome and a set of variations

See GCSA.

11. Genome analysis and comparison

11.1 Space-efficient genome analysis

11.1.1 Maximal repeats

For a pedagogical, inefficient implementation, see the BW4SA library, files maximal_repeats.h and maximal_repeats.c.

11.1.2 Maximal unique matches

The algorithm for computing MUMs between two sequences is incorporated into the SDSL library: see maximal_unique_matches.cpp.
For a pedagogical implementation, see also the BW4SA library, files mum.h and mum.c.

11.1.3 Maximal exact matches

For a pedagogical implementation see the BW4SA library, files mems.h and mems.c.
Alternative implementations are provided in backwardMEM and at Princeton University.
Packages Unwords and MAW implement minimal absent words using alternative approaches, including suffix arrays.
Ultraconserved elements can be seen as a biological generalization of maximal exact matches. A number of tools are available to detect them.

11.2 Comparing genomes without alignment

11.2.1 Substring and k-mer kernels

For pedagogical implementations, see the BW4SA library.
CVTree (source code) is a webserver implementation of k-mer kernels that is not designed for space-efficiency.
The data-driven approach for choosing k detailed in an insight is also implemented in the FFP package.
Some examples of k-mer extractors are: Jellyfish, DSK, KMC2, BF Counter, Tallymer.
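For very small inputs, a k-mer kernel can be prototyped directly from k-mer counts, as in the hedged sketch below. It uses plain hash tables rather than the space-efficient indexes of the book, and the function names (kmer_counts, kmer_kernel, kmer_cosine) are assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Number of occurrences of every k-mer of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_kernel(s, t, k):
    """Inner product of the k-mer count vectors of s and t."""
    counts_s, counts_t = kmer_counts(s, k), kmer_counts(t, k)
    # A Counter returns 0 for k-mers that do not occur in t.
    return sum(count * counts_t[kmer] for kmer, count in counts_s.items())


def kmer_cosine(s, t, k):
    """Kernel normalized to [0, 1]; 0.0 if either string has no k-mer."""
    norm = sqrt(kmer_kernel(s, s, k) * kmer_kernel(t, t, k))
    return kmer_kernel(s, t, k) / norm if norm else 0.0
```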

* 11.2.2 Substring kernels with Markovian correction

A proof-of-concept implementation based on truncated suffix trees is available in the Composerv package.

11.2.3 Substring kernels and matching statistics

The SASK package implements substring kernels using matching statistics and suffix arrays.
The backwardSK package implements substring kernels using matching statistics, the BWT, a balanced parentheses data structure, and the LCP array.
kmacs implements an inexact variant of the average common substring approach described in the insight, but using a simple heuristic.

11.2.4 Mismatch kernels

A variant of the mismatch kernel described in this section is implemented by the SEQAM lab.
A collection of other inexact kernels is provided by Christina Leslie's lab. Such implementations are not designed for space-efficiency.

11.2.5 Compression distance

The CompLearn suite implements the Normalized Compression Distance and uses it for clustering.
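The Normalized Compression Distance itself is easy to prototype with any off-the-shelf compressor. The sketch below uses zlib from the Python standard library as a stand-in; this is an assumption made for illustration and not necessarily a compressor used by CompLearn, and the function names are hypothetical.

```python
import zlib


def compressed_size(data):
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Since zlib only looks for repeats within a 32 KB window, this stand-in is not meaningful for genome-scale inputs.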

12. Genome compression

12.2 Bit-optimal Lempel-Ziv compression

Relative Lempel-Ziv is implemented in the RLZ package.
An alternative approach is implemented in the GDC2 package.

13. Fragment assembly

13.2 Contig assembly

GATB (Genome Assembly & Analysis Tool Box) implements de Bruijn graphs based on a Bloom filter, and supports reverse complements.
This script is based on GATB and reports all unitigs in a de Bruijn graph.

13.2.1 Read error correction

The algorithm given in the book is largely simplified from what actual tools use. Perhaps the most similar approach is implemented in LoRDEC.

13.2.3 Irreducible overlap graphs

The algorithm described in the Insight is implemented in SGA.

13.3 Scaffolding

A scaffolding algorithm similar to the one described at the end of the section is implemented in ScaffMatch, and is described in I. Mandric, A. Zelikovsky, ScaffMatch: Scaffolding Algorithm Based on Maximum Weight Matching, RECOMB 2015: 222-223.

13.4 Gap filling

The algorithm we describe is implemented in Gap2Seq.

Part V – Applications

14. Genomics

14.1 Variation calling

14.1.1 Calling small variants

Popular workflows for variant calling are implemented in GATK and SAMtools.

14.1.2 Calling large variants

Some popular tools distantly following the ideas of this section are Pindel, VariationHunter, and CLEVER.

14.2 Variation calling over pan-genomes

14.2.2 Alignments on the labeled DAG of a population

See GCSA.

14.2.3 Evaluation of variation calling results

See DAlign.

14.3 Haplotype assembly and phasing

The algorithm described in the book is implemented in WhatsHap.

15. Transcriptomics

15.1 Estimating the expression of annotated transcripts

See the Least Squares solvers implemented in MATLAB (in particular, functions lsqnonneg or lsqlin for the problem as considered in the book).

15.2 Transcript assembly

15.2.1 Short reads

A minimum path cover approach based on read overlaps that resembles the one described in this section is implemented in Cufflinks.

15.2.2 Long reads

See Traphlor.
An alternative approach for exploiting long reads or partially-assembled transcripts, and also based on a minimum path cover formulation, is implemented in BRANCH.

15.2.3 Paired-end reads

See CLASS for an alternative approach that takes into account paired-end read information.

15.3 Simultaneous assembly and expression estimation

The minimum-cost flow solutions for Problem 15.6, and the algorithm asked for in Exercise 15.12, are implemented in Traph.
Traph also implements Problem 15.6 using minimum-cost flows and the reduction from Insight 5.2 from convex costs to linear costs.
The addition of a regularization term to the objective function mentioned in Insight 15.2, also solved as a minimum-cost flow problem, is implemented in flipflop.

15.4 Transcript alignment with co-linear chaining

An implementation is included in our assembly validator package.

16. Metagenomics

16.1 Species estimation

16.1.1 Single-read methods

The lowest common ancestor method for classifying reads is implemented in the MEGAN package.
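The core step of the method can be sketched in a few lines: a read that matches several taxa is assigned to their lowest common ancestor in the taxonomy. The code below is not the MEGAN implementation; the function name, the parent-pointer representation of the taxonomy, and the toy taxon names in the usage example are all assumptions.

```python
def lowest_common_ancestor(parent, taxa):
    """LCA of the given taxa; parent maps each node to its parent (root to None)."""
    def path_to_root(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path

    taxa = list(taxa)
    # Candidates are kept ordered from the deepest node up to the root.
    common = path_to_root(taxa[0])
    for node in taxa[1:]:
        ancestors = set(path_to_root(node))
        common = [a for a in common if a in ancestors]
    return common[0]


# Toy taxonomy, for illustration only.
toy_parent = {
    "root": None,
    "family": "root",
    "genus1": "family",
    "genus2": "family",
    "species1a": "genus1",
    "species1b": "genus1",
    "species2a": "genus2",
}
read_hits = ["species1a", "species1b"]
assigned_taxon = lowest_common_ancestor(toy_parent, read_hits)
```

Here the read is assigned to genus1; if it also matched species2a, it would be assigned to family.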

16.1.2 Multi-read and coverage-sensitive methods

Variants of the lowest common ancestor method, applied to contigs and read clusters, are implemented in MetaCluster 4.
The set-cover heuristic for read classification is implemented in the MTR package.
Taxonomic markers for read classification are implemented in MetaPhyler and in MetaPhlAn. MetaRef is a database of markers, cores and crowns.