Pangenome indexing
Indexing genomic diversity with sequence graphs
A linear reference chooses one path through a species’ variation. That simplification made modern genomics possible, but it creates reference bias: divergent sequence can be harder to align, genotype, and interpret. A pangenome represents many genomes together so shared sequence and alternative alleles remain part of the reference.
Graphs are a natural representation for this diversity. Common regions can share nodes and edges; divergent haplotypes can follow different paths. The draft Human Pangenome Reference (Liao et al., 2023) demonstrates why this matters at human scale, while tools such as vg (Garrison et al., 2018) show that graph references can support read mapping and variant analysis.
Representation alone is not enough. A practical pangenome must be searchable without scanning every genome independently. The challenge is to preserve graph diversity while recovering the speed and compactness of string indexes.
Questions that drive my work
The first question is which graph structures admit efficient indexes. For a string, the Burrows–Wheeler Transform and FM-index support compact full-text search (Ferragina and Manzini, 2005). Some graph classes generalize that structure, but arbitrary graphs do not inherit the same guarantees. Biological representation therefore constrains the available search algorithms.
The second is how to recognize useful structure. Wheeler graphs (Gagie et al., 2017) are edge-labeled graphs whose nodes admit an order compatible with edge labels, extending BWT-style indexing to graphs. Recognizing that order is NP-complete and hard to approximate (Gibney and Thankachan, 2019). The class is practical only if we can recognize members, produce an order, and explain failures at realistic sizes.
The third is how to trust a solver-backed algorithm. Heuristics and satisfiability solvers can make hard problems practical, but speed is not correctness. Every accepted graph should have an independently checkable order, and optimizations should preserve decisions.
I am also interested in repair rather than diagnosis alone: when a graph lacks an indexable structure, can we construct a related graph while preserving the biological paths that matter?
Directions I have explored
Making Wheeler graphs usable
I developed the Wheeler Graph Toolkit (WGT) (Chao et al., 2023) so researchers could generate, recognize, and visualize Wheeler graphs. The project addressed a practical gap between theoretical recognition algorithms and a complete workflow for deciding and inspecting an input graph.
WGT’s recognizer, Wheelie, combines a fast renaming heuristic with exact solvers. Incoming labels constrain where a node can appear; propagating those constraints resolves much of the order before search begins. Remaining ambiguity goes to permutation search or an SMT solver. The pattern is simple: use inexpensive structure to shrink the problem, then apply an exact method where uncertainty remains.
Generators provide known examples and counterexamples, while visualization makes the structure behind a solver result inspectable.
Verification, scaling, and graph repair
I have continued this work by separating correctness, capacity, and speed. Current methods compare recognition with an independent brute-force oracle on small graphs, re-validate every emitted order, and use regression tests to ensure faster encodings do not change decisions.
On the algorithmic side, lazy constraint generation avoids materializing the full solver encoding up front. The solver begins with a relaxation, checks candidate orders, and adds only constraints exposed by counterexamples. A separate repair direction unfolds non-Wheeler directed acyclic graphs into Wheeler graphs while preserving their represented path strings. These results are documented in the linked technical report; I keep them separate from the published WGT paper so the original contribution and ongoing work remain clear.
Future directions
As pangenomes grow, a promising goal is to make graph diversity searchable without hiding the assumptions that enable efficient indexing. The next steps are to scale verified Wheeler-graph recognition, refine lazy solver formulations, and develop repair algorithms that preserve biologically meaningful paths when an input graph is not directly indexable. I want to connect these guarantees to practical pangenome indexes, so researchers can understand which sequences are preserved, why a graph supports a given search method, and how failures are detected or repaired.
References
- Ferragina, P. and Manzini, G. Indexing compressed text. Journal of the ACM (2005). doi:10.1145/1082036.1082039
- Garrison, E. et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology (2018). doi:10.1038/nbt.4227
- Gagie, T., Manzini, G., and Sirén, J. Wheeler graphs: a framework for BWT-based data structures. Theoretical Computer Science (2017). doi:10.1016/j.tcs.2017.06.016
- Gibney, D. and Thankachan, S. V. On the hardness and inapproximability of recognizing Wheeler graphs. ESA (2019). doi:10.4230/LIPIcs.ESA.2019.51
- Chao, K.-H. et al. WGT: tools and algorithms for recognizing, visualizing, and generating Wheeler graphs. iScience (2023). doi:10.1016/j.isci.2023.107402
- Liao, W.-W. et al. A draft human pangenome reference. Nature (2023). doi:10.1038/s41586-023-05896-x