Dear Friends,
I’m excited to invite you to my public thesis defense talk. It would mean a lot to have you there as I share this special milestone in my PhD journey!
The field of computational biology is being reshaped by two synergistic revolutions: ever more accurate high-throughput sequencing and advanced computational methods. At this nexus, deep learning now exploits the exponential growth of genomic data to extract insights once out of reach. Two complementary advances exemplify the shift: (i) Transformer-based models that predict cell-type-specific gene expression and chromatin profiles directly from long-range DNA context, and (ii) foundational DNA language models, trained at scale with self-supervision, that learn universal, cross-species nucleotide embeddings. Together, high-quality assemblies, genome annotation, and genome-wide prediction methods are transforming how we annotate, decode, and even engineer the language of genomes.
In this thesis, we introduce a suite of computationally efficient methods that systematically bridge raw nucleotide sequence to biological function. First, we present Han1, the first gapless, reference‑quality assembly and annotation of a Southern Han Chinese genome, achieved by integrating ultra‑long Oxford Nanopore and high‑fidelity PacBio reads with the T2T‑CHM13 reference. Next, the Wheeler Graph Toolkit (WGT) combines permutation heuristics with satisfiability‑modulo‑theories (SMT) solvers to recognize and visualize Wheeler graphs (sketched below), laying a theoretical foundation for pangenome indexing. We then describe LiftOn, a hybrid liftover tool that fuses protein‑to‑genome and DNA‑to‑genome alignments to refine exon boundaries, correct frameshifts, and identify additional gene copies, outperforming existing annotation‑transfer methods on divergent assemblies.
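For readers curious about the Wheeler‑graph recognition step, here is a minimal, illustrative sketch of how the ordering axioms can be handed to an SMT solver. It assumes the z3-solver Python package; the toy graph and variable names are mine for illustration and do not reflect WGT's actual interface.

```python
# Minimal sketch: does a small edge-labeled graph admit a Wheeler ordering?
# Encodes the ordering axioms for an SMT solver (z3-solver assumed installed).
# The toy graph and names are illustrative, not WGT's actual API.
from z3 import Int, Solver, Distinct, Implies, And, sat

# Toy graph: edges as (source, target, label); labels compared lexicographically.
nodes = ["v0", "v1", "v2", "v3"]
edges = [("v0", "v1", "A"), ("v0", "v2", "C"), ("v1", "v3", "C"), ("v2", "v3", "C")]

order = {v: Int(f"pi_{v}") for v in nodes}      # candidate position of each node
solver = Solver()
solver.add(Distinct(*order.values()))            # the ordering is a permutation
solver.add([And(order[v] >= 0, order[v] < len(nodes)) for v in nodes])

# Nodes with in-degree zero must precede all nodes with incoming edges.
has_in = {v: any(t == v for _, t, _ in edges) for v in nodes}
for v in nodes:
    for w in nodes:
        if not has_in[v] and has_in[w]:
            solver.add(order[v] < order[w])

# Wheeler axioms for every pair of edges (u, v, a) and (u2, v2, a2):
#   a < a2              =>  v precedes v2
#   a == a2 and u < u2  =>  v does not come after v2
for (u, v, a) in edges:
    for (u2, v2, a2) in edges:
        if a < a2:
            solver.add(order[v] < order[v2])
        elif a == a2:
            solver.add(Implies(order[u] < order[u2], order[v] <= order[v2]))

if solver.check() == sat:
    model = solver.model()
    ranking = sorted(nodes, key=lambda v: model[order[v]].as_long())
    print("Wheeler ordering found:", ranking)
else:
    print("No Wheeler ordering exists for this graph.")
```

An unsatisfiable result certifies that no node ordering can satisfy the Wheeler axioms for that graph, which is exactly the recognition question the toolkit addresses at scale.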
Building on these genomic substrates, we develop two complementary deep‑learning frameworks for splice‑site prediction: OpenSpliceAI, a PyTorch re‑implementation of SpliceAI that delivers faster inference and seamless retraining, and Splam, a compact residual convolutional neural network (see the sketch after this paragraph) that accurately predicts splice junctions from local sequence context. Finally, Shorkie demonstrates the power of self‑supervised foundation models by fine‑tuning a multi‑species fungal DNA language model on time‑course yeast RNA‑Seq data, achieving state‑of‑the‑art gene‑expression and variant‑effect prediction.
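To give a flavor of the residual‑CNN idea behind a compact splice‑junction model, here is a minimal PyTorch sketch. The layer sizes, dilation schedule, 400‑nt window, and class names are illustrative assumptions, not Splam's published architecture or API.

```python
# Minimal sketch of a residual 1-D CNN over one-hot DNA, in the spirit of a
# compact splice-junction classifier. All hyperparameters are illustrative.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=11,
                      padding=5 * dilation, dilation=dilation),
            nn.BatchNorm1d(channels), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=11,
                      padding=5 * dilation, dilation=dilation),
        )

    def forward(self, x):
        return x + self.block(x)   # skip connection keeps gradients well-behaved

class JunctionCNN(nn.Module):
    """Scores each position of a sequence window as neither/donor/acceptor."""
    def __init__(self, channels: int = 64, n_blocks: int = 4):
        super().__init__()
        self.stem = nn.Conv1d(4, channels, kernel_size=11, padding=5)
        self.blocks = nn.Sequential(
            *[ResidualBlock(channels, dilation=2 ** i) for i in range(n_blocks)]
        )
        self.head = nn.Conv1d(channels, 3, kernel_size=1)

    def forward(self, one_hot):           # one_hot: (batch, 4, seq_len)
        return self.head(self.blocks(self.stem(one_hot)))

# Example: score a batch of 400-nt windows around candidate junctions.
model = JunctionCNN()
x = torch.randn(8, 4, 400)                # stand-in for one-hot encoded DNA
logits = model(x)                         # (8, 3, 400) per-position class logits
probs = logits.softmax(dim=1)
```

Dilated convolutions widen the receptive field while keeping the parameter count small, which is what lets a local‑context model of this kind stay compact.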