Genome annotation

Building accurate genome annotations

A genome assembly provides sequence; an annotation provides the biological map that makes the sequence useful. It identifies genes, transcripts, exons, coding regions, and their relationships. Resources such as NCBI RefSeq (O’Leary et al., 2016) provide annotations for widely studied organisms, but assemblies are produced faster than annotations can be curated.

One response is annotation lift-over: transfer a trusted annotation from a reference genome to a new assembly. I focus on this problem through LiftOn, a new algorithm I proposed to combine DNA and protein alignments. DNA preserves locus and transcript structure between related genomes; protein conservation helps recover coding structure when nucleotide sequences diverge. LiftOn turns those complementary signals into one annotation rather than treating either alignment as sufficient on its own.

Questions that drive my work

The first question is how to combine complementary evidence explicitly. Liftoff (Shumate and Salzberg, 2021) maps gene models through DNA alignment and preserves their hierarchy. Protein-to-genome aligners such as miniprot (Li, 2023) remain informative across greater evolutionary distance, but their models can be fragmented and do not preserve every non-coding feature. A useful algorithm needs a clear rule for choosing supported coding segments while retaining the broader annotation structure.

The second is what the algorithm should optimize and validate. A transferred model may map to the expected locus yet contain a frameshift, premature stop, or incorrect coding boundary. Protein identity, open-reading-frame integrity, overlap resolution, and gene-copy recovery measure different aspects of quality. The objective should be explicit, and the emitted GFF should remain structurally valid.
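
The coding-integrity checks named above can be made concrete with a small sketch. It is only an illustration, not LiftOn's validation code: it assumes the standard genetic code, assumes the input is a spliced coding sequence already oriented 5' to 3', and expects a complete CDS with both start and stop codons. The names `translate` and `orf_problems` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def orf_problems(cds):
    """Return a list of integrity problems found in a spliced coding sequence."""
    problems = []
    if len(cds) % 3 != 0:
        # A length that is not a multiple of 3 usually signals a frameshift
        # or a truncated model.
        problems.append("length_not_multiple_of_3")
    protein = translate(cds)
    if not protein.startswith("M"):
        problems.append("missing_start_codon")
    if "*" in protein[:-1]:
        problems.append("premature_stop")
    if not protein.endswith("*"):
        problems.append("missing_stop_codon")
    return problems


print(orf_problems("ATGGCCTAA"))     # intact three-codon ORF
print(orf_problems("ATGTAAGCCTAA"))  # in-frame stop before the final codon
```

A check like this says nothing about whether the model sits at the right locus, which is why it complements protein identity and copy recovery instead of replacing them.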

The third is scale and reproducibility. A method must finish whole genomes, handle unusually large and repetitive genes, preserve non-coding features, and produce deterministic output. Accuracy on a small benchmark is not enough if the implementation cannot complete the assemblies researchers need to annotate. General annotation systems such as BRAKER3 (Brůna et al., 2024) integrate experimental and comparative evidence for de novo annotation; lift-over addresses the complementary case where a trusted reference annotation already exists.
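
One ingredient of deterministic output is a fixed ordering of emitted records. The sketch below shows one possible approach for GFF3 lines, under stated assumptions: each feature has at most one `Parent` value (real GFF3 allows comma-separated parents), parent IDs are unique, and sequence names are sorted lexicographically. The function names are hypothetical and this is not how LiftOn writes its output.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse the GFF3 attributes column (key=value pairs separated by ';')."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def sort_gff3(lines):
    """Return feature lines in a reproducible parent-before-child order."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attributes(cols[8])
        records.append({
            "line": line.rstrip("\n"),
            # Sort by sequence, start, longest feature first, type, then ID.
            "key": (cols[0], int(cols[3]), -int(cols[4]), cols[2], attrs.get("ID", "")),
            "id": attrs.get("ID"),
            "parent": attrs.get("Parent"),
        })

    known_ids = {r["id"] for r in records if r["id"]}
    children = defaultdict(list)
    roots = []
    for r in records:
        # A feature whose Parent is missing from the input is treated as a root.
        if r["parent"] in known_ids:
            children[r["parent"]].append(r)
        else:
            roots.append(r)

    ordered = []

    def emit(record):
        ordered.append(record["line"])
        for child in sorted(children.get(record["id"], []), key=lambda r: r["key"]):
            emit(child)

    for root in sorted(roots, key=lambda r: r["key"]):
        emit(root)
    return ordered
```

Because the order depends only on record content, the same set of features yields the same file regardless of the order in which worker processes finished.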

Directions I have explored

LiftOn: combining DNA and protein alignments

I developed LiftOn (Chao et al., 2025) to improve annotation lift-over across both assemblies and species. It starts with DNA-based models from Liftoff and protein-based candidates from miniprot. LiftOn translates each candidate, aligns it with the reference protein, and uses a protein-maximization chaining algorithm to select the better-supported coding segments. An open-reading-frame search can then repair remaining coding disruptions. The algorithm also resolves overlapping loci and searches for additional gene copies that a one-to-one transfer can miss.
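
The per-segment selection idea can be illustrated with a deliberately simplified toy. It is not the published LiftOn chaining algorithm: it assumes both candidate proteins are already aligned column-by-column to the reference (deletions shown as `-`, no insertions), that the non-empty segment boundaries are supplied in reference coordinates, and that ties keep the DNA-derived segment. All names here are hypothetical.

```python
def segment_identity(ref, cand, start, end):
    """Fraction of reference residues in [start, end) matched by the candidate."""
    matches = sum(1 for r, c in zip(ref[start:end], cand[start:end]) if r == c)
    return matches / (end - start)


def chain_by_protein_identity(ref, dna_aln, prot_aln, segments):
    """Pick, for each segment, the candidate that better matches the reference."""
    if not (len(ref) == len(dna_aln) == len(prot_aln)):
        raise ValueError("candidates must be aligned to the reference columns")
    choices, chained = [], []
    for start, end in segments:
        dna_id = segment_identity(ref, dna_aln, start, end)
        prot_id = segment_identity(ref, prot_aln, start, end)
        # Assumption: ties keep the DNA-derived segment.
        source = "protein" if prot_id > dna_id else "dna"
        picked = prot_aln if source == "protein" else dna_aln
        choices.append(source)
        chained.append(picked[start:end].replace("-", ""))
    return choices, "".join(chained)


ref = "MKTAYIAKQR"
dna_aln = "MKTAYLLPSG"   # DNA-based model: second half disrupted
prot_aln = "M--AYIAKQR"  # protein-based model: residues missing in first half
choices, protein = chain_by_protein_identity(ref, dna_aln, prot_aln, [(0, 5), (5, 10)])
print(choices, protein)
```

In this example neither candidate alone reproduces the reference protein, but the chained result does, which is the motivation for selecting segments instead of whole models.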

The central idea is not to average DNA and protein evidence but to preserve their distinct strengths and reconcile disagreements through a protein-centered objective. In the published evaluation, LiftOn transferred annotations within species and across more divergent species pairs, where DNA-only mapping becomes less reliable. My continuing work focuses on whole-genome robustness, bounded-memory alignment, output validation, and broader feature support so that the same algorithmic principle remains practical at scale.

Applying lift-over to complete human genomes

My annotation work on complete human genomes both motivated LiftOn and now applies it. For CHESS 3 (Varabyou et al., 2023), my contribution focused on lifting the CHESS 3 annotation from GRCh38 to T2T-CHM13. In the earlier Han1 project (Chao et al., 2023), we used Liftoff in a two-stage workflow to transfer annotation to a gapless Southern Han Chinese genome, with separate handling for repetitive ribosomal DNA arrays. Han1 predates LiftOn, but it exposed the practical demands of annotating a newly completed genome.

The complete diploid HG002 benchmark (Hansen et al., 2025) provided a later application. The project generated annotations for the maternal and paternal haplotypes and used LiftOn, together with miniprot, to help identify additional gene copies from translated MANE transcripts. My contribution was to this annotation effort. This is the setting LiftOn is designed for: carrying trusted gene knowledge onto a more complete genome while preserving differences that may represent real copy-number variation.
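
As a rough illustration of what flagging additional gene copies can involve, the sketch below keeps protein-alignment hits that are well supported and do not overlap a locus that is already annotated or already accepted. It is not the procedure used in the HG002 project or in LiftOn: the record layout, the same-strand overlap rule, and the `min_identity=0.9` threshold are all assumptions chosen for the example.

```python
def overlaps(a, b):
    """True if two loci share a sequence, a strand, and at least one base."""
    return (a["seqid"] == b["seqid"] and a["strand"] == b["strand"]
            and a["start"] <= b["end"] and b["start"] <= a["end"])


def candidate_extra_copies(annotated, hits, min_identity=0.9):
    """Keep well-supported hits that do not overlap known or accepted loci."""
    kept = []
    # Best-supported hits are considered first; remaining keys make ties reproducible.
    ranked = sorted(hits, key=lambda h: (-h["identity"], h["seqid"],
                                         h["start"], h["end"], h["gene"]))
    for hit in ranked:
        if hit["identity"] < min_identity:
            continue  # hypothetical threshold, not a published value
        if any(overlaps(hit, locus) for locus in annotated + kept):
            continue
        kept.append(hit)
    return kept


annotated = [{"seqid": "chr1", "start": 100, "end": 500, "strand": "+"}]
hits = [
    {"gene": "G1", "seqid": "chr1", "start": 150, "end": 450, "strand": "+", "identity": 0.99},
    {"gene": "G1", "seqid": "chr1", "start": 9000, "end": 9400, "strand": "+", "identity": 0.97},
    {"gene": "G1", "seqid": "chr1", "start": 9100, "end": 9500, "strand": "+", "identity": 0.95},
    {"gene": "G2", "seqid": "chr2", "start": 10, "end": 90, "strand": "-", "identity": 0.50},
]
extra = candidate_extra_copies(annotated, hits)
print([(h["gene"], h["seqid"], h["start"]) for h in extra])
```

Hits retained this way are only candidates; whether they reflect real copy-number variation still requires inspection of the underlying assembly and alignments.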

Future directions

Complete genomes and pangenomes create a promising opportunity to make annotation lift-over both more accurate and more informative about real biological differences. My next steps for LiftOn are to strengthen whole-genome robustness, broaden feature support, integrate complementary DNA and protein evidence, and make every correction traceable through explicit validation. I want LiftOn to become a dependable algorithm for transferring trusted annotations across individuals and species while preserving genuine gene-content differences and producing reproducible, structurally valid results at genome scale.

References

  1. O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research (2016). doi:10.1093/nar/gkv1189
  2. Brůna, T. et al. BRAKER3: fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS, and TSEBRA. Genome Research (2024). doi:10.1101/gr.278090.123
  3. Shumate, A. and Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics (2021). doi:10.1093/bioinformatics/btaa1016
  4. Li, H. Protein-to-genome alignment with miniprot. Bioinformatics (2023). doi:10.1093/bioinformatics/btad014
  5. Chao, K.-H. et al. Combining DNA and protein alignments to improve genome annotation with LiftOn. Genome Research (2025). doi:10.1101/gr.279620.124
  6. Varabyou, A. et al. CHESS 3: an improved, comprehensive catalog of human genes and transcripts. Genome Biology (2023). doi:10.1186/s13059-023-03088-4
  7. Chao, K.-H. et al. The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. G3 (2023). doi:10.1093/g3journal/jkac321
  8. Hansen, N. F. et al. A complete diploid human genome benchmark for personalized genomics. bioRxiv (2025). doi:10.1101/2025.09.21.677443