Genome annotation
Improving genome annotation with LiftOn
Accurate gene annotation — knowing where genes and their exons lie — is the foundation everything else in genome biology builds on. As assemblies are now produced far faster than they can be curated, I’m interested in transferring annotations across genomes reliably, and in fusing two kinds of evidence that are usually used in isolation: DNA alignment (fast, but brittle as species diverge) and protein alignment (robust to sequence change, but fragmentary).
The problem
The standard way to annotate a new assembly is to lift over genes from a well-annotated one. DNA-based lift-over (Liftoff) is fast, but when the target has diverged from the reference or the two assemblies differ, it can introduce frameshifts, premature stop codons, and misplaced exon boundaries. Protein-based mapping (miniprot) tolerates sequence change but produces fragmented, sometimes conflicting gene models. Neither alone is trustworthy.
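To make these failure modes concrete, here is a minimal Python sketch that flags a lifted-over coding sequence whose length breaks the reading frame or that contains an in-frame stop before its end. It is only an illustration: `cds_problems` is a hypothetical helper, not a function from Liftoff, miniprot, or LiftOn, and it assumes the standard genetic code and a CDS given on the coding strand.

```python
# Stop codons of the standard genetic code (assumed; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds: str) -> list[str]:
    """Return human-readable problems found in a coding sequence.

    Hypothetical helper for illustration; real tools also check splice
    sites, start codons, and alignment to the reference protein.
    """
    cds = cds.upper()
    problems = []

    # A CDS whose length is not a multiple of 3 cannot be read in frame.
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    # A stop codon anywhere before the last codon truncates the protein.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature stop codon")

    # A complete CDS is expected to end on a stop codon.
    if codons and codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")

    return problems


if __name__ == "__main__":
    for seq in ("ATGAAATAA", "ATGTAAAAATAA", "ATGAAATA"):
        print(seq, cds_problems(seq))
```

An empty list means the sequence passed these three checks, which is necessary but far from sufficient for a correct gene model.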
What I built
I developed LiftOn, which combines both signals. For each gene, it takes the protein encoded by the Liftoff model and the protein encoded by the miniprot model, aligns each to the reference protein, and scores the alignments for mismatches, indels, and premature stops. It then runs a protein-maximization CDS chaining algorithm that stitches the best-supported coding segments into one coherent gene model.
Figure 1. The LiftOn workflow. A gene is annotated independently from DNA evidence (Liftoff) and protein evidence (miniprot) (A); each candidate’s protein is aligned to the reference to score mismatches, indels, and premature stops (B–C); LiftOn chains coding segments to maximize protein-alignment identity (D); and emits a final, more accurate annotation (E).
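The chaining idea in panel D can be sketched in a few lines of Python. This is a deliberately simplified illustration under strong assumptions, not LiftOn's actual implementation: it assumes both candidates have already been split at the same breakpoints on the reference protein, that each segment carries a precomputed count of matching residues, and that the better segment can be chosen independently in each interval. The `Segment` type and the `chain_segments` and `identity` functions are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str      # "liftoff" or "miniprot"
    ref_start: int   # start on the reference protein (0-based, inclusive)
    ref_end: int     # end on the reference protein (exclusive)
    matches: int     # residues identical to the reference in this interval


def chain_segments(liftoff: list[Segment], miniprot: list[Segment]) -> list[Segment]:
    """Pick, interval by interval, the candidate with more matching residues.

    Assumes liftoff[i] and miniprot[i] cover the same reference interval.
    """
    if len(liftoff) != len(miniprot):
        raise ValueError("candidates must be split into the same intervals")

    chain = []
    for a, b in zip(liftoff, miniprot):
        if (a.ref_start, a.ref_end) != (b.ref_start, b.ref_end):
            raise ValueError("segments cover different reference intervals")
        # Ties go to the DNA-based candidate (an arbitrary choice in this sketch).
        chain.append(a if a.matches >= b.matches else b)
    return chain


def identity(chain: list[Segment]) -> float:
    """Fraction of reference residues matched across the chained segments."""
    total = sum(s.ref_end - s.ref_start for s in chain)
    return sum(s.matches for s in chain) / total if total else 0.0


if __name__ == "__main__":
    # Made-up numbers, chosen only to show the two sources disagreeing.
    liftoff = [
        Segment("liftoff", 0, 100, 98),
        Segment("liftoff", 100, 180, 40),   # e.g. a frameshifted stretch
        Segment("liftoff", 180, 250, 69),
    ]
    miniprot = [
        Segment("miniprot", 0, 100, 90),
        Segment("miniprot", 100, 180, 78),
        Segment("miniprot", 180, 250, 60),
    ]
    chain = chain_segments(liftoff, miniprot)
    print([s.source for s in chain])
    print(identity(liftoff), identity(miniprot), identity(chain))
```

In this toy input the chained model takes its first and last segments from Liftoff and its middle segment from miniprot, so its identity to the reference protein is higher than that of either candidate alone. A real implementation also has to map the chosen protein intervals back to genomic CDS coordinates and keep the reading frame consistent across the joins.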
What it showed
As Figure 1 shows, the two evidence types are complementary, and the chaining step in panel D is where LiftOn turns their disagreement into a gene model better than either source alone. Across human and other genomes, it recovers more complete, in-frame protein-coding annotations than Liftoff or miniprot on their own, especially where the target diverges from the reference.
In brief: a lift-over tool that fuses DNA and protein alignments for more accurate annotations across species. (Genome Research, 2025; documentation; code.)