Learning the rules of RNA splicing

RNA splicing forms mature transcripts by removing introns and joining exons. Exon-intron boundaries must be recognized at single-base precision, but the two boundary dinucleotides (typically GT at the donor and AG at the acceptor) are not sufficient on their own to identify them. Nearby motifs, broader sequence context, cell state, and regulatory proteins all influence which sites are used. Alternative choices create multiple isoforms; incorrect choices can disrupt a protein or contribute to disease.

Computationally, we want to identify donor and acceptor sites, estimate variant effects, and determine which RNA-seq junctions reflect real transcripts rather than alignment artifacts. We also need models that can be adapted to species other than human. These tasks all draw on sequence context, but each requires different training data and a different evaluation.

Models such as SpliceAI (Jaganathan et al., 2019) established the value of long sequence context, while Pangolin (Zeng and Li, 2022) extended prediction toward tissue-specific usage. I study how splice-site recognizers can become open, testable components of alignment, annotation, and variant-interpretation workflows.

Questions that drive my work

One question is what the model should recognize. Scanning a chromosome for splice sites differs from evaluating a donor–acceptor pair proposed by an RNA-seq alignment. For junction filtering, the relevant question is whether that pair is credible in its local context. The model and evaluation should match the decision being supported.

A second question is how to separate novelty from noise. Unannotated RNA-seq junctions can be real alternative isoforms or alignment errors. Read support helps, but abundance does not guarantee correctness. Sequence recognition provides independent evidence to combine with the alignment.
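
A minimal sketch of combining the two kinds of evidence follows. The `Junction` record, the `triage` function, the labels, and both thresholds are hypothetical and chosen only for illustration; they are not taken from any published tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    chrom: str
    donor: int        # 0-based position of the first intron base
    acceptor: int     # 0-based position one past the last intron base
    reads: int        # number of spliced reads supporting the junction
    score: float      # sequence-model score in [0, 1] (hypothetical)
    annotated: bool   # present in the reference annotation


def triage(j: Junction, min_score: float = 0.5, min_reads: int = 3) -> str:
    """Label a junction using read support and sequence score together.

    Both thresholds are illustrative assumptions, not calibrated values.
    """
    if j.annotated:
        return "annotated"
    well_supported = j.reads >= min_reads
    plausible = j.score >= min_score
    if plausible and well_supported:
        return "candidate novel junction"
    if plausible:
        return "plausible, low support"
    if well_supported:
        return "abundant, implausible sequence"
    return "likely artifact"


# Toy junctions with made-up coordinates, read counts, and scores.
junctions = [
    Junction("chr1", 1000, 1500, reads=40, score=0.97, annotated=True),
    Junction("chr1", 1000, 1820, reads=12, score=0.91, annotated=False),
    Junction("chr1", 2003, 2051, reads=55, score=0.02, annotated=False),
    Junction("chr1", 3100, 3400, reads=1, score=0.04, annotated=False),
]
labels = [triage(j) for j in junctions]
```

The third toy junction is the case the paragraph warns about: many reads support it, yet the sequence evidence is weak, so read count alone would have kept it.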

A third question is portability. Species differ in intron architecture, sequence composition, and annotation quality. A human-trained model can be informative in other species, but an open training pipeline allows it to be retrained or fine-tuned for the target species.

Directions I have explored

Recognizing candidate junctions to clean RNA-seq alignments

I developed Splam (Chao et al., 2024) around the junction-filtering problem. Splam scores the donor and acceptor of a candidate intron jointly from focused sequence windows. It was trained with canonical-transcript junctions, alternative annotated junctions, and carefully constructed negative examples so that alternative isoforms are represented as positive biology rather than treated as errors.
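
To make "focused sequence windows" concrete, here is a rough sketch of extracting one window around the donor and one around the acceptor of a candidate intron and joining them into a single input. The function name `junction_windows`, the coordinate convention, the `flank` value, the `N` padding, and the concatenation step are all assumptions made for illustration and may not match Splam's actual input layout.

```python
def junction_windows(seq: str, donor: int, acceptor_end: int,
                     flank: int = 200) -> tuple[str, str]:
    """Return (donor_window, acceptor_window), each 2 * flank bases long.

    Assumed convention: the intron occupies seq[donor:acceptor_end], so
    `donor` is the first intron base and `acceptor_end` is one past the
    last intron base. Windows that run off the sequence are padded with N.
    """
    if not 0 <= donor < acceptor_end <= len(seq):
        raise ValueError("intron coordinates fall outside the sequence")

    def window(boundary: int) -> str:
        start, end = boundary - flank, boundary + flank
        left_pad = "N" * max(0, -start)
        right_pad = "N" * max(0, end - len(seq))
        return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad

    return window(donor), window(acceptor_end)


# Toy sequence: exon AAAA, intron GTCCCCAG, exon TTTT.
toy = "AAAAGTCCCCAGTTTT"
donor_win, acceptor_win = junction_windows(toy, donor=4, acceptor_end=12, flank=3)

# One possible joint input: the two windows side by side, so a model
# sees both ends of the same candidate intron at once.
pair_input = donor_win + acceptor_win
```

Scoring both windows in one pass is what distinguishes evaluating a proposed pair from scanning a chromosome for individual sites.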

Applied to spliced alignments, Splam removes low-scoring junctions and reads while retaining supported transcript structure. The test is not only classification accuracy: filtering should improve transcript assembly without erasing credible alternative isoforms. This connects a local recognizer to an operational transcriptomics problem.
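
The filtering step can be sketched as follows, assuming each read has already been reduced to the list of junctions it spans and each junction already has a score. The function `filter_alignments`, the data layout, and the threshold are hypothetical; this is not Splam's implementation or command-line interface.

```python
JunctionKey = tuple[str, int, int]  # (chrom, intron start, intron end)


def filter_alignments(
    read_junctions: dict[str, list[JunctionKey]],
    scores: dict[JunctionKey, float],
    threshold: float = 0.1,  # illustrative cutoff, not a recommended value
) -> tuple[dict[str, list[JunctionKey]], list[str]]:
    """Drop every read that spans at least one low-scoring junction.

    Unspliced reads (empty junction list) and junctions without a score
    are left untouched.
    """
    low = {j for j, s in scores.items() if s < threshold}
    kept: dict[str, list[JunctionKey]] = {}
    removed: list[str] = []
    for name, juncs in read_junctions.items():
        if any(j in low for j in juncs):
            removed.append(name)
        else:
            kept[name] = juncs
    return kept, removed


# Toy data with made-up coordinates and scores.
a = ("chr1", 1000, 1500)
b = ("chr1", 2003, 2051)
c = ("chr1", 1000, 1820)
reads = {"r1": [a], "r2": [a, b], "r3": [], "r4": [c]}
scores = {a: 0.98, b: 0.03, c: 0.85}

kept, removed = filter_alignments(reads, scores)
```

In this toy case the read `r4` carries an alternative junction that shares a donor with `a` and survives because it scores well, which is the behavior the operational test is meant to check.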

Making splice prediction retrainable across species

OpenSpliceAI (Chao et al., 2025) rebuilds the SpliceAI architecture in PyTorch and exposes data preparation, training, transfer learning, genome-scale prediction, and variant scoring. Its contribution is to turn a strong fixed model into a platform that can be inspected, reproduced, and adapted.

We used that capability to train and transfer models for non-human species. The farther a target species is from human, the more important species-specific training becomes. The same software also supports variant-effect prediction, where the reference and alternate alleles are compared to identify gained or lost donor and acceptor signals.

Future directions

Open, task-specific splicing models can connect sequence recognition to better alignments, transcript assemblies, and variant interpretation. The next steps are to make these models easier to retrain across species, calibrate them for each decision, combine their scores with RNA-seq evidence, and evaluate biological outputs rather than classification metrics alone. I want to build this evidence stack so that novel isoforms can be separated from artifacts more reliably and every prediction can be traced to a model, dataset, and downstream test.

References

  1. Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell (2019). doi:10.1016/j.cell.2018.12.015
  2. Zeng, T. and Li, Y. I. Predicting RNA splicing from DNA sequence using Pangolin. Genome Biology (2022). doi:10.1186/s13059-022-02664-4
  3. Chao, K.-H. et al. Splam: a deep-learning-based splice site predictor that improves spliced alignments. Genome Biology (2024). doi:10.1186/s13059-024-03379-4
  4. Chao, K.-H. et al. OpenSpliceAI provides an efficient modular implementation of SpliceAI enabling easy retraining across nonhuman species. eLife (2025). doi:10.7554/eLife.107454.3