Splice site prediction with deep neural networks

RNA splicing, the excision of introns and joining of exons, is one of the most intricate steps of gene expression, and splicing errors underlie a large share of genetic disease. I’m interested in teaching neural networks to read the splicing code directly from sequence: to find where splicing occurs, to understand how it shifts in alternative splicing, and to predict how a single genetic variant can create or destroy a splice site.

The problem

Two gaps motivated this work. First, RNA-seq aligners place millions of spliced reads, but many of the resulting junctions are spurious, and that noise corrupts downstream transcript assembly and quantification. Second, the most accurate splice-site model, SpliceAI, was trained only on human data and is hard to retrain, leaving other species behind.

What I built

I designed a deep, dilated residual convolutional network that scores every base as a donor, acceptor, or neither, using dilations to see the long-range context splicing depends on while staying compact. I released it as two tools: Splam, which evaluates a focused window around each candidate site to clean up spliced alignments, and OpenSpliceAI, an efficient, modular PyTorch re-implementation of SpliceAI that anyone can retrain on a new species.
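To make the role of dilation concrete, here is a small sketch of how the receptive field grows in a stack of stride-1 convolutions. The kernel size, dilation schedule, and number of units are illustrative assumptions, not the actual configuration of Splam or OpenSpliceAI.

```python
def receptive_field(layers):
    """Receptive field (in bases) of a stack of stride-1 1D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs. Each layer adds
    (kernel_size - 1) * dilation positions to the span one output can see.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


def build_stack(kernel_size, dilations, units_per_group=4, convs_per_unit=2):
    """Hypothetical layout: one group of residual units per dilation value,
    each unit holding `convs_per_unit` convolutions with that dilation."""
    return [
        (kernel_size, d)
        for d in dilations
        for _ in range(units_per_group * convs_per_unit)
    ]


# Same depth and same number of weights; only the dilation schedule differs.
dilated = build_stack(kernel_size=3, dilations=[1, 2, 4, 8])
plain = build_stack(kernel_size=3, dilations=[1, 1, 1, 1])

print(receptive_field(dilated))  # 241
print(receptive_field(plain))    # 65
```

With these toy numbers, growing the dilation widens the context almost fourfold without adding a single parameter, which is the sense in which the network stays compact.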

Figure 1. The Splam model. (a) A dilated residual CNN whose growing dilation widens the receptive field, ending in a per-base softmax. (b) The residual unit: two grouped dilated convolutions with batch-norm and LeakyReLU. (c) Splam scores donor and acceptor sites from the 200 bp flanking each site on either side. (d) The one-hot input encoding of the 800 bp context (a 400 bp window around the donor plus a 400 bp window around the acceptor).
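The input framing in panels (c) and (d) can be sketched in a few lines of plain Python. The function names, the coordinate convention (a 0-based site position with 200 bp taken on each side), and the choice to encode N and out-of-range padding as all-zero columns are assumptions for illustration, not Splam's actual preprocessing code.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
UNKNOWN = (0, 0, 0, 0)  # assumption: N and padding become all-zero columns


def window(seq, center, flank):
    """Return the 2 * flank bases around `center`, padded with N at the ends."""
    start, end = center - flank, center + flank
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


def encode_junction(seq, donor, acceptor, flank=200):
    """One-hot encode the donor window followed by the acceptor window."""
    context = window(seq, donor, flank) + window(seq, acceptor, flank)
    return [ONE_HOT.get(base, UNKNOWN) for base in context.upper()]


seq = "ACGT" * 300  # toy 1,200 bp sequence
encoded = encode_junction(seq, donor=300, acceptor=900)
print(len(encoded), len(encoded[0]))  # 800 4
```

Each junction thus becomes an 800 x 4 matrix regardless of intron length, since only the sequence near the two splice sites is kept.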

What it showed

Using only the local sequence context framed in Figure 1, Splam learns to recognize splice sites, and filtering alignments with it removes mis-spliced reads and sharpens transcript assembly. OpenSpliceAI reproduces SpliceAI-level accuracy while running faster and, crucially, retrains cleanly on non-human genomes, where it outperforms the human model applied off-the-shelf.
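As a rough illustration of score-based alignment filtering, the sketch below drops any read that spans a low-scoring junction. The `Junction` class, the 0.5 threshold, and the rule that a junction passes only if both its donor and acceptor scores clear the threshold are hypothetical choices made for this example; they are not Splam's interface or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Junction:
    chrom: str
    donor: int
    acceptor: int
    donor_score: float     # model score for the donor site, in [0, 1]
    acceptor_score: float  # model score for the acceptor site, in [0, 1]

    @property
    def key(self):
        return (self.chrom, self.donor, self.acceptor)


def split_reads(reads, junctions, threshold=0.5):
    """Partition read IDs into (kept, discarded).

    `reads` maps a read ID to the junction keys it spans. A read is kept
    only if every junction it spans passes the (hypothetical) threshold.
    """
    passing = {
        j.key for j in junctions
        if min(j.donor_score, j.acceptor_score) >= threshold
    }
    kept, discarded = [], []
    for read_id, keys in reads.items():
        target = kept if all(k in passing for k in keys) else discarded
        target.append(read_id)
    return kept, discarded


junctions = [
    Junction("chr1", 100, 500, donor_score=0.98, acceptor_score=0.95),
    Junction("chr1", 800, 1200, donor_score=0.91, acceptor_score=0.12),
]
reads = {
    "read1": [("chr1", 100, 500)],
    "read2": [("chr1", 100, 500), ("chr1", 800, 1200)],
}
kept, discarded = split_reads(reads, junctions)
print(kept, discarded)  # ['read1'] ['read2']
```

Only the surviving reads would then be passed on to transcript assembly.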

In brief: open, retrainable splice-site models that sharpen alignments and predict the splicing impact of variants. (Splam — Genome Biology, 2024; OpenSpliceAI — eLife, 2025; code.)