Why we built Splam: a splice-junction model that cleans up RNA-seq

Almost every RNA-seq analysis begins the same way: you align millions of short reads back to the genome. Wherever a read jumps across an intron, the aligner records a splice junction — direct evidence of where the cell cut the pre-mRNA and joined the exons. The trouble is that many of those junctions aren’t real. Alignment errors, sequencing noise, and the general messiness of transcription all conspire to produce spurious junctions, and they quietly poison everything downstream: transcript assembly, quantification, annotation. If you want to trust your transcriptome, you first have to tell the real splice junctions from the noise.

Splam is the tool I built to do that.

Why not just use SpliceAI?

By the time I started, SpliceAI had already shown, beautifully, that a deep neural network could recognize splice sites directly from sequence. But it was built for a different job than ours, and the mismatch mattered in three ways.

First, context. SpliceAI reads up to ~10 kilobases of flanking sequence around each position to make a call. That is a lot of DNA, and more than the splicing machinery itself seems to need, since the signals that define a donor or acceptor site are concentrated within a couple hundred bases.

Second, scope. SpliceAI was trained on a single canonical isoform per gene, and on protein-coding genes only. But most human genes are alternatively spliced, with five to ten isoforms each, and an honest model of splicing should recognize all of their introns, coding or not.

Third, framing. SpliceAI scores donor and acceptor sites independently, scanning the genome base by base, whereas a real intron is defined by a pair: a donor and an acceptor that belong together.

In the Salzberg and Pertea labs, where I did my Ph.D., transcript assembly is daily bread, and the practical question we kept hitting was simpler than “where are all the splice sites?” — it was “which of these aligned junctions can I trust?” So I set out to build a compact model, focused on the junction as the unit, that we could actually drop into an RNA-seq pipeline. I built it with Alan Mao, an undergraduate I was mentoring. It was my first splice-site deep-learning project — its ideas later grew into OpenSpliceAI — and it followed a motto I keep coming back to: build what you need, use what you build.

What Splam is

Splam is a deep dilated residual convolutional network that looks at just 800 nucleotides — 200 bp on each side of a candidate donor, and 200 bp on each side of its acceptor — and labels every base a donor, an acceptor, or neither.

Figure 1. The Splam model. (a) A deep stack of dilated residual units whose growing dilation widens the receptive field, ending in a per-base softmax over donor / acceptor / neither. (b) The residual unit — grouped dilated convolutions with batch norm and leaky ReLU. (c, d) Splam reads a donor and its acceptor together as one 800 nt input (200 bp flanking each site), scoring the junction as a pair.
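To make the point about dilation concrete, here is a small arithmetic sketch of how the receptive field of a stack of stride-1 dilated convolutions grows. The formula is the standard one; the kernel sizes and dilation schedule in the example are hypothetical values chosen for illustration, not Splam's published configuration.

```python
def receptive_field(layers):
    """Receptive field (in bases) of stride-1 convolutions stacked in sequence.

    `layers` is an iterable of (kernel_size, dilation) pairs. Each layer
    widens the field by (kernel_size - 1) * dilation positions.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical schedule (an assumption, not Splam's real one):
# four layers at dilation 1, then four at dilation 4, all with kernel size 11.
undilated = [(11, 1)] * 8
dilated = [(11, 1)] * 4 + [(11, 4)] * 4

print(receptive_field(undilated))  # 81
print(receptive_field(dilated))    # 201
```

With the same number of layers and parameters, the dilated stack sees more than twice as much sequence, which is why growing dilation is a cheap way to cover a few hundred bases of context.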

Two choices set it apart from SpliceAI. It uses a short, biologically motivated window, roughly twelve times less sequence (800 nt against ~10 kb), because the information that defines a splice site is mostly local. And it scores the donor and acceptor together, as a junction, mirroring how the spliceosome recognizes both ends of an intron at once. We also trained it on splice sites drawn from all of a gene’s isoforms, not just the canonical one, so it sees the real diversity of human splicing.
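The paired 800 nt input is easy to picture in code. The following is a minimal sketch, not Splam's actual preprocessing: the function names, the 0-based coordinate convention, the N-padding at sequence ends, and the forward-strand-only handling are all assumptions made for illustration.

```python
FLANK = 200  # bases kept on each side of a splice site, as described above

ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def site_window(seq, pos, flank=FLANK):
    """Return `flank` bases before index `pos` plus `flank` bases from `pos` on.

    Positions that fall off either end of `seq` are padded with N
    (an assumption of this sketch).
    """
    start, end = pos - flank, pos + flank
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(seq))
    return left_pad + seq[max(0, start):min(len(seq), end)] + right_pad


def junction_input(seq, donor, acceptor, flank=FLANK):
    """Concatenate the donor window and the acceptor window into one input.

    Hypothetical convention: `donor` is the 0-based index of the first intron
    base, `acceptor` the 0-based index of the first base of the next exon.
    """
    return site_window(seq, donor, flank) + site_window(seq, acceptor, flank)


def one_hot(window):
    """Encode a sequence as 4-channel one-hot rows; N becomes all zeros."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in window.upper()]


# Toy forward-strand gene: a 300 nt exon, a 500 nt GT...AG intron, a 300 nt exon.
seq = "A" * 300 + "GT" + "C" * 496 + "AG" + "T" * 300
window = junction_input(seq, donor=300, acceptor=800)
encoded = one_hot(window)
print(len(window), window[200:202], window[598:600])  # 800 GT AG
```

The donor dinucleotide sits right after the midpoint of the first 400 nt and the acceptor dinucleotide right before the midpoint of the second, so the network always sees both ends of the intron at fixed positions.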

What it showed

It’s more accurate than SpliceAI — especially where it’s hard. On held-out human junctions Splam reaches 96% accuracy. On the canonical (MANE) splice sites it’s neck-and-neck with SpliceAI; but on alternative-isoform splice sites — the harder, more interesting ones — it pulls clearly ahead, despite reading a fraction of the sequence. The starkest way to see this: at a matched score threshold, Splam recovers 2,609 real splice junctions that SpliceAI misses, while missing only 202 that SpliceAI catches.

Figure 2. Catching what SpliceAI misses. Across score thresholds, Splam recovers thousands of genuine splice junctions that SpliceAI labels negative (green) — 2,609 at the marked threshold — while the reverse case, junctions SpliceAI finds but Splam misses, stays negligible (orange, 202).
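The comparison behind this figure is a simple count at a shared cutoff. The sketch below assumes each model has already been reduced to one score per junction (how the donor and acceptor scores are combined is left out), and the junction names and scores are made-up toy values, not data from the paper.

```python
def compare_at_threshold(truth, scores_a, scores_b, threshold):
    """Count real junctions that one model calls and the other misses.

    `truth` maps junction -> True if the junction is real.
    `scores_a` and `scores_b` map junction -> score in [0, 1].
    Returns (called only by A, called only by B).
    """
    only_a = only_b = 0
    for junction, is_real in truth.items():
        if not is_real:
            continue  # only genuine junctions count as recovered or missed
        called_a = scores_a[junction] >= threshold
        called_b = scores_b[junction] >= threshold
        if called_a and not called_b:
            only_a += 1
        elif called_b and not called_a:
            only_b += 1
    return only_a, only_b


# Toy example with hypothetical junction IDs and scores.
truth = {"j1": True, "j2": True, "j3": True, "j4": False, "j5": True}
model_a = {"j1": 0.90, "j2": 0.80, "j3": 0.20, "j4": 0.05, "j5": 0.40}
model_b = {"j1": 0.95, "j2": 0.30, "j3": 0.10, "j4": 0.70, "j5": 0.60}

print(compare_at_threshold(truth, model_a, model_b, threshold=0.5))  # (1, 1)
```

Sweeping `threshold` over a range and plotting the two counts gives a curve of the kind shown in the figure.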

Because Splam scores a junction as a pair, its donor and acceptor scores stay tightly consistent with each other. The model also generalizes to other species (chimpanzee, mouse, even the plant Arabidopsis) without retraining.

And it cleans up alignments. This is where the name comes from, and the reason I built it. Drop Splam into an RNA-seq workflow right after alignment: it pulls every spliced alignment’s junctions out of the BAM file, scores them, and removes the alignments that rest on spurious junctions — handing the assembler a cleaner file.

Figure 3. Splam as a pipeline step. (a) After reads are aligned (HISAT2/STAR), Splam cleans the spurious spliced alignments before transcript assembly (StringTie). (b) Internally, Splam extracts each alignment’s junctions, scores them as good or bad, and writes out a cleaned BAM alongside a separate file of the removed alignments.
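The extract-score-filter loop can be sketched in plain Python. This is an illustration of the idea, not Splam's implementation: it reads junctions from SAM CIGAR strings (where an `N` operation marks a skipped region, i.e. an intron), and the dictionary-based alignment records, the score lookup, the 0.5 threshold, and the rule that an unscored junction counts as bad are all assumptions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def junctions_from_cigar(chrom, start, cigar):
    """Return (chrom, intron_start, intron_end) for each N operation.

    `start` is the 0-based leftmost aligned reference position;
    intron coordinates are 0-based, half-open.
    """
    pos = start
    junctions = []
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((chrom, pos, pos + length))
        if op in REF_CONSUMING:
            pos += length
    return junctions


def split_alignments(alignments, junction_scores, threshold=0.5):
    """Separate alignments into kept and removed lists.

    An alignment is kept only if every junction it spans scores at least
    `threshold`. Unspliced alignments have no junctions and are always kept.
    """
    kept, removed = [], []
    for aln in alignments:
        juncs = junctions_from_cigar(aln["chrom"], aln["start"], aln["cigar"])
        if all(junction_scores.get(j, 0.0) >= threshold for j in juncs):
            kept.append(aln)
        else:
            removed.append(aln)
    return kept, removed


# Toy alignments and scores (hypothetical values).
alignments = [
    {"name": "r1", "chrom": "chr1", "start": 100, "cigar": "50M200N50M"},
    {"name": "r2", "chrom": "chr1", "start": 100, "cigar": "30M5000N70M"},
    {"name": "r3", "chrom": "chr1", "start": 400, "cigar": "100M"},
]
scores = {("chr1", 150, 350): 0.97, ("chr1", 130, 5130): 0.02}

kept, removed = split_alignments(alignments, scores)
print([a["name"] for a in kept], [a["name"] for a in removed])
# ['r1', 'r3'] ['r2']
```

Returning the removed alignments as their own list mirrors the workflow in the figure, where the discarded reads go to a separate file so nothing is silently lost.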

The payoff is concrete: cleaning the alignments raises intron precision — fewer false introns — with almost no loss of real ones, and yields more correctly assembled transcripts, across both poly-A-capture and ribosomal-RNA-depletion datasets.
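Intron precision and recall here have their usual meaning: of the introns an assembly reports, how many are in the reference annotation, and of the reference introns, how many were found. A minimal sketch, with made-up intron coordinates standing in for real data:

```python
def intron_precision_recall(called, reference):
    """Return (precision, recall) of called introns against a reference set."""
    called, reference = set(called), set(reference)
    true_positives = len(called & reference)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical introns as (chrom, start, end) tuples.
reference = {("chr1", 150, 350), ("chr1", 500, 900),
             ("chr1", 1200, 1600), ("chr2", 80, 300)}
spurious = {("chr1", 130, 5130), ("chr2", 10, 9000)}

before_cleaning = reference | spurious        # all real introns plus two false ones
after_cleaning = reference | {("chr2", 10, 9000)}  # one false intron survives

precision_before, recall_before = intron_precision_recall(before_cleaning, reference)
precision_after, recall_after = intron_precision_recall(after_cleaning, reference)
print(precision_before, recall_before)
print(precision_after, recall_after)
```

In this toy case precision rises from 4/6 to 4/5 while recall stays at 1.0, which is the shape of the result described above: fewer false introns with no real ones lost.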

The bigger picture

Splam is a small, sharp tool with a clear job, and I’m fond of it for exactly that. But the lessons it taught me are the part I carried forward. Local context is often enough — you don’t always need to read a whole gene to judge a splice site. Model the biological unit — here, the junction pair — rather than its parts in isolation. And train on the real diversity of the data, not a tidy canonical subset. Those same instincts shaped my later splicing work, including OpenSpliceAI, the open, retrainable reimplementation of SpliceAI.

Splam is free and open source, with code and documentation online — drop it into your RNA-seq pipeline and see what it removes.


Read the paper in Genome Biology, browse the code, work through the documentation, or watch the talk. Splam was built with Alan Mao, Steven Salzberg, and Mihaela Pertea at Johns Hopkins.

Further reading: SpliceAI (Jaganathan et al., Cell 2019), the splice-site model Splam is benchmarked against; and OpenSpliceAI, my open PyTorch reimplementation of SpliceAI that grew out of this work.