Research summary

Why we built Shorkie: reading the regulatory code of yeast

Every cell in your body carries essentially the same DNA, yet a neuron and a skin cell could hardly look more different. The difference is regulation — which genes are switched on, how strongly, and exactly when. That program is written into the non-coding DNA around each gene, in a language of transcription-factor binding sites and chromatin signals we still only partly understand. If we could read it fluently — predict a gene’s expression directly from its sequence — we could finally interpret the flood of non-coding variants that genetic studies turn up but can’t explain.

We’re not there yet. Even sophisticated models explain at most around 73% of the variation in gene expression. Shorkie, the project I want to tell you about, is our attempt to close some of that gap — and, just as importantly, to do it in a way where we can check whether the model has actually learned biology rather than memorized its training set.

Why budding yeast?

If you want to test whether a model can learn the rules of gene regulation, budding yeast (Saccharomyces cerevisiae) is almost the perfect laboratory. It is the premier model organism for eukaryotic gene regulation: roughly 6,000 genes controlled by hundreds of transcription factors, decades of careful experiments mapping who binds where, and a genome compact enough, at about 12 megabases, to model end to end. It is also a cornerstone of aging research: much of what we understand about lifespan, from sirtuins to calorie restriction, was first worked out in yeast. That matters here, because Shorkie was built at Calico, a company devoted to the biology of aging, and learning to read the regulatory code of a leading aging model is squarely in its wheelhouse.

But that compactness cuts both ways. A 12 Mb genome simply doesn’t contain enough examples to train a large deep-learning model from scratch. So we had a model organism that was ideal in every respect except the one that matters most for deep learning — data.

The bet: borrow power from evolution

Shorkie began as my summer-2024 internship in the Kelley Lab at Calico, and our first instinct was the obvious one: take Borzoi — the lab’s model that predicts RNA-seq coverage directly from DNA sequence (Linder et al., 2025) — and train it on yeast from scratch. It didn’t work. Supervised learning on roughly 6,000 genes gave the network far too little to go on; it overfit the training set and failed to generalize to genes it hadn’t seen. The very compactness that makes yeast so tractable for biologists is exactly what starves a deep model of data.

So we changed tack. If yeast’s own genome was too small, perhaps its relatives could supply what was missing. Across the fungal kingdom, evolution has run essentially the same regulatory experiment millions of times over: the sequences that matter for regulation tend to stay conserved, while the rest drifts freely. We were encouraged by a 2024 study from the Gagneur lab showing that DNA language models trained across hundreds of species learn regulatory elements — and even their evolution — far better than alignment to any single genome allows (Karollus et al., 2024). That was the spark: pretrain a DNA language model on yeast’s fungal relatives first, then fine-tune it on yeast itself.

Figure 1. The tree of life we pretrain across — 1,341 fungal genomes spanning the kingdom, from oyster mushrooms, shiitake, and black truffles to the yeasts. The order Saccharomycetales around budding yeast is highlighted in green; the 165-genome slice of that order turned out to be the sweet spot for transferring regulatory grammar back to S. cerevisiae.

This is the same self-supervised idea behind large language models, applied to DNA — instead of predicting the next word, the model learns to fill in masked stretches of genome, and in doing so internalizes which patterns are meaningful. And because this was Calico, we could pair it with something most groups don’t have: the ability to generate new data to fine-tune on. The team produced high-resolution RNA-seq time courses — the Induction Dynamics gene Expression Atlas (IDEA) — using miniaturized chemostats, or “ministats,” that let us watch yeast genes respond to a transcription-factor induction minute by minute.
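To make the fill-in-the-blanks objective concrete, here is a minimal sketch in plain Python. The 15% mask rate, the `N` mask symbol, and the function names are illustrative assumptions of mine, not Shorkie's actual settings; the point is only the shape of the task: hide some bases, ask for a probability over A/C/G/T at each hidden position, and score the answer with cross-entropy.

```python
import math
import random

BASES = "ACGT"
MASK = "N"  # hypothetical mask symbol, chosen for illustration


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Hide a random subset of bases; return the masked sequence and hidden positions."""
    rng = rng or random.Random(0)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    masked = list(seq)
    for i in positions:
        masked[i] = MASK
    return "".join(masked), positions


def masked_cross_entropy(probs, seq, positions):
    """Mean negative log-likelihood of the true base at each masked position.

    probs[i] maps each base to the model's predicted probability at position i.
    """
    return -sum(math.log(probs[i][seq[i]]) for i in positions) / len(positions)


seq = "TATAAAAGGCGCGATGACGT"
masked, positions = mask_sequence(seq)

# A model that has learned nothing assigns 0.25 to every base, so its loss is ln(4).
uniform = [{b: 0.25 for b in BASES} for _ in seq]
baseline_loss = masked_cross_entropy(uniform, seq, positions)
```

Any model that has picked up real sequence patterns should land below that ln(4) baseline, which is what makes the loss a useful training signal without a single label.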

What Shorkie is

Shorkie is a two-stage model. First comes pretraining: we trained a masked DNA language model to reconstruct hidden parts of fungal genomes, and we deliberately tested four nested training sets of increasing breadth — a single yeast reference genome (R64); 80 S. cerevisiae strains; 165 genomes from the broader Saccharomycetales order; and 1,341 genomes spanning the whole fungal kingdom. Strikingly, more data wasn’t simply better: pretraining on the 165-genome Saccharomycetales set gave the largest downstream boost of the four — a sweet spot with enough evolutionary diversity to reveal what’s conserved, but not so much divergence that the signal turns to noise. Evolutionary distance, it turns out, is a dial you can tune.

Then comes fine-tuning. We adapted the pretrained model to predict real measurements, unifying 5,215 experimental tracks in a single network: 3,053 high-resolution induction RNA-seq timepoints, about a thousand strain RNA-seq profiles, 1,128 ChIP-exo transcription-factor binding datasets, and 20 histone-modification tracks. The fine-tuning network is a compact, yeast-scale Borzoi: it reads a 16 kb window of DNA, compresses it through convolutions to 128 bp resolution for the transformer core, then reconstructs predictions at 16 bp resolution through a U-Net, all in just 13.7 million parameters.
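The numbers in that paragraph fit together with a little arithmetic, sketched below. I am assuming the "16 kb" window is the same 16,384 bp window used in pretraining and that no output bins are cropped at the edges; neither detail is spelled out here, so treat the resulting shape as a back-of-the-envelope figure.

```python
WINDOW_BP = 16_384     # assumption: "16 kb" means the 16,384 bp pretraining window
OUTPUT_RES_BP = 16     # output resolution, from the text
TOTAL_TRACKS = 5_215   # from the text

# Track counts stated exactly in the text.
stated = {
    "induction RNA-seq timepoints": 3_053,
    "ChIP-exo TF binding": 1_128,
    "histone modifications": 20,
}

# The strain RNA-seq count is given only as "about a thousand";
# this is the remainder implied by the stated total.
implied_strain_rnaseq = TOTAL_TRACKS - sum(stated.values())

# Output bins per window, ignoring any edge cropping (an assumption).
n_bins = WINDOW_BP // OUTPUT_RES_BP

# Shape of one prediction under these assumptions: (bins, tracks).
output_shape = (n_bins, TOTAL_TRACKS)
```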

Under the hood, the pretraining model has a U-Net shape (Figure 2). During pretraining its only job is self-supervised: predict the identity of every base — A, C, G, or T — across the whole 16,384 bp window. Seven residual down-sampling blocks compress the sequence to a coarse 128 bp representation, where eight Transformer layers integrate long-range context; a mirror-image decoder with skip connections and depthwise-separable convolutions then restores single-base resolution, ending in a softmax over the four nucleotides. It’s a small network by modern standards, which is part of the point — most of its power comes from what it was trained on, not from raw scale.

Figure 2. The Shorkie LM architecture. The 13.7-million-parameter model reads a 16,384 bp window and predicts per-base nucleotide probabilities: a 1D convolution plus seven residual down-sampling blocks reduce the sequence to a 128 × 384 representation, eight Transformer layers integrate long-range context, and a U-Net decoder with skip connections restores it to full length before a final softmax over A/C/G/T.
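If you want to follow the shapes through the encoder, the sketch below does the bookkeeping. It assumes each of the seven down-sampling blocks halves the sequence length, which is my reading rather than something stated outright, but it is the choice that reproduces the 128 × 384 bottleneck in the caption.

```python
WINDOW_BP = 16_384   # input window, from the text
N_DOWN_BLOCKS = 7    # residual down-sampling blocks, from the text
CHANNELS = 384       # bottleneck width, from the Figure 2 caption
POOL = 2             # assumption: each block halves the sequence length


def encoder_lengths(window=WINDOW_BP, n_blocks=N_DOWN_BLOCKS, pool=POOL):
    """Sequence length at the input and after each down-sampling block."""
    lengths = [window]
    for _ in range(n_blocks):
        lengths.append(lengths[-1] // pool)
    return lengths


lengths = encoder_lengths()

# Shape seen by the Transformer layers: (positions, channels).
bottleneck_shape = (lengths[-1], CHANNELS)

# How many bases each bottleneck position summarizes.
bp_per_position = WINDOW_BP // lengths[-1]

# The decoder retraces the same lengths in reverse, one skip connection per level.
decoder_lengths = lengths[::-1]
```

Under that assumption the Transformer sees 128 positions, each summarizing 128 bp, which is why attention across the whole window stays cheap.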

What it showed

Three findings stand out to me.

Pretraining transfers, measurably. Fine-tuned from the fungal language model, Shorkie predicts gene-level expression with a Pearson correlation of 0.88, versus 0.74 for an identical model trained from scratch — essentially the supervised approach we had started with — and it wins on 87.8% of genes. That 0.74 → 0.88 jump is the whole pivot in a single number: the grammar learned across the fungal tree really does carry back to yeast.
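For readers who want to see how a comparison like this can be computed, here is a small sketch. It assumes "gene-level" means one Pearson correlation per gene across conditions, with a "win" counted when the pretrained model's correlation is higher; that is one plausible reading, not a description of the exact evaluation. The gene names and numbers are toy values made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def win_rate(measured, pred_a, pred_b):
    """Fraction of genes where model A's per-gene correlation beats model B's."""
    wins = sum(
        pearson(measured[g], pred_a[g]) > pearson(measured[g], pred_b[g])
        for g in measured
    )
    return wins / len(measured)


# Toy expression values across four conditions (hypothetical genes and numbers).
measured = {"GENE_A": [1, 2, 3, 4], "GENE_B": [2, 1, 4, 3], "GENE_C": [4, 3, 2, 1]}
pretrained = {"GENE_A": [2, 4, 6, 8], "GENE_B": [4, 2, 8, 6], "GENE_C": [4, 2, 3, 1]}
scratch = {"GENE_A": [1, 3, 2, 4], "GENE_B": [1, 2, 3, 4], "GENE_C": [4, 3, 2, 1]}

rate = win_rate(measured, pretrained, scratch)
```

In this toy set the pretrained predictions win on two of the three genes; the real comparison runs the same kind of tally over every held-out gene.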

It learned biology, not the answer key. This is the part I most wanted to verify. Using in-silico mutagenesis — asking the model how its prediction shifts when we perturb each base — Shorkie rediscovered canonical transcription-factor motifs without ever being told they exist: Reb1, Tye7, Cbf1, the poly(dA:dT) tracts that position nucleosomes, the TATA box, Rap1 sites in ribosomal-protein promoters, and the RRPE and PAC motifs of ribosome biogenesis. It even picked up 5′ splice-donor and branch-point signals. You can’t recover that grammar by memorizing expression values; the model had to internalize the underlying logic. You can see it directly in Figure 3: at the RPL26A locus, Shorkie’s importance scores pick out the Fhl1 and Rap1 binding sites and the 5′ splice donor, lining up with the reference annotation — while an identical model without pretraining (the Shorkie_Random_Init control) stays essentially flat.

Figure 3. Reading the grammar back out, one gene at a time. At the ribosomal-protein gene RPL26A, per-base importance from Shorkie’s language model (Shorkie LM) and from in-silico mutagenesis (Shorkie ISM) lights up the Fhl1 and Rap1 transcription-factor sites and the 5′ splice donor — matching the reference annotation at the bottom. The Shorkie_Random_Init track, an identical model without pretraining, stays essentially flat.
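In-silico mutagenesis is simple enough to sketch in a few lines. The version below is a minimal illustration, not Shorkie's implementation: `toy_model` is a hypothetical stand-in that just counts a TATA-like motif, and the scoring rule (average drop in prediction over the three alternative bases) is one common convention that I am assuming here.

```python
BASES = "ACGT"


def toy_model(seq):
    """Hypothetical stand-in for a trained predictor: counts TATAAA occurrences."""
    motif = "TATAAA"
    return float(
        sum(seq[i:i + len(motif)] == motif for i in range(len(seq) - len(motif) + 1))
    )


def in_silico_mutagenesis(seq, model):
    """Per-base importance: mean drop in the prediction when that base is mutated."""
    ref = model(seq)
    scores = []
    for i, ref_base in enumerate(seq):
        deltas = []
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:i] + alt + seq[i + 1:]
            deltas.append(model(mutant) - ref)
        # Positive score means the reference base supports the prediction.
        scores.append(-sum(deltas) / len(deltas))
    return scores


seq = "GCGCTATAAAGCGC"
scores = in_silico_mutagenesis(seq, toy_model)

# Positions whose mutation lowers the prediction; here, the six motif bases.
important = [i for i, s in enumerate(scores) if s > 0]
```

Swap the toy function for a trained network and plot the scores as a logo, and you get tracks like the ones in Figure 3: flat where the sequence does not matter, tall where it does.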

It’s useful for variants and for dynamics. Shorkie outperformed specialized DREAM models at classifying cis-eQTLs — the non-coding variants that nudge expression up or down — and agreed closely with massively parallel reporter assays, exactly the kind of signal you want for interpreting regulatory mutations. And because it was trained on time courses, it captures regulation as a moving target: as we induced the stress-response factors Msn2/Msn4, the model’s reliance on their STRE binding motif sharpened over the first 90 minutes, tracking the real biology of activation.
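Variant scoring follows the same logic: predict for the reference sequence, predict again with the variant swapped in, and compare. The sketch below pairs that with an AUROC calculation as one way to ask whether the scores separate expression-altering variants from background ones. The metric, the toy STRE-counting model, the sequence, and the variant labels are all assumptions for illustration; this post does not specify how the eQTL benchmark was scored.

```python
def toy_model(seq):
    """Hypothetical stand-in predictor: counts STRE-like AGGGG motifs."""
    return float(sum(seq[i:i + 5] == "AGGGG" for i in range(len(seq) - 4)))


def variant_effect(seq, pos, alt, model):
    """Predicted change in output when the base at `pos` is replaced by `alt`."""
    mutant = seq[:pos] + alt + seq[pos + 1:]
    return model(mutant) - model(seq)


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


seq = "TTCAGGGGTTCATTCA"  # toy promoter with one AGGGG motif at positions 3-7

# Hypothetical labels: variants inside the motif versus outside it.
causal = [(4, "C"), (6, "T")]
background = [(0, "A"), (12, "G")]

causal_scores = [abs(variant_effect(seq, p, a, toy_model)) for p, a in causal]
background_scores = [abs(variant_effect(seq, p, a, toy_model)) for p, a in background]

score = auroc(causal_scores, background_scores)
```

Taking the absolute value treats up- and down-regulating variants alike, which suits a yes-or-no classification; keep the sign if the direction of the effect is what you care about.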

The bigger picture

For me, Shorkie is a small, clean demonstration of a principle I increasingly believe runs through computational biology: when labeled data is scarce, the way forward is to borrow statistical strength from everywhere else — across species, across assays, across time. A model trained over many genomes learns a grammar that no single genome can teach it, and self-supervised pretraining plus transfer learning beats training any one task in isolation. Yeast is the ideal place to prove that, precisely because it’s small enough to interrogate exhaustively and well-understood enough that we know when the model is right.

There’s plenty left to do. Shorkie is largely blind to 3′ splice-acceptor mutations, its predictions come at 16 bp rather than single-base resolution, and the real prize is still ahead: carrying this recipe to the human genome, where the regulatory code reaches across far longer distances and the variants matter far more. But the throughline is clear, and it’s why I’m drawn to this kind of work: a model that reads DNA and predicts function lets us interrogate the regulatory code directly, one nucleotide at a time.

Shorkie is open source — the code and trained models are on GitHub, in keeping with something of a personal motto: build what you need, use what you build.


Read the preprint on bioRxiv, browse the code, or watch the talk. Shorkie was built with Majed Mohamed Magzoub, Emily Stoops, Sean R. Hackett, Johannes Linder, and David R. Kelley at Calico Life Sciences.

Further reading: Borzoi — predicting RNA-seq coverage from DNA sequence (Linder et al., Nature Genetics 2025), the sequence-to-expression model Shorkie builds on; and Species-aware DNA language models capture regulatory elements and their evolution (Karollus et al., Genome Biology 2024), which inspired the cross-species pretraining.