Modeling gene regulation from DNA sequence

Every cell contains essentially the same genome, yet cells express genes at different levels and at different times. Part of this regulatory program is encoded in DNA through promoters, transcription-factor binding sites, and combinations of elements whose effects depend on sequence context.

I study sequence-to-function models because they provide a direct way to connect DNA with molecular phenotypes such as binding, chromatin accessibility, and RNA abundance. Models including Enformer (Avsec et al., 2021) and Borzoi (Linder et al., 2025) show that long sequence context can support detailed predictions of regulatory activity. I am interested not only in prediction, but also in whether these models can identify consequential bases, anticipate variant effects, and transfer what they learn across biological settings.

Questions that drive my work

The first question is generalization. A genomic model can fit abundant measurements while relying on correlations that do not transfer to unseen genes, conditions, or species. Evaluation should make memorization difficult and test whether the model learned reusable biology.
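As a minimal sketch of what "making memorization difficult" can mean in practice, the split below assigns every measurement of a gene to the same side of the train/test boundary, so a held-out gene is never seen under any condition. The gene identifiers, condition names, and the helper names `assign_fold` and `split_by_group` are illustrative assumptions, not the protocol of any particular study. For closely related genes, the group key could instead be a homology cluster.

```python
import hashlib


def assign_fold(group_id: str, n_folds: int = 5) -> int:
    """Deterministically map a group (e.g. a gene) to one of n_folds folds."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def split_by_group(examples, test_fold: int = 0, n_folds: int = 5):
    """Split (group_id, payload) pairs so that no group lands on both sides."""
    train, test = [], []
    for group_id, payload in examples:
        side = test if assign_fold(group_id, n_folds) == test_fold else train
        side.append((group_id, payload))
    return train, test


# Illustrative data: each gene is measured under two hypothetical conditions.
genes = ["YAL001C", "YBR020W", "YCL018W", "YDR007W", "YEL021W", "YFL039C"]
examples = [(g, cond) for g in genes for cond in ("control", "induced")]

train, test = split_by_group(examples)
train_genes = {g for g, _ in train}
test_genes = {g for g, _ in test}
# With so few genes the test fold may be small or empty; the guarantee is
# only that the two gene sets never overlap.
```

Hashing the group key, rather than shuffling rows, keeps the assignment stable when new measurements for an existing gene are added later.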

The second is where prior knowledge should come from. Functional labels are expensive and concentrated in a few organisms, while genome sequence is abundant. Evolution leaves patterns of conservation and divergence that self-supervised models can learn. Species-aware language models show that evolutionary context can expose regulatory structure (Karollus et al., 2024), but the best pretraining scope depends on the target organism and task. A larger corpus is not automatically better if distant genomes dilute the relevant signal.

The third is biological interpretation. Attribution and in-silico mutagenesis can highlight important bases, but a score is not yet a mechanism. I look for agreement with known motifs, controlled perturbations, independent variant measurements, and condition-specific responses. A useful model should produce hypotheses that can be tested rather than explanations that must be accepted on faith.
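In-silico mutagenesis can be sketched in a few lines: substitute every base with each alternative, rescore the sequence, and record the change relative to the reference. The scoring function here is a toy stand-in that counts a TATA-like motif, used only so the example runs without a trained model; `toy_score`, the motif, and the example sequence are assumptions for illustration.

```python
BASES = "ACGT"


def toy_score(seq: str) -> float:
    """Stand-in for a trained model: counts occurrences of a TATA-like motif."""
    motif = "TATAAA"
    n_windows = len(seq) - len(motif) + 1
    return float(sum(seq[i:i + len(motif)] == motif for i in range(n_windows)))


def in_silico_mutagenesis(seq: str, score_fn):
    """Return {(position, alt_base): score(mutant) - score(reference)}."""
    ref_score = score_fn(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(mutant) - ref_score
    return effects


def position_importance(effects, length: int):
    """Summarize each position by its largest absolute substitution effect."""
    importance = [0.0] * length
    for (pos, _alt), delta in effects.items():
        importance[pos] = max(importance[pos], abs(delta))
    return importance


seq = "GGCGTATAAAGGCC"  # the motif occupies positions 4-9 (0-based)
effects = in_silico_mutagenesis(seq, toy_score)
importance = position_importance(effects, len(seq))
```

In this toy case the six motif positions receive nonzero importance and the flanks receive none. With a real model, the same table of effects is only a starting point: it still has to be compared against known motifs and independent measurements.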

Directions I have explored

Learning regulatory priors from related genomes

My main project in this direction is Shorkie (Chao et al., 2025), a fungal DNA language model for budding yeast. We first trained a supervised sequence-to-expression model directly. In our experiments, it fit the genes it had seen during training but generalized to unseen genes less well than we wanted. This led us to test whether self-supervised pretraining on related fungal genomes could provide a better starting point.

We trained the same masked-DNA model across nested evolutionary scopes: a single Saccharomyces cerevisiae reference genome, multiple strains, the Saccharomycetales order, and the fungal kingdom. Pretraining helped, but the largest corpus was not the best. The strongest transfer came from 165 Saccharomycetales genomes, which supplied useful diversity while remaining close enough to budding yeast for the learned patterns to transfer.
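For readers unfamiliar with masked-DNA training, the data side of the objective can be sketched as follows: hide a random subset of bases and keep the originals as prediction targets. This is a generic illustration, not Shorkie's actual pipeline; the 15% mask rate, the `#` mask symbol, and the function names are assumptions borrowed from common masked-language-model practice.

```python
import random

MASK = "#"  # assumption: a placeholder symbol that never occurs in DNA


def mask_sequence(seq: str, mask_rate: float = 0.15, rng=None):
    """Hide a random subset of bases; return the masked string and targets."""
    if rng is None:
        rng = random.Random(0)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the base the model must recover
        tokens[pos] = MASK
    return "".join(tokens), targets


def restore(masked: str, targets) -> str:
    """Fill the masked positions back in, e.g. to check the bookkeeping."""
    tokens = list(masked)
    for pos, base in targets.items():
        tokens[pos] = base
    return "".join(tokens)


seq = "ACGTACGTACGTACGTACGT"
masked, targets = mask_sequence(seq, rng=random.Random(1))
```

A model trained on this task sees `masked` as input and is scored only on the positions in `targets`. Changing the evolutionary scope of pretraining changes which genomes the sequences are drawn from, not this masking step.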

We then fine-tuned the model on high-resolution regulatory measurements, including time-course RNA-seq after transcription-factor induction. This connects static sequence with a dynamic response. Motif sensitivity and regulatory-variant scoring provide additional tests of whether pretraining contributed biological information rather than only improving optimization.

Connecting observation with intervention

Sequence models learn from naturally occurring variation and observational assays. Perturbational data ask what happens when a gene or pathway is deliberately changed. My contribution to a large genome-wide Perturb-seq effort (You et al., 2026) is adjacent to Shorkie, but it supports the same broader goal: connecting genomic information, cellular interventions, and expression responses at scale.

Future directions

Future work in gene-regulation modeling should connect evolutionary pretraining, natural variation, functional assays, and controlled perturbations within the same evaluation framework. A promising next step is to test when these evidence sources agree and use their disagreements to design informative experiments. Building on Shorkie and large-scale perturbational data, I want to develop sequence models that generalize across genes and conditions, reveal testable regulatory hypotheses, and help distinguish predictive correlation from biological mechanism.

Related work

Publications

Predicting dynamic expression patterns in budding yeast with a fungal DNA language model

P588: AI and drug discovery with 100 million cells of genome-wide Perturb-seq
