
Genome assembly

Building complete and diverse reference genomes


Genome assembly reconstructs chromosomes from sequencing reads, but the hardest regions are precisely where reads are difficult to place: long repeats, segmental duplications, centromeres, telomeres, ribosomal arrays, and structurally variable loci. An assembly can be highly contiguous while still collapsing two haplotypes, misjoining repeats, or omitting important sequence.
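To make the gap between contiguity and correctness concrete, here is a minimal sketch of N50, a common contiguity summary. The contig lengths are hypothetical toy values, not data from any real assembly, and the function name is my own for illustration.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assemblies of the same 120 units of sequence.
fragmented = [40, 30, 20, 10, 10, 10]
# Same sequence, but the two longest contigs were wrongly joined through a repeat.
misjoined = [70, 20, 10, 10, 10]

print(n50(fragmented))  # 30
print(n50(misjoined))   # 70
```

The misjoined toy assembly scores higher even though it is less correct, which is why contiguity statistics need to be paired with independent validation.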

Long-read sequencing changed what is possible. PacBio HiFi reads provide long, highly accurate observations; Oxford Nanopore reads can span even larger repeats; and graph-based assemblers such as hifiasm (Cheng et al., 2021) preserve haplotype structure. The Telomere-to-Telomere Consortium showed that a human genome could be assembled through nearly all previously unresolved regions (Nurk et al., 2022), while methods such as Verkko (Rautiainen et al., 2023) pushed toward automated diploid telomere-to-telomere assembly.

Completeness is not the only goal. A useful reference also needs provenance, validation, annotation, and a clear account of which sequence came from the individual and which came from outside guidance. Nor should one person's genome be treated as representative of a population.

Questions that drive my work

The first question is what “complete” should mean. A complete reference should distinguish sequence assembled from an individual’s reads from regions filled using another reference, and support difficult regions with independent evidence.

The second question is how to represent diploidy. Human genomes contain two homologous copies of each autosome. Collapsing them into one mosaic can hide heterozygous sequence and structural variation, while haplotype-resolved assembly preserves maternal and paternal chromosomes separately. I see diploid assembly as a promising direction because that additional resolution supports more precise comparisons, personalized annotation, and analysis of variation between haplotypes.

The third question is whose genomes become references. GRCh38 and T2T-CHM13 are indispensable, but neither represents human diversity. The Human Pangenome Reference (Liao et al., 2023) instead connects diverse, haplotype-resolved assemblies into a broader representation.

Finally, sequence and annotation should advance together. An assembly without gene annotation is difficult to compare biologically, while annotating an assembly that contains errors can turn those errors into apparent gene differences.

Directions I have explored

A complete, annotated Southern Han Chinese genome

My first Ph.D. project was Han1 (Chao et al., 2023). Inspired by T2T-CHM13, Steven Salzberg proposed completing another individual genome from a different genetic background. I chose the Southern Han Chinese sample HG00621 because it matched my ethnicity, broadened the ancestry represented, and had complementary HiFi and Nanopore data.

The workflow combined de novo assembly, chromosome-scale scaffolding, reference-guided gap closing, manual review of difficult regions, polishing, and gene annotation. The resulting reference is gapless and chromosome-complete, but that description needs a qualification: roughly 120 Mb of the hardest sequence was filled using T2T-CHM13 guidance, so Han1 is not a purely de novo representation of every base in HG00621. The assembly records those guided regions rather than obscuring them.
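One way such a provenance record can be used is to report what fraction of an assembly came from outside guidance. The sketch below assumes guided regions are stored as BED-style `(chromosome, start, end)` half-open intervals. That format, the function names, and the toy coordinates are all assumptions for illustration, not the actual Han1 record.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total bases covered by half-open (start, end) intervals, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged block, if any, and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def guided_fraction(guided_regions, chrom_lengths):
    """Fraction of all assembled bases that fall inside reference-guided regions."""
    by_chrom = defaultdict(list)
    for chrom, start, end in guided_regions:
        by_chrom[chrom].append((start, end))
    guided_bases = sum(merged_length(ivs) for ivs in by_chrom.values())
    return guided_bases / sum(chrom_lengths.values())


# Hypothetical toy assembly: two chromosomes and three guided regions,
# two of which overlap on chrA.
chrom_lengths = {"chrA": 1000, "chrB": 500}
guided_regions = [("chrA", 100, 200), ("chrA", 150, 250), ("chrB", 0, 50)]

fraction = guided_fraction(guided_regions, chrom_lengths)
print(f"{fraction:.1%} of bases are reference-guided")
```

Merging overlapping intervals before summing avoids double-counting bases when guided regions were recorded in more than one pass.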

Annotation enabled gene-level comparison between Han1 and T2T-CHM13. Most genes remain similar and collinear, while a smaller set intersects structural, copy-number, or predicted coding differences. The value is not to replace one reference with another, but to make those differences inspectable.

Annotating a complete diploid genome

The complete diploid human genome benchmark (Hansen et al., 2025) represents the maternal and paternal HG002 haplotypes separately. I was excited to join this team effort to advance complete diploid genomics, with my contribution centered on genome annotation. As described in my genome annotation research, I used LiftOn as part of the effort to identify additional gene copies from translated MANE transcripts on the two haplotypes.

What interests me about the result is the direction it establishes. A complete diploid reference retains differences between homologous chromosomes instead of compressing them into one sequence. That resolution makes allele- and haplotype-specific gene content easier to represent and gives annotation a more faithful substrate for personalized genomics.

Future directions

Complete, diverse, haplotype-resolved references offer a more faithful foundation for studying human variation than any single genome. The next steps are to improve diploid assembly, document difficult sequence and provenance clearly, connect individual genomes through pangenomes, and annotate both haplotypes accurately. Building on Han1 and my annotation work for the HG002 benchmark, I want to help make these references biologically comparable at the gene level, with methods that preserve individual variation and support more precise personalized genomics.

References

  1. Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods (2021). doi:10.1038/s41592-020-01056-5
  2. Nurk, S. et al. The complete sequence of a human genome. Science (2022). doi:10.1126/science.abj6987
  3. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nature Biotechnology (2023). doi:10.1038/s41587-023-01662-6
  4. Liao, W.-W. et al. A draft human pangenome reference. Nature (2023). doi:10.1038/s41586-023-05896-x
  5. Chao, K.-H. et al. The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. G3 (2023). doi:10.1093/g3journal/jkac321
  6. Hansen, N. F. et al. A complete diploid human genome benchmark for personalized genomics. bioRxiv (2025). doi:10.1101/2025.09.21.677443