Genome assembly
Assembling the first gapless Han Chinese genome
Reference genomes shape every downstream analysis — variant calling, expression, association studies — yet they are built from only a handful of individuals. That bias quietly disadvantages anyone whose ancestry is under-represented. I’m interested in building complete, diverse references so that genomics works equally well across populations.
The problem
GRCh38, the workhorse human reference, is a mosaic of a few donors and still carries gaps and errors; the first truly complete genome, T2T-CHM13, comes from a single, near-homozygous hydatidiform-mole cell line. Neither represents the diversity of Han Chinese, the largest ethnic group in the world, and reads from Han Chinese individuals map less accurately to a mismatched reference.
What I built
I assembled and annotated Han1, the first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual. The assembly combines the accuracy of PacBio HiFi reads with the span of Oxford Nanopore ultra-long reads and uses the T2T-CHM13 assembly as a guide to reach telomere-to-telomere completeness. A full gene annotation was then built on top of the finished sequence.
Figure 1. Chromosome ideogram of the Han1 assembly across all 24 human chromosomes. Solid coloring marks contiguous, gapless sequence resolved with T2T-CHM13 as a guide; lighter bands indicate centromeric and satellite regions.
What it showed
As Figure 1 shows, Han1 spans every chromosome end to end. The assembly surfaces population-specific sequence and structural variation absent from GRCh38, and reads from Han Chinese samples align more accurately to it — concrete evidence that an ancestry-matched, complete reference improves downstream analysis.
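One simple way to sanity-check a "gapless" claim on any assembly is to scan its FASTA for runs of N, the conventional placeholder for unresolved sequence. The sketch below is illustrative only and is not the pipeline used for Han1; `parse_fasta`, `gap_runs`, `gap_report`, and `is_gapless` are hypothetical helper names introduced here, and the code assumes plain, uncompressed FASTA text.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# A gap is any run of one or more N/n characters (assumed convention).
GAP = re.compile(r"[Nn]+")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_runs(seq: str) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) coordinates of each N run."""
    return [(m.start(), m.end()) for m in GAP.finditer(seq)]


def gap_report(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each sequence name to the list of gap runs found in it."""
    return {name: gap_runs(seq) for name, seq in parse_fasta(lines)}


def is_gapless(lines: Iterable[str]) -> bool:
    """True if no sequence in the input contains an N run."""
    return all(not runs for runs in gap_report(lines).values())


if __name__ == "__main__":
    # Toy input, not real assembly data.
    toy = [">chr1 toy", "ACGT", "ACNNNGT", ">chr2 toy", "ACGT"]
    for chrom, runs in gap_report(toy).items():
        print(chrom, "gapless" if not runs else f"gaps at {runs}")
```

For a real assembly the same functions can be fed an open file handle, for example `gap_report(open("assembly.fa"))`, where the file name is a placeholder. The absence of N runs is a necessary condition for a gapless assembly, not proof of correctness; assessing misassemblies and base accuracy requires separate evaluation.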
In brief: a gapless, fully annotated Han Chinese reference that helps close the diversity gap in human genomics. (G3: Genes, Genomes, Genetics, 2023; available on NCBI.)