This repository contains scripts, metadata, and archived results for the paper "Taming the reference genome jungle: the refget sequence collection standard".
A preprint of the paper can be found here: https://www.biorxiv.org/content/10.1101/2025.10.06.680641v1.full.pdf
| Folder | Description |
|---|---|
| brickyard_download/ | Looper pipelines that download FASTA files from genome providers and compute GA4GH sequence collection digests. See brickyard_download/README.md. |
| seq_col_comparison/ | Looper pipelines and scripts that compare sequence collections across providers, compute Jaccard similarities, and generate figures for the paper. See seq_col_comparison/README.md. |
| pephub_results_archive/ | Archived CSV exports of pipeline results from PEPhub. Contains digest mappings and comparison results for human, mouse, and NCBI patch assemblies. |
| src/ | Standalone utility scripts (e.g. populate_timestamps.py). |
| .env | Environment variables for running on UVA Rivanna HPC. |
CONSOLIDATED_SAMPLE_LIST.csv lists all 105 reference genome FASTA files analyzed in the paper (69 human, 36 mouse) from 10 providers.
| Column | Description |
|---|---|
| sample_name | Unique identifier for the genome assembly |
| common_genome_name | Assembly group (e.g. hg38, mm39) |
| authority | Provider: ncbi, ensembl, ucsc, gencode, ENA, igenomes, refgenie, ddbj, broad, 1000genomes |
| description | Human-readable description of the assembly |
| Top Level Digest | GA4GH refget sequence collection digest (level 1) |
| Number of Sequences | Number of sequences (chromosomes + scaffolds) in the FASTA |
| brickyard_location | Path to FASTA file, relative to the refgenomes_fasta brick root |
| downloaded_timestamp | Date the FASTA was downloaded (YYYY-MM-DD) |
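
For readers unfamiliar with the digest column: GA4GH digests are built on a truncated SHA-512 primitive, commonly called `sha512t24u` (SHA-512, first 24 bytes, base64url-encoded). The sketch below shows only that primitive. It is not the full sequence collection digest, which additionally depends on how the collection's attributes are serialized under the standard, and it is not taken from this repository's pipelines, which may rely on a dedicated library. The function name and the `b"ACGT"` input are illustrative.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """Truncated digest: SHA-512, keep the first 24 bytes, base64url-encode."""
    digest = hashlib.sha512(data).digest()
    # 24 bytes encode to exactly 32 base64url characters, so no padding appears.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Illustrative input only; real digests are computed over sequence content
# and serialized collection attributes as defined by the standard.
example = sha512t24u(b"ACGT")
print(example, len(example))
```

The fixed 32-character length is why digests in the Top Level Digest column all have the same width.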
| Authority | Count | Source |
|---|---|---|
| ncbi | 24 | NCBI Assembly database (ftp.ncbi.nlm.nih.gov). GRCh37/38 and GRCm38/39 genomic, analysis sets, and patch releases. |
| ucsc | 18 | UCSC Genome Browser (hgdownload.soe.ucsc.edu). hg18/19/38 and mm9/10/39, including masked and analysis sets. |
| ensembl | 17 | Ensembl genome browser (ftp.ensembl.org). Primary assembly, toplevel, and masked variants for human and mouse. |
| igenomes | 16 | Illumina iGenomes pre-built reference collections. Copies of assemblies from UCSC, NCBI, and Ensembl. |
| gencode | 9 | GENCODE (ftp.ebi.ac.uk/pub/databases/gencode). Primary assembly and full genome for human and mouse releases. |
| refgenie | 6 | Refgenie asset server. Previously built genome assets identified by digest. |
| ENA | 5 | European Nucleotide Archive (ebi.ac.uk/ena). ENA-hosted assembly FASTA downloads. |
| ddbj | 4 | DNA Data Bank of Japan (ddbj.nig.ac.jp). Human reference FASTA files including hs37d5. |
| broad | 3 | Broad Institute resource bundle. Analysis-ready human reference sets (hg38, with/without ALT/HLA/decoy). |
| 1000genomes | 3 | 1000 Genomes Project. hs37d5 and related decoy references recommended for GRCh37 alignment. |
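
The per-authority counts above could be recomputed from the sample list with a sketch like the following. It assumes the CSV is comma-delimited and that its header contains an `authority` column spelled exactly as in the column description; the helper name `count_by_authority` and the three sample rows are made up for illustration and are not real entries.

```python
import csv
import io
from collections import Counter


def count_by_authority(csv_file):
    """Tally rows per provider using the `authority` column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["authority"].strip() for row in reader)


# Made-up rows with an assumed header layout (not real data from the repo).
sample = io.StringIO(
    "sample_name,common_genome_name,authority\n"
    "example_a,hg38,ncbi\n"
    "example_b,hg38,ucsc\n"
    "example_c,mm39,ncbi\n"
)
counts = count_by_authority(sample)
print(counts.most_common())

# Against the real file, something like this should work if the
# assumptions above hold:
#   with open("CONSOLIDATED_SAMPLE_LIST.csv", newline="") as fh:
#       print(count_by_authority(fh).most_common())
```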
pephub_results_archive/ contains CSV exports of pipeline results from PEPhub. See pephub_results_archive/README.md for column descriptions.
- digest_mapping/: Computed sequence collection digests per genome (human, mouse, NCBI patches)
- analysis_results/: Pairwise sequence collection comparisons with Jaccard similarities
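
As a reminder of what a Jaccard similarity measures, here is a minimal sketch that compares two collections treated as sets of per-sequence identifiers. Which attributes the pipelines actually compare (sequences, names, lengths) is described in seq_col_comparison/README.md and is not reproduced here; the function name and the placeholder identifiers are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables, as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Placeholder identifiers, not real sequence digests.
provider_x = {"seq1", "seq2", "seq3"}
provider_y = {"seq2", "seq3", "seq4"}
similarity = jaccard(provider_x, provider_y)
print(similarity)
```

A value of 1.0 means both collections contain exactly the same items, and 0.0 means they share none.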