databio/seqcol_paper_scripts

"Taming the reference genome jungle: the refget sequence collection standard"

This repository contains the scripts, metadata, and archived results for the paper "Taming the reference genome jungle: the refget sequence collection standard".

A preprint of the paper can be found here: https://www.biorxiv.org/content/10.1101/2025.10.06.680641v1.full.pdf

Repository contents

| Folder | Description |
| --- | --- |
| brickyard_download/ | Looper pipelines that download FASTA files from genome providers and compute GA4GH sequence collection digests. See brickyard_download/README.md. |
| seq_col_comparison/ | Looper pipelines and scripts that compare sequence collections across providers, compute Jaccard similarities, and generate figures for the paper. See seq_col_comparison/README.md. |
| pephub_results_archive/ | Archived CSV exports of pipeline results from PEPhub. Contains digest mappings and comparison results for human, mouse, and NCBI patch assemblies. |
| src/ | Standalone utility scripts (e.g. populate_timestamps.py). |
| .env | Environment variables for running on UVA Rivanna HPC. |

Sample List

CONSOLIDATED_SAMPLE_LIST.csv contains all 105 reference genome FASTA files analyzed in the paper (69 human, 36 mouse) from 10 providers.

| Column | Description |
| --- | --- |
| sample_name | Unique identifier for the genome assembly |
| common_genome_name | Assembly group (e.g. hg38, mm39) |
| authority | Provider: ncbi, ensembl, ucsc, gencode, ENA, igenomes, refgenie, ddbj, broad, 1000genomes |
| description | Human-readable description of the assembly |
| Top Level Digest | GA4GH refget sequence collection digest (level 1) |
| Number of Sequences | Number of sequences (chromosomes + scaffolds) in the FASTA |
| brickyard_location | Path to FASTA file, relative to the refgenomes_fasta brick root |
| downloaded_timestamp | Date the FASTA was downloaded (YYYY-MM-DD) |
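
For readers unfamiliar with the digest stored in the Top Level Digest column, the sketch below shows roughly how a GA4GH-style digest can be derived from a collection's names, lengths, and sequences. It is not the code used by the pipelines in this repository. It assumes the `sha512t24u` scheme (SHA-512, truncated to 24 bytes, base64url-encoded), approximates canonical JSON with the standard `json` module, and assumes sequence digests are computed on uppercased sequences and carry an `SQ.` prefix. The helper names `canonical_json`, `sequence_digest`, and `top_level_digest` and the toy sequences are hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512 of the input, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    """Compact, key-sorted JSON; an approximation of RFC 8785 for simple values."""
    text = json.dumps(value, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    return text.encode("utf-8")


def sequence_digest(sequence: str) -> str:
    """Digest of one sequence (assumed: uppercased input, 'SQ.' prefix)."""
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


def top_level_digest(names, lengths, sequences) -> str:
    """Digest each attribute array, then digest the object of those digests."""
    attribute_digests = {
        "names": sha512t24u(canonical_json(names)),
        "lengths": sha512t24u(canonical_json(lengths)),
        "sequences": sha512t24u(canonical_json(sequences)),
    }
    return sha512t24u(canonical_json(attribute_digests))


# Toy collection for illustration only; not a real assembly.
toy = {"chr1": "ACGTACGT", "chr2": "GGCC"}
names = list(toy)
lengths = [len(s) for s in toy.values()]
sequences = [sequence_digest(s) for s in toy.values()]
print(top_level_digest(names, lengths, sequences))
```

Under this scheme, two FASTA files with identical sequences but different chromosome names (for example `chr1` versus `1`) share the same sequences attribute digest and still receive different top-level digests.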

Genome providers (authorities)

| Authority | Count | Source |
| --- | --- | --- |
| ncbi | 24 | NCBI Assembly database (ftp.ncbi.nlm.nih.gov). GRCh37/38 and GRCm38/39 genomic, analysis sets, and patch releases. |
| ucsc | 18 | UCSC Genome Browser (hgdownload.soe.ucsc.edu). hg18/19/38 and mm9/10/39, including masked and analysis sets. |
| ensembl | 17 | Ensembl genome browser (ftp.ensembl.org). Primary assembly, toplevel, and masked variants for human and mouse. |
| igenomes | 16 | Illumina iGenomes pre-built reference collections. Copies of assemblies from UCSC, NCBI, and Ensembl. |
| gencode | 9 | GENCODE (ftp.ebi.ac.uk/pub/databases/gencode). Primary assembly and full genome for human and mouse releases. |
| refgenie | 6 | Refgenie asset server. Previously built genome assets identified by digest. |
| ENA | 5 | European Nucleotide Archive (ebi.ac.uk/ena). ENA-hosted assembly FASTA downloads. |
| ddbj | 4 | DNA Data Bank of Japan (ddbj.nig.ac.jp). Human reference FASTA files including hs37d5. |
| broad | 3 | Broad Institute resource bundle. Analysis-ready human reference sets (hg38, with/without ALT/HLA/decoy). |
| 1000genomes | 3 | 1000 Genomes Project. hs37d5 and related decoy references recommended for GRCh37 alignment. |
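
The per-authority counts above can be recomputed from the sample list. The snippet below is a minimal sketch that assumes the CSV has a header row containing the `authority` column described earlier. The function name `count_by_authority` is hypothetical, and the inline rows are placeholders rather than entries from the real file.

```python
import csv
import io
from collections import Counter


def count_by_authority(handle):
    """Count rows per value of the 'authority' column in an open CSV file."""
    return Counter(row["authority"] for row in csv.DictReader(handle))


# Placeholder rows for illustration only; they are not taken from the real file.
toy_csv = io.StringIO(
    "sample_name,common_genome_name,authority\n"
    "toy_a,hg38,ncbi\n"
    "toy_b,hg38,ucsc\n"
    "toy_c,mm39,ncbi\n"
)
counts = count_by_authority(toy_csv)
print(counts.most_common())

# Against the real file (path assumed relative to the repository root):
# with open("CONSOLIDATED_SAMPLE_LIST.csv", newline="") as handle:
#     print(count_by_authority(handle).most_common())
```

Run against the full sample list, the totals should sum to 105 if the file matches the table.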

Archived results

pephub_results_archive/ contains CSV exports of pipeline results from PEPhub. See pephub_results_archive/README.md for column descriptions.

  • digest_mapping/ — Computed sequence collection digests per genome (human, mouse, NCBI patches)
  • analysis_results/ — Pairwise sequence collection comparisons with Jaccard similarities
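
As a reminder of what a Jaccard similarity measures, the sketch below computes it for two sets of values, such as the sequence digests of two collections. This is the generic set definition (intersection size divided by union size). The exact attributes compared and any conventions used in the paper's pipelines may differ, and the function name `jaccard`, the empty-set convention, and the toy digests are assumptions for illustration.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Toy digests for illustration only; not real sequence digests.
provider_x = {"SQ.aaa", "SQ.bbb", "SQ.ccc"}
provider_y = {"SQ.bbb", "SQ.ccc", "SQ.ddd"}
print(jaccard(provider_x, provider_y))  # 2 shared / 4 total = 0.5
```

A value of 1.0 means both collections contain exactly the same set of values, and 0.0 means they share none.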
