Improve PLINK filename hashing with canonicalized parameters and SHA256 #1122

Open
mithil27360 wants to merge 6 commits into malariagen:master from mithil27360:GH1113-plink-selection-hash-canonicalization

@mithil27360

Fixes #1113

This PR ensures that PLINK export filenames uniquely represent the dataset
selection by hashing all content-affecting parameters.

Changes:

  • Introduce _compute_plink_selection_hash() helper to compute a deterministic
    hash from selection parameters
  • Canonicalize list inputs such as sample_sets via sorting to avoid
    order-dependent hashes
  • Use a truncated SHA-256 digest (10 hex characters, as in the example
    filename below) for stable, collision-resistant identifiers
  • Append the hash to the existing filename prefix while preserving human readability

Example filename: 2L.1000.2.8.0.959c5581b5
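
For readers who want to see the idea end to end, here is a minimal sketch of the approach described in the list above. It is not the code from this PR: the function name `compute_selection_hash`, the JSON serialization step, and all parameter values are assumptions made for illustration. Only the general recipe (sort `sample_sets`, hash with SHA-256, truncate, append to the prefix) comes from the description.

```python
import hashlib
import json


def compute_selection_hash(params: dict, length: int = 10) -> str:
    """Return a truncated SHA-256 hex digest of canonicalized parameters.

    Hypothetical stand-in for the PR's helper, written for illustration.
    """
    canonical = dict(params)
    # Sort list-like sample_sets so argument order does not change the hash.
    sample_sets = canonical.get("sample_sets")
    if isinstance(sample_sets, (list, tuple)):
        canonical["sample_sets"] = sorted(sample_sets)
    # sort_keys gives a stable key order; default=str covers non-JSON values.
    payload = json.dumps(canonical, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


# Human-readable prefix taken from the example filename above.
prefix = "2L.1000.2.8.0"

# Illustrative selection; every value here is made up.
selection = {
    "sample_sets": ["AG1000G-BF-B", "AG1000G-BF-A"],
    "sample_query": "taxon == 'coluzzii'",
    "sample_query_options": None,
    "sample_indices": None,
    "site_mask": "gamb_colu",
    "random_seed": 42,
}

filename = f"{prefix}.{compute_selection_hash(selection)}"
print(filename)
```

Serializing through `json.dumps(..., sort_keys=True)` is one simple way to get a deterministic byte string to hash; the PR may canonicalize differently.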

Tests added to verify:

  • Deterministic hashing (same params → same hash)
  • Different parameters produce different hashes
  • Ordering of sample_sets does not affect the result
  • Hash format and length
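
The four properties listed above could be checked along the following lines. This is a sketch, not the tests added in the PR: `selection_hash` is a hypothetical stand-in for the real helper, and the sample set names and seeds are placeholders. The functions follow the pytest naming convention, so they can be collected by pytest or called directly.

```python
import hashlib
import json
import re


def selection_hash(**params) -> str:
    # Hypothetical stand-in for the PR's helper, not its actual code.
    if isinstance(params.get("sample_sets"), (list, tuple)):
        params["sample_sets"] = sorted(params["sample_sets"])
    payload = json.dumps(params, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:10]


def test_deterministic():
    # Same parameters must give the same hash on every call.
    first = selection_hash(sample_sets=["A", "B"], random_seed=42)
    second = selection_hash(sample_sets=["A", "B"], random_seed=42)
    assert first == second


def test_different_params_differ():
    # Changing a content-affecting parameter should change the hash.
    assert selection_hash(sample_sets=["A"], random_seed=1) != selection_hash(
        sample_sets=["A"], random_seed=2
    )


def test_sample_sets_order_ignored():
    # Reordering sample_sets must not change the hash.
    assert selection_hash(sample_sets=["A", "B"]) == selection_hash(
        sample_sets=["B", "A"]
    )


def test_format_and_length():
    # The hash is expected to be 10 lowercase hex characters.
    assert re.fullmatch(r"[0-9a-f]{10}", selection_hash(sample_sets=["A"]))
```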

Mithil S and others added 2 commits March 15, 2026 03:25
Fixes malariagen#1113

Introduce _compute_plink_selection_hash() helper to compute a
deterministic hash from all content-affecting selection parameters:
sample_sets, sample_query, sample_query_options, sample_indices,
site_mask, and random_seed.

Canonicalize list inputs such as sample_sets via sorting to avoid
order-dependent hashes. Use truncated SHA256 for stable, collision-
resistant identifiers. Append the hash to the existing filename prefix
while preserving human readability.

Example filename: 2L.1000.2.8.0.959c5581b5

Tests added to verify:
- deterministic hashing (same params -> same hash)
- different parameters produce different hashes
- ordering of sample_sets does not affect the result
- hash format and length
@mithil27360
Author

Note: this implementation canonicalizes list parameters such as sample_sets
before hashing so that calls with the same sets in different orders produce
the same filename. This ensures the filename reflects dataset content rather
than the exact argument ordering.

@mithil27360
Author

Could a maintainer please approve the workflows to run CI? Thanks!

Development

Successfully merging this pull request may close these issues.

Hash PLINK export filenames by content-affecting selection parameters to prevent collisions