Warn when fine-tuning chromosomes overlap with pretraining data #184

Open
LiudengZhang wants to merge 1 commit into Genentech:main from LiudengZhang:feat/warn-chrom-overlap-finetune
@LiudengZhang

Closes #59.

tune_on_dataset() now checks whether the fine-tuning dataset's chromosomes overlap with the chromosomes the pretrained model was trained on. If they do, it emits a warning via warnings.warn(), since the overlap may indicate data leakage.

The check uses chromosome metadata already stored in the checkpoint via data_params['train']['intervals']. It is skipped when either the pretrained model or the new dataset lacks interval information (e.g. string-based sequences).
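The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `_chroms_from_intervals` and `warn_on_chrom_overlap` are hypothetical, and the sketch assumes that `intervals` is a mapping with a `"chrom"` entry listing chromosome names, which may not match the actual checkpoint layout.

```python
import warnings
from collections.abc import Mapping
from typing import Optional, Set


def _chroms_from_intervals(intervals) -> Optional[Set[str]]:
    """Return the set of chromosome names, or None if unavailable.

    Assumption: `intervals` is a mapping with a "chrom" key. Anything else
    (e.g. None for string-based sequences) is treated as "no interval info".
    """
    if not isinstance(intervals, Mapping) or "chrom" not in intervals:
        return None
    return set(intervals["chrom"])


def warn_on_chrom_overlap(data_params: Mapping, new_intervals) -> Set[str]:
    """Warn if fine-tuning chromosomes overlap with pretraining chromosomes.

    Hypothetical helper: returns the overlapping chromosomes (empty set if
    none, or if either side lacks interval information).
    """
    train_params = data_params.get("train") or {}
    pretrain = _chroms_from_intervals(train_params.get("intervals"))
    new = _chroms_from_intervals(new_intervals)

    # Skip the check when either side has no interval metadata.
    if pretrain is None or new is None:
        return set()

    overlap = pretrain & new
    if overlap:
        warnings.warn(
            "Fine-tuning chromosomes overlap with pretraining chromosomes "
            f"({sorted(overlap)}); this may indicate data leakage.",
            UserWarning,
            stacklevel=2,
        )
    return overlap
```

Emitting a warning rather than raising keeps fine-tuning on overlapping chromosomes possible when it is intentional, while still making the potential leakage visible.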

Tests added:

  • test_lightning_model_finetune_chrom_overlap_warning: simulates a pretrained model trained on chr1, fine-tunes with a chr1 dataset, asserts warning fires
  • test_lightning_model_finetune_no_chrom_warning: fine-tunes with string-based dataset, asserts no warning
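The two test cases can be illustrated with the standard-library `warnings` module alone. The sketch below is an assumption-laden stand-in, not the tests added in this PR: `fake_tune` is a hypothetical placeholder for the fine-tuning call, and the real tests may use a different mechanism to capture warnings.

```python
import warnings


def fake_tune(pretrain_chroms, new_chroms):
    """Hypothetical stand-in for the fine-tuning entry point."""
    if pretrain_chroms is None or new_chroms is None:
        return  # no interval metadata on one side: the check is skipped
    shared = set(pretrain_chroms) & set(new_chroms)
    if shared:
        warnings.warn(f"Chromosome overlap: {sorted(shared)}", UserWarning)


def test_overlap_warns():
    # Record every warning so the assertion can inspect what was emitted.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        fake_tune(["chr1"], ["chr1"])
    assert any("chr1" in str(w.message) for w in caught)


def test_string_dataset_does_not_warn():
    # Escalate warnings to errors: any warning would fail this test.
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        fake_tune(["chr1"], None)
```

Escalating warnings to errors in the negative case makes an unexpected warning fail loudly instead of passing silently.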

Add a warnings.warn() in tune_on_dataset() that checks whether the
fine-tuning dataset's chromosomes overlap with the pretrained model's
training chromosomes, which may indicate data leakage.

The check uses chromosome metadata already stored in the checkpoint
via data_params['train']['intervals']. It is skipped when either the
pretrained model or the new dataset lacks interval information.

Closes Genentech#59

Development

Successfully merging this pull request may close these issues.

Check chromosomes when fine tuning?
