
Camera ready #48

Open

pjhartout wants to merge 68 commits into master from camera_ready

Conversation

@pjhartout
Collaborator

Set of improvements for the camera-ready version

- Makefile with targets for running table/figure generation
- README with usage instructions and data download link
- requirements.txt with dependencies
- download_data.py script for fetching pre-generated graphs
- generate_benchmark_tables.py: Standard PGD metrics with VUN
- generate_mmd_tables.py: Gaussian TV and RBF MMD metrics

Both scripts use the polygraph-benchmark API:
- StandardPGDInterval for PGD computation
- GaussianTVMMD2BenchmarkInterval/RBFMMD2BenchmarkInterval for MMD
- Proper graph format conversion for DIGRESS tensor outputs

Computes PGD metrics using a Logistic Regression classifier instead of
TabPFN, with standard descriptors (orbit counts, degree, spectral,
clustering, GIN).

Uses PolyGraphDiscrepancyInterval with sklearn LogisticRegression
for classifier-based evaluation.

Compares standard PGD (max over individual descriptors) vs concatenated
PGD (all descriptors combined into a single feature vector).
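The classifier-based idea behind both variants can be sketched as a classifier two-sample test: train a classifier to tell reference descriptors from generated ones and map its accuracy to a discrepancy. This is an illustrative standalone function, not the PolyGraphDiscrepancyInterval implementation; `classifier_discrepancy` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def classifier_discrepancy(ref_feats, gen_feats, seed=0):
    """Train a classifier to separate reference from generated descriptor
    vectors; map cross-validated accuracy to a [0, 1] discrepancy where
    chance accuracy (0.5, indistinguishable samples) maps to 0."""
    X = np.vstack([ref_feats, gen_feats])
    y = np.concatenate([np.zeros(len(ref_feats)), np.ones(len(gen_feats))])
    clf = LogisticRegression(max_iter=1000, random_state=seed)
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    return max(0.0, 2.0 * acc - 1.0)
```

Concatenated PGD would stack all descriptor matrices column-wise before calling this once; standard PGD would call it per descriptor and take the max.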

Features:
- ConcatenatedDescriptor class with PCA dimensionality reduction
- Handles TabPFN 500-feature limit via PCA to 100 components
- Uses LogisticRegression for concatenated features
- Optimized subset mode for faster testing

Figure generation scripts:
- generate_model_quality_figures.py: Training/denoising curves
- generate_perturbation_figures.py: Metric sensitivity to edge perturbations
- generate_phase_plot.py: PGD vs VUN training dynamics
- generate_subsampling_figures.py: Bias-variance tradeoff analysis

All scripts use StandardPGDInterval from the polygraph-benchmark API.
The phase plot gracefully handles missing VUN values (requires graph_tool).

Rename [project] to [workspace] per the updated pixi schema and correct
the pypi-dependencies package name from polygraph to polygraph-benchmark.

Move the expected data path from polygraph_graphs/ to data/polygraph_graphs/
to keep generated data under the gitignored data/ directory.
Documentation is consolidated into the main README and dependencies
are managed through pyproject.toml extras.

Add a submitit-based cluster module for distributing reproducibility
workloads across SLURM nodes. Includes YAML-configurable job parameters,
job metadata tracking, and result collection helpers.

- cluster.py: shared wrapper with SlurmConfig, submit_jobs, collect_results
- configs/: default CPU and GPU SLURM configurations
- pyproject.toml: new [cluster] optional dependency group (submitit, pyyaml)

Add --slurm-config, --local, and --collect CLI options to all four table
generation scripts for distributing computation across SLURM nodes.
Each script gains a standalone task function suitable for submitit,
result reshaping helpers, and three execution modes (local, submit, collect).

Also updates DATA_DIR paths and adds tables-submit/tables-collect Make targets.

Document the full reproducibility workflow including data download,
script overview, Make targets, hardware requirements, SLURM cluster
submission, and troubleshooting tips.

Include LaTeX tables and PDF figures produced by the reproducibility
scripts so reviewers can verify outputs without re-running computation.

- Replace monolithic generate_*.py scripts with modular 01-08 experiment
  directories, each with compute.py, plot.py, and/or format.py
- Add Hydra configs for all experiments with SLURM launcher support
- Fix sparse feature OOM in GKLR (Bug 12), package name in graph_storage,
  TabPFN CPU limit workaround, and stale cache issues
- Add kernel logistic regression module and async results I/O utility
- Regenerate all tables with correct PGD values, subscores, and GKLR
  graph kernel metrics (PM/SP/WL)
- Regenerate all figures including new subsampling, perturbation,
  model quality, and phase plot visualizations
- Include all JSON result files for full reproducibility

Ensure consistent float64 dtype in kernel diagonal computation and
normalization to prevent precision issues with sparse matrix outputs.

The ego dataset has 757 graphs (an odd number), causing unequal
reference/perturbed splits that fail the equal-count requirement. Use
half = len // 2 and slice [half : 2*half] to guarantee equal sizes.
The same fix is applied to the proteins split for consistency.

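The split fix described above can be sketched as a small helper (illustrative code, not the repository's exact function; `equal_halves` is a hypothetical name):

```python
def equal_halves(items):
    """Split into two equal halves, dropping the final item when the
    count is odd (ego: 757 graphs -> 378 reference + 378 perturbed)."""
    half = len(items) // 2
    return items[:half], items[half : 2 * half]

ref, pert = equal_halves(list(range(757)))
```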
TabPFN v2.0.9 raises ValueError when its encoder produces NaN from
near-constant features after StandardScaler normalization (see
github.com/PriorLabs/TabPFN/issues/108). This caused lobster/GRAN
n=32 PGD subsampling to crash completely.

Wrap classifier fit/predict in try/except in both the CV fold loop
and the refit section. On failure, treat as indistinguishable
distributions (score=0), matching the existing constant-feature
fallback semantics.
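The fallback can be sketched like this, assuming an accuracy-based fold score where chance accuracy (0.5) corresponds to a PGD contribution of 0; `fit_or_fallback` and `_AlwaysFails` are hypothetical names, not the repository's code:

```python
import numpy as np

def fit_or_fallback(clf, X_train, y_train, X_test, y_test):
    """Score one classifier fold, treating a crashed fit as chance-level
    accuracy (the 'indistinguishable distributions' fallback)."""
    try:
        clf.fit(X_train, y_train)
        return float(np.mean(clf.predict(X_test) == y_test))
    except ValueError:  # e.g. TabPFN's encoder producing NaN
        return 0.5

class _AlwaysFails:
    """Stand-in for a classifier whose fit raises, like the TabPFN bug."""
    def fit(self, X, y):
        raise ValueError("encoder produced NaN")

score = fit_or_fallback(_AlwaysFails(), np.zeros((4, 2)), np.ones(4),
                        np.zeros((4, 2)), np.ones(4))
```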

Match the original polygraph CombinedDescriptor behavior: per-descriptor
StandardScaler + PCA, both fit on reference data only. Fix the subsample
size calculation to use 50% of the min subset capped at 2048, matching
the original experiment configuration.
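A minimal sketch of the fit-on-reference-only pipeline and the subsample rule, using sklearn; `fit_reduction` is a hypothetical name and the shapes are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_reduction(ref_feats, n_components=100):
    """Scaler + PCA for one descriptor, fit on the reference sample only
    so the generated sample cannot leak into the projection."""
    n_components = min(n_components, *ref_feats.shape)
    pipe = make_pipeline(StandardScaler(), PCA(n_components=n_components))
    return pipe.fit(ref_feats)

rng = np.random.default_rng(0)
ref = rng.normal(size=(64, 16))   # reference descriptor matrix
gen = rng.normal(size=(32, 16))   # generated descriptor matrix
pipe = fit_reduction(ref, n_components=8)
gen_reduced = pipe.transform(gen)  # transform only, never fit
# Subsample rule from the text: 50% of the smaller set, capped at 2048.
subsample = min(min(len(ref), len(gen)) // 2, 2048)
```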

Increase the reference graph count from 512 to 4096 to match the original
experiment. Fix the subsample size to 50% of the min subset capped at 2048,
consistent with the 2x requirement of PolyGraphDiscrepancyInterval.

Add a submitit launcher config for the p.hpcl94g partition with H100 GPUs,
enabling faster TabPFN computation for PGD experiments.

Recompute all experiments after fixing:
- KernelLogisticRegression float64 precision
- Ego/proteins unequal graph splits
- TabPFN NaN handling for near-constant features
- Concatenation PCA pipeline
- GKLR reference graph count and subsample sizes

PGD subsampling: 117/120 results (3 ESGG n=4096 infeasible due to
dataset size). All values within bootstrap variance of paper.
Perturbation: 25/25 results including ego dataset.
Benchmark, concatenation, GKLR tables: all 16/16 regenerated.

Includes TabPFN v6 classifier updates, plotting and formatting
improvements across all reproducibility experiments, and added
backoff/tabpfn dependencies.

Regenerate all reproducibility tables and figures using TabPFN weights
v2.5 for camera-ready preparation. Add --results-suffix support to
03_model_quality/format.py. Include comparison and merge utility scripts.

Needed for PDF-to-image conversion in diff report generation.

Refactor the VUN metric to support multiprocessing for novelty and
validity checks, and add a per-pair SIGALRM timeout on isomorphism to
prevent hangs on pathological graph pairs. Extract shared VUN helpers
into reproducibility/utils/vun.py for reuse across experiments.

Replace ad-hoc if/else branching on the weights version with a version_map
dict that raises on unknown versions instead of silently falling back.
Applied consistently across all five compute scripts.
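The version_map pattern can be sketched as follows; the version strings and checkpoint tags here are illustrative, not the real file names:

```python
def resolve_weights(version):
    """Map a TabPFN weights version to its checkpoint tag, failing loudly
    on unknown versions instead of silently falling back to a default."""
    version_map = {
        "v2": "tabpfn-weights-v2",
        "v2.5": "tabpfn-weights-v2.5",
    }
    try:
        return version_map[version]
    except KeyError:
        raise ValueError(f"unknown TabPFN weights version: {version!r}") from None
```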

Add dedicated scripts to compute VUN (Valid-Unique-Novel) metrics for
denoising-iteration checkpoints and benchmark results. These patch
existing result JSONs with VUN values using parallel isomorphism checking.

Add CPU-only (hpcl94c) and GPU (hpcl93) SLURM launcher configs for
Hydra multirun. Add experiment 09, which computes train-vs-test reference
PGD values to establish per-dataset metric baselines.

Add bold/underline formatting for best/second-best values per row in
correlation and benchmark tables. Scale correlation values by 100 for
readability. Add VUN column support in the denoising PGS table. Add subscore
ranking in the benchmark table. Rename orbit_pgs to orbit4_pgs.

Remove hard imports of rdkit and graph_tool that prevented the test
suite from loading without optional dependencies. Remove autouse=True
from fixtures that are only needed by specific tests. Add a
requires_import() skip decorator and --skip-slow marker filtering.

DGL is incompatible with PyTorch>=2.4. Replace runtime DGL comparisons
with reference values precomputed under DGL 2.3.0 / PyTorch 2.3.1,
with regeneration instructions in comments.

grakel is incompatible with numpy>=2. Replace runtime grakel
comparisons with reference gram matrices precomputed under
grakel 0.1.10, with regeneration instructions in comments.

Replace skipif("config.getoption('--skip-slow')") with the
@pytest.mark.slow decorator for consistency across the test suite.

Remove 15 files that were used during development and paper review
but are not part of the reproducibility pipeline. All unique
configuration they contained is already captured in the Hydra configs
and compute scripts.

Removed: debug utilities (_check_pgd_diffs, check_env, check_pkl),
HTML comparison generators (compare_figures, compare_pgd_v2_vs_v25,
compare_tables, generate_diff_report), one-off recomputation scripts
(recompute_training_pgd, slurm_recompute_*), merge_v2_results,
rerun_notes.md, and the generated rebuttal_vs_camera_ready_diff.html.

grakel is incompatible with numpy>=2, but the reference code should
remain accessible. Restore the function with a lazy import inside the
body and re-reference it in the skipped test_measure_runtime test.

- Mark slow tests (snippets, demo, TabPFN, bootstrap, graph_tool,
  standard PGD) so --skip-slow skips them by default
- Add xdist_group markers to prevent dataset cache races and
  graph_tool concurrency issues under parallel execution
- Add a test-all pixi task for running the full suite including slow tests
- Fix MockDescriptorKernel's missing kernel_diag abstract method
- Reduce molecule SMILES lists to 10 entries (sufficient for smoke tests)
- Switch test output from -sv to -v --tb=short for cleaner parallel output
- Use --dist loadgroup to respect xdist_group markers
- Remove the unused unattr_ref variable in test_gin_metrics.py
- Suppress pyright reportOptionalSubscript for csr_array.shape[0]

Add pyright as a dev dependency and a local pre-commit hook so the
pre-commit workflow mirrors CI (ruff check, ruff format, pyright).
Fix trailing whitespace and missing EOF newlines caught by hooks.

The reproducibility workflow is fully covered by the Makefile in
reproducibility/. Keep pixi tasks for the dev workflow only (test, docs).

- Add __iter__ to NetworkXView so list() works on dataset views
- Cast np.quantile/mean/std to float in MetricInterval.from_samples
- Replace the BinomConfidenceInterval namedtuple with a typed class
- Fix Literal type mismatches for split/variant params at call sites
- Add type-narrowing assertions in tests for Optional attributes
- Fix matplotlib private imports (use the ticker/colors modules directly)
- Exclude third-party test implementations (ggm/gran) from pyright
- Fix the pre-commit pyright hook to use pixi run

Pyright now checks tests/ and reproducibility/, which import pytest
and other dev dependencies.

This option copied generated outputs to an external paper directory,
which is no longer needed.

These options are no longer needed. Output files use fixed paths
directly instead of being parameterized through a suffix.

The suffix was always empty. Compute scripts now use fixed result
directory names that match the hardcoded paths in plot/format scripts.

The 09_train_test_reference script now embeds the tabpfn weights
version directly in the directory name.

The classifier parameter previously defaulted to None, with the actual
TabPFN instantiation hidden deep inside _descriptions_to_classifier_metric.
Now the None sentinel is resolved immediately at the top of the function,
and all docstrings document that the default is TabPFN via
default_classifier().

Each class now resolves None to default_classifier() in its __init__,
so _classifier is always a concrete ClassifierProtocol. The internal
_descriptions_to_classifier_metric now requires a classifier (keyword-only)
and never sees None.

Kept only maybe_append_jsonl as the single function. Added docstrings
to all public functions. Updated all 13 import sites.

Removed:
- 01_subsampling/compute.py: duplicated compute_pgd.py, unused by the
  Makefile and submit scripts
- 09_train_test_reference/: experiment never integrated into the
  reproduction pipeline (no Makefile target, no plot/format scripts,
  no results)

compute.py in 01_subsampling duplicated compute_pgd.py and was unused.
09_train_test_reference is kept and will be integrated into the
Makefile.

Added to compute-tables, submit-tables, and as the standalone target 09.

- Removed 6 cluster-specific launcher variants (slurm_cpu_hpcl94c,
  slurm_cpu_large, slurm_cpu_small, slurm_gpu_fallback, slurm_gpu_h100,
  slurm_gpu_hpcl93). Only generic slurm_cpu and slurm_gpu remain with
  placeholder partitions.
- Replaced hardcoded absolute paths in submit scripts with
  git rev-parse --show-toplevel.
- Replaced cluster-specific partition names with TODO placeholders.
- Cleaned up docstring references to removed launchers.

kernel_lr.py:
- C1: Use K.shape[0] for alpha_init instead of len(X)
- C4: Use np.logaddexp(0, -yf) for numerical stability
- H1: Merge objective/gradient to eliminate redundant K @ alpha
- H2: Avoid double featurization when X2 is None
- L1: Remove unused random_state and project_dim parameters
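The C4 stability point can be illustrated in isolation: np.logaddexp(0, -y*f) computes log(1 + exp(-y*f)) without overflowing for large margins, unlike the naive form. This is a standalone sketch, not the kernel_lr.py code:

```python
import numpy as np

def logistic_loss(y, f):
    """Sum_i log(1 + exp(-y_i * f_i)), computed stably: np.logaddexp
    shifts by the max internally, so huge margins never overflow exp()."""
    return np.logaddexp(0.0, -y * f).sum()

y = np.array([1.0, -1.0, 1.0])
f = np.array([2.0, -3.0, 1000.0])  # the naive exp(1000) would overflow
loss = logistic_loss(y, f)
```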

vun.py:
- C2: Replace signal.SIGALRM timeout with ThreadPoolExecutor
  (works in multiprocessing workers and on Windows)
- H3: Add edges="links" to nx.node_link_data/graph calls
- M7: Remove section separator comments
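The C2 replacement can be sketched as a generic run-with-timeout helper; this is illustrative code with hypothetical names, not the vun.py implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_with_timeout(fn, args, timeout, default):
    """Run fn(*args) with a wall-clock timeout. Unlike signal.SIGALRM this
    works off the main thread (multiprocessing workers) and on Windows; a
    hung call keeps running in its worker thread, but the caller moves on."""
    ex = ThreadPoolExecutor(max_workers=1)
    try:
        return ex.submit(fn, *args).result(timeout=timeout)
    except FutureTimeout:
        return default  # e.g. treat a hung isomorphism check as a non-match
    finally:
        ex.shutdown(wait=False)

fast = run_with_timeout(lambda a, b: a + b, (2, 3), timeout=1.0, default=None)
slow = run_with_timeout(time.sleep, (0.5,), timeout=0.05, default="timed out")
```

One trade-off of this approach: a pathological isomorphism check cannot be killed, only abandoned, so its thread lingers until it finishes on its own.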

generic_descriptors.py:
- M2: Replace .get() defensive defaults with direct access
- M3: Remove bare except Exception in PyramidMatchDescriptor

io.py:
- M1: Replace hasattr with isinstance(obj, np.generic)
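The M1 pattern in context: a json.dumps default hook where isinstance against np.generic catches every NumPy scalar type in one check. This is an illustrative sketch, not the io.py code:

```python
import json
import numpy as np

def to_builtin(obj):
    """json.dumps default= hook: np.generic is the base of all NumPy
    scalars (np.float64, np.int64, ...), so one isinstance check replaces
    hasattr guessing; .item() yields the native Python value."""
    if isinstance(obj, np.generic):
        return obj.item()
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")

payload = {"pgd": np.float64(0.25), "n": np.int64(4096)}
encoded = json.dumps(payload, default=to_builtin)
```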

polygraphdiscrepancy.py:
- P3: Vectorize _is_constant sparse check (col min/max vs row loop)
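The P3 vectorization can be sketched as follows (illustrative code with a hypothetical name, not the `_is_constant` implementation): a column is constant iff its minimum equals its maximum, and sparse min/max along axis 0 account for implicit zeros in one pass.

```python
import numpy as np
from scipy import sparse

def constant_columns(X):
    """Boolean mask of columns whose min equals max, computed without a
    Python loop over rows; implicit zeros count toward min/max."""
    col_min = np.asarray(X.min(axis=0).todense()).ravel()
    col_max = np.asarray(X.max(axis=0).todense()).ravel()
    return col_min == col_max
```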

Reproducibility scripts:
- L10: Remove duplicate runtime fields (keep *_perf_seconds only)
- L11: Remove pointless _fmt_pgs/_best_two aliases in format scripts
- M7: Remove section separator comments across all scripts
- H6: 05_benchmark/compute_vun.py now imports compute_vun_parallel
  from utils.vun instead of duplicating ~120 lines of VUN logic
- L5: Extracted make_tabpfn_classifier to utils/data.py, replaced
  6 local copies across compute scripts
- P4: EigenvalueHistogram uses scipy.sparse.linalg.eigsh for graphs
  with >500 nodes, avoiding dense conversion of large Laplacians
- L4: load_graphs/get_reference_dataset in 01-03 now delegate to
  utils/data.py instead of local copies
- L6: load_results extracted to utils/formatting.py, removed from
  4 format scripts
- P1: Warn when kernel matrix exceeds 10k samples in kernel_lr.py
- H4: download_data.py now verifies SHA-256 hash after download,
  before extraction. Uses placeholder hash with TODO for now.
- M8: Restore test thresholds from 0.5 to 0.7 in
  test_polygraphdiscrepancy.py. The test distributions (ER 0.8 vs
  ER 0.1) are clearly distinct; 0.5 was essentially random chance.
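The H4 verification step can be sketched with stdlib hashlib, streaming the archive so multi-GB downloads never load into memory at once; `sha256_of` and `verify_download` are hypothetical names, not the download_data.py code:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path, expected_hex):
    """Check the archive's hash before extraction; raise on mismatch."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise RuntimeError(f"hash mismatch for {path}: got {actual}")
```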