adding underfill case #1528

Open
ghasemiAb wants to merge 5 commits into NVIDIA:main from ghasemiAb:underfill-ag
Conversation

@ghasemiAb

PhysicsNeMo Pull Request

Description

Checklist

Dependencies

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@greptile-apps
Contributor

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR adds a new CFD example (examples/cfd/underfill/) that trains a GeoTransolver surrogate model to autoregressively predict epoxy Volume-of-Fluid (VOF) flow fronts from transient CFD simulation data stored as VTP files. The architecture, training loop, interface-band loss, and VTP reader are all well-conceived additions, but the PR has several issues — two of which affect correctness — that should be resolved before merging.

Key issues found:

  • [Logic] inference.py double-counts ground-truth statistics (MAE/RMSE) per timestep when both gt_seq and source VTP files are available, because the two extraction blocks are not mutually exclusive. The reported overall error metrics will be incorrect whenever both sources exist.
  • [Logic] conf/reader/vtp.yaml has _target_: vtu_reader.Reader (a non-existent module in this PR) instead of _target_: vtp_reader.Reader, causing an ImportError for any workflow using this Hydra config override.
  • [Logic] datapipe.py saves normalization stats to a relative path ("stats") that is resolved relative to Hydra's chdir-ed output directory. Inference and validation runs in a different working directory will silently fall back to computing stats from the wrong split, potentially producing a normalization mismatch.
  • [Style] Both conf/training/default.yaml and conf/inference/default.yaml contain hardcoded personal absolute paths (/workspace/aghasemi/...) that will fail immediately for any other user.
  • [Style] config.yaml retains a stale experiment description copied from a crash-models example.
  • [Style] datapipe.py contains several commented-out code paths (_compute_feature_stats, _log_statistics) that leave dead code and obscure the intentional decision to skip feature normalization.
  • [Style] train.py uses torch.optim.Muon, which requires PyTorch ≥ 2.7 but this version requirement is not documented in the README or referenced requirements.txt.

Important Files Changed

Filename | Overview
examples/cfd/underfill/inference.py | Contains a logic bug where ground-truth statistics are double-appended when both gt_seq and source VTP files are available, corrupting the final MAE/RMSE/MSE statistics.
examples/cfd/underfill/conf/reader/vtp.yaml | Wrong _target_: references vtu_reader.Reader instead of vtp_reader.Reader, causing an ImportError for any workflow using this config override.
examples/cfd/underfill/conf/training/default.yaml | Contains hardcoded personal workspace paths that will fail for other users; also absolute_expansion: 0 silently overrides band_fraction without clear documentation.
examples/cfd/underfill/conf/config.yaml | Stale experiment description copied from crash-models example; reader config block defined inline but conf/reader/ override also exists with a broken target.
examples/cfd/underfill/datapipe.py | Normalization statistics directory saved to a relative path that is sensitive to Hydra's chdir; feature normalization is silently hardcoded via commented-out code, leaving dead code.
examples/cfd/underfill/train.py | Well-structured training loop with per-timestep interface loss and AMP support; uses torch.optim.Muon (requires PyTorch ≥ 2.7) without documenting the version requirement.
examples/cfd/underfill/rollout.py | Clean autoregressive rollout implementation with gradient checkpointing; compute_interface_band logic is correct and well-documented.
examples/cfd/underfill/vtp_reader.py | Robust VTP reader with pattern-based time-series discovery and natural sort; no issues found.
examples/cfd/underfill/conf/inference/default.yaml | Contains hardcoded personal workspace path and a malformed duplicate-key comment on line 20.

Reviews (1): Last reviewed commit: "adding underfill case"

Comment on lines +259 to +273
if compute_error and gt_seq is not None and len(gt_seq) >= timestep:
    try:
        gt_np = gt_seq[t].detach().cpu().numpy().squeeze()
        gt_stats = TimestepStats.from_array(gt_np)
        all_gt_stats.append(gt_stats)
        error = pred_np - gt_np
        mae = float(np.abs(error).mean())
        rmse = float(np.sqrt((error ** 2).mean()))
        all_mae.append(mae)
        all_rmse.append(rmse)
        gt_available_count += 1
    except Exception:
        pass

if os.path.exists(source_file):
Contributor

P2 Double-counting statistics when both gt_seq and source files are available

The two ground-truth extraction blocks (lines 259–269 and 273–298) are not mutually exclusive. When gt_seq is provided and the source VTP file also exists, both blocks succeed and each one appends to all_gt_stats, all_mae, all_rmse, and increments gt_available_count. The result is that every timestep with both sources present is counted twice, making the reported total_mae, total_rmse, and total_mse incorrect.

The comment "Prefer dataset-provided ground truth if available" signals the intent, but the second block lacks an elif/else guard. It should only run when the first block did not already populate gt_stats.

Suggested change
if compute_error and gt_seq is not None and len(gt_seq) >= timestep:
    try:
        gt_np = gt_seq[t].detach().cpu().numpy().squeeze()
        gt_stats = TimestepStats.from_array(gt_np)
        all_gt_stats.append(gt_stats)
        error = pred_np - gt_np
        mae = float(np.abs(error).mean())
        rmse = float(np.sqrt((error ** 2).mean()))
        all_mae.append(mae)
        all_rmse.append(rmse)
        gt_available_count += 1
    except Exception:
        pass

if os.path.exists(source_file):

# Prefer dataset-provided ground truth if available
if compute_error and gt_seq is not None and len(gt_seq) >= timestep:
    try:
        gt_np = gt_seq[t].detach().cpu().numpy().squeeze()
        gt_stats = TimestepStats.from_array(gt_np)
        all_gt_stats.append(gt_stats)
        error = pred_np - gt_np
        mae = float(np.abs(error).mean())
        rmse = float(np.sqrt((error ** 2).mean()))
        all_mae.append(mae)
        all_rmse.append(rmse)
        gt_available_count += 1
    except Exception:
        pass

if os.path.exists(source_file) and gt_stats is None:
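The effect of the missing guard can be reproduced with a toy stand-in for the two extraction blocks (illustrative names only, not the PR's actual code):

```python
# Toy reproduction of the non-exclusive ground-truth blocks in inference.py.
# A simple mean stands in for TimestepStats; names are illustrative.
def collect_stats(gt_seq, source_exists):
    all_mae = []
    gt_stats = None
    # Block 1: dataset-provided ground truth
    if gt_seq is not None:
        gt_stats = sum(gt_seq) / len(gt_seq)
        all_mae.append(gt_stats)
    # Block 2 (buggy): runs even when block 1 already appended
    if source_exists:
        all_mae.append(gt_stats if gt_stats is not None else 0.0)
    return all_mae


def collect_stats_fixed(gt_seq, source_exists):
    all_mae = []
    gt_stats = None
    if gt_seq is not None:
        gt_stats = sum(gt_seq) / len(gt_seq)
        all_mae.append(gt_stats)
    # Guarding on gt_stats makes the two branches mutually exclusive
    if source_exists and gt_stats is None:
        all_mae.append(0.0)
    return all_mae
```

With both sources present, the buggy version appends two entries per timestep while the guarded one appends exactly one, which is what keeps the aggregate MAE/RMSE honest.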

# See the License for the specific language governing permissions and
# limitations under the License.

_target_: vtu_reader.Reader
Contributor

P2 Wrong _target_ in reader config

The target class is vtu_reader.Reader, but the file is vtp.yaml and the reader module introduced in this PR is vtp_reader.py, whose class is vtp_reader.Reader. Any workflow that uses this override config (e.g., reader=vtp) will fail with an ImportError / ModuleNotFoundError.

Suggested change
_target_: vtu_reader.Reader
_target_: vtp_reader.Reader

Comment on lines +22 to +23
raw_data_dir: "/workspace/aghasemi/isv/ansys/data/converted_output_singleVTU-VTP2/train_all"
raw_data_dir_validation: "/workspace/aghasemi/isv/ansys/data/converted_output_singleVTU-VTP2/val"
Contributor

P2 Hardcoded personal workspace paths

Both raw_data_dir and raw_data_dir_validation point to a specific user's workspace directory (/workspace/aghasemi/...). These will fail for any other user or machine. They should be replaced with descriptive placeholder values, consistent with how similar examples in this repository are documented.

Suggested change
raw_data_dir: "/workspace/aghasemi/isv/ansys/data/converted_output_singleVTU-VTP2/train_all"
raw_data_dir_validation: "/workspace/aghasemi/isv/ansys/data/converted_output_singleVTU-VTP2/val"
raw_data_dir: "/path/to/train_data"
raw_data_dir_validation: "/path/to/val_data"

Comment on lines +19 to +20
raw_data_dir_test: "/workspace/aghasemi/isv/ansys/data/converted_output_singleVTU-VTP2/val"
#raw_data_dir_test: raw_data_dir_test:"/workspace/isv/ansys/data/converted_output_singleVTU-VTP/val" No newline at end of file
Contributor

P2 Hardcoded personal path and malformed comment

raw_data_dir_test contains the same personal absolute path. Additionally, line 20 is a malformed comment that contains what looks like an accidental key duplication (raw_data_dir_test: raw_data_dir_test:"/..."); this should either be cleaned up or removed.

Suggested change
raw_data_dir_test: "/workspace/aghasemi/isv/ansys/data/converted_output_singleVTU-VTP2/val"
#raw_data_dir_test: raw_data_dir_test:"/workspace/isv/ansys/data/converted_output_singleVTU-VTP/val"
raw_data_dir_test: "/path/to/test_data"

Comment on lines +23 to +25
experiment_name: "Unified-Training"
experiment_desc: "unified training recipe for crash models"
run_desc: "unified training recipe for crash models"
Contributor

P2 Stale experiment description copied from another example

experiment_desc and run_desc both read "unified training recipe for crash models", which describes a completely different application. These should describe the underfill use case.

Suggested change
experiment_name: "Unified-Training"
experiment_desc: "unified training recipe for crash models"
run_desc: "unified training recipe for crash models"
experiment_name: "Underfill-Training"
experiment_desc: "GeoTransolver autoregressive rollout for transient epoxy VOF prediction"
run_desc: "GeoTransolver autoregressive rollout for transient epoxy VOF prediction"

Comment thread examples/cfd/underfill/datapipe.py Outdated
Comment on lines +383 to +418
    #self.feature_stats = self._compute_feature_stats()
    # Hardcode feature stats to make normalization a no-op
    self.feature_stats = {
        "feature_mean": torch.zeros(1, dtype=torch.float32),
        "feature_std": torch.ones(1, dtype=torch.float32),
    }

    # Save for validation/inference (convert to pure Python types)
    node_stats_serializable = _stats_to_serializable(self.node_stats)
    feat_stats_serializable = _stats_to_serializable(self.feature_stats)

    save_json(node_stats_serializable, node_stats_path)
    save_json(feat_stats_serializable, feat_stats_path)
    self._log(f" Saved statistics to {self._stats_dir}/")

else:
    # Load from saved training stats
    if os.path.exists(node_stats_path) and os.path.exists(feat_stats_path):
        self._log(f"\n Loading statistics from {self._stats_dir}/")
        self.node_stats = _stats_from_serializable(load_json(node_stats_path))
        #self.feature_stats = _stats_from_serializable(load_json(feat_stats_path))
        # Hardcode feature stats to make normalization a no-op
        self.feature_stats = {
            "feature_mean": torch.zeros(1, dtype=torch.float32),
            "feature_std": torch.ones(1, dtype=torch.float32),
        }
    else:
        self._log("\n WARNING: No saved statistics found, computing from current split")
        self._log(" Run training first to generate statistics!")
        self.node_stats = self._compute_node_stats()
        #self.feature_stats = self._compute_feature_stats()
        # Hardcode feature stats to make normalization a no-op
        self.feature_stats = {
            "feature_mean": torch.zeros(1, dtype=torch.float32),
            "feature_std": torch.ones(1, dtype=torch.float32),
        }
Contributor

P2 Commented-out statistics code leaves dead code and silent hardcoding

The calls to _compute_feature_stats (lines 383, 403, 412) and _log_statistics (line 421) are commented out, and feature stats are unconditionally hardcoded to mean=0, std=1 (identity). While this appears intentional (VOF is already in [0,1]), the dead code is misleading to future contributors who may not realise the normalization is a deliberate no-op. The _compute_feature_stats method (lines 458–466) remains entirely unreachable.

Consider either:

  • Removing the unused method and commented lines, and adding a clear comment explaining why feature normalization is skipped, or
  • Restoring the normalization path with appropriate documentation.
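If the first option is taken, the three hardcoded stats dicts could collapse into one clearly-documented helper. A sketch with illustrative names, using plain Python lists in place of the PR's torch tensors:

```python
def identity_feature_stats():
    """Deliberate no-op feature normalization.

    The epoxy VOF field is already bounded in [0, 1], so mean/std
    normalization is intentionally skipped: mean=0 and std=1 make
    (x - mean) / std an identity transform.
    """
    return {"feature_mean": [0.0], "feature_std": [1.0]}
```

Calling this from all three code paths would both remove the duplication and make the "normalization is intentionally a no-op" decision self-documenting.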

Comment thread examples/cfd/underfill/datapipe.py Outdated
Comment on lines +291 to +292
self._stats_dir = STATS_DIRNAME
os.makedirs(self._stats_dir, exist_ok=True)
Contributor

P2 Stats directory saved relative to Hydra's working directory

self._stats_dir = STATS_DIRNAME (i.e., "stats") is a relative path. Because config.yaml sets hydra.job.chdir: True, Hydra changes the working directory to ./outputs/ at runtime, so training saves stats to ./outputs/stats/. When inference or validation is later run, their working directory may be different, causing the stats load to fall back to computing from the current split — which may produce different normalization than training used. This silent normalization mismatch can degrade inference accuracy.

Consider making the stats path configurable (e.g., tied to ckpt_path) or using an absolute path derived from the configured data directory.

Comment on lines +279 to +284
muon_opt = torch.optim.Muon(
    muon_params,
    lr=base_lr,
    weight_decay=weight_decay,
    adjust_lr_fn="match_rms_adamw",
)
Contributor

P2 torch.optim.Muon requires PyTorch ≥ 2.7

torch.optim.Muon was added in PyTorch 2.7. Users on earlier versions will get an AttributeError with no actionable error message. The requirements.txt (referenced in the README) should pin torch>=2.7 and the README prerequisites section should document this dependency explicitly to avoid a confusing failure at runtime.
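Beyond pinning the requirement, a runtime guard could fail gracefully. A hedged sketch (helper names and the AdamW fallback are illustrative assumptions, not the PR's behavior); note it compares parsed version tuples rather than strings, since "2.10" > "2.7" fails lexicographically:

```python
def version_tuple(v: str) -> tuple:
    """Parse a version string like '2.7.1+cu121' into comparable ints."""
    core = v.split("+")[0]  # drop local build tags such as +cu121
    return tuple(int(p) for p in core.split(".")[:2])


def pick_optimizer_name(torch_version: str) -> str:
    """Choose Muon when available, otherwise fall back (illustrative)."""
    if version_tuple(torch_version) >= (2, 7):
        return "Muon"
    return "AdamW"  # safe fallback; still document torch>=2.7 in requirements.txt
```

In the actual example, the version string would come from `torch.__version__` and the names would map onto the existing optimizer-construction code.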

Collaborator

@coreyjadams left a comment

Hi @ghasemiAb ,

Thanks for this PR! Overall I think it looks really good. I have scattered some comments here and there, mostly it is request for more organization / code clean up / better comments / docstrings etc. I think this example will get a lot of attention so I am pushing a little to encourage you to give it an extra level of polish.

The addition of a rollout example for GeoTransolver on a static mesh is pretty cool. Users have been asking about this type of workload, so this will get attention. It's really a nice addition to physicsnemo and I'm happy you've contributed it! Your choice to drive the loss function by the boundary layer, rather than the static components of the mesh, is also good.

I have a couple of mandatory updates for approval, if you don't mind:

  • Please add a requirements.txt file. There are some things that have to be included there (like pyvista, for IO), and torch>(whenever muon was included, maybe that was 2.9?).
  • Your readme has some math rendering errors, can you make sure to go through it again and fix?
  • Please add some example convergence plots and visualizations to the README, if you can.

Hopefully, these aren't too much!

run:
dir: ./outputs/

experiment_name: "Unified-Training"
Collaborator

I think these names are outdated?

Comment on lines +58 to +134
def _to_python_native(value: Any) -> Any:
    """
    Recursively convert tensor/numpy values to Python native types.

    This ensures JSON serialization works without any numpy/torch dependencies.

    Args:
        value: Any value (tensor, numpy array, list, dict, scalar, etc.)

    Returns:
        Python native type (list, dict, float, int, etc.)
    """
    if isinstance(value, torch.Tensor):
        # Convert tensor to Python list
        return value.detach().cpu().tolist()
    elif isinstance(value, np.ndarray):
        # Convert numpy array to Python list
        return value.tolist()
    elif isinstance(value, (np.floating, np.float32, np.float64)):
        # Convert numpy float to Python float
        return float(value)
    elif isinstance(value, (np.integer, np.int32, np.int64)):
        # Convert numpy int to Python int
        return int(value)
    elif isinstance(value, dict):
        # Recursively convert dict values
        return {k: _to_python_native(v) for k, v in value.items()}
    elif isinstance(value, (list, tuple)):
        # Recursively convert list/tuple elements
        return [_to_python_native(v) for v in value]
    elif hasattr(value, 'item'):
        # Handle any other type with .item() method (scalars)
        return value.item()
    else:
        # Already a Python native type
        return value


def _to_tensor(value: Any, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """
    Safely convert a value to a torch tensor.

    Handles: torch.Tensor, numpy.ndarray, list, scalar values.

    Args:
        value: Input value to convert
        dtype: Target dtype

    Returns:
        torch.Tensor
    """
    if isinstance(value, torch.Tensor):
        return value.to(dtype=dtype)
    elif isinstance(value, np.ndarray):
        return torch.from_numpy(value.copy()).to(dtype=dtype)
    elif isinstance(value, (list, tuple)):
        return torch.tensor(value, dtype=dtype)
    else:
        return torch.tensor(value, dtype=dtype)


def _to_numpy(value: Any) -> np.ndarray:
    """
    Safely convert a value to a numpy array.

    Args:
        value: Input value (tensor, array, list, etc.)

    Returns:
        numpy.ndarray
    """
    if isinstance(value, torch.Tensor):
        return value.detach().cpu().numpy()
    elif isinstance(value, np.ndarray):
        return value
    else:
        return np.asarray(value)
Collaborator

What is all of this conversion code doing? Is it necessary? Is this to save statistics or something else?

Comment on lines +30 to +37
_TIME_SERIES_PATTERNS: list[re.Pattern] = [
    # <field>_step00, <field>_step01, ...
    re.compile(r"^(?P<field>.+?)_step(?P<idx>\d+)$"),
    # <field>_t0.000, <field>_t0.005, ... (float time label)
    re.compile(r"^(?P<field>.+?)_t(?P<idx>\d+\.\d+)$"),
    # <field>_00, <field>_01, ... (bare numeric suffix)
    re.compile(r"^(?P<field>.+?)_(?P<idx>\d+)$"),
]
Collaborator

Is there a better way to do this? In general, unless there is a performance reason to want to use regular expressions I'm not a huge fan. I don't know how to cast spells so easily and its harder to maintain. Can we use some sort of glob to pattern match, unless that is too slow?
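A regex-free alternative along the lines the reviewer suggests could split each key on its last underscore. This sketch is a hypothetical helper covering only the integer-suffix patterns (`_stepNN` and bare `_NN`), not the float time labels:

```python
def parse_series_key(key: str):
    """Return (field, index) for keys like 'epoxy_vof_step07' or
    'epoxy_vof_12'; return None for non-series keys like 'coords'."""
    if "_" not in key:
        return None
    field, suffix = key.rsplit("_", 1)  # split on the LAST underscore
    if suffix.startswith("step"):
        suffix = suffix[len("step"):]
    if suffix.isdigit():
        return field, int(suffix)
    return None
```

Sorting the resulting `(field, index)` tuples by the integer index also gives the natural ordering for free, without a separate natural-sort pass. Whether this is actually clearer than three anchored regexes is a judgment call; the float `_t0.005` labels would need one extra branch.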


arrays: list[np.ndarray] = []
for _idx, key in entries:
    arr = np.asarray(mesh.point_data[key], dtype=np.float64)
Collaborator

Are you sure float64 is necessary?
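The float64 question is ultimately about a 2x memory cost per node. A dependency-free illustration using the stdlib array module (for a VOF scalar bounded in [0, 1], float32 precision is typically more than enough):

```python
from array import array

# 'd' is C double (float64), 'f' is C float (float32)
f64 = array("d", [0.0] * 1000)
f32 = array("f", [0.0] * 1000)

bytes_f64 = f64.itemsize * len(f64)  # 8 bytes per element
bytes_f32 = f32.itemsize * len(f32)  # 4 bytes per element
```

On meshes with millions of nodes and many timesteps, halving per-element storage also halves I/O and host-to-device transfer volume.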

(train.py), not here.
"""

def __init__(self, *args, **kwargs):
Collaborator

I would prefer this function and constructor to have a proper syntax with named arguments and docstring, unless there is a compelling reason to do it this way.

Comment on lines +54 to +91
class CombinedOptimizer(Optimizer):
    """Combine multiple PyTorch optimizers under a single Optimizer-like interface."""

    def __init__(
        self,
        optimizers: Sequence[Optimizer],
        torch_compile_kwargs: dict[str, Any] | None = None,
    ):
        if not optimizers:
            raise ValueError("`optimizers` must contain at least one optimizer.")
        self.optimizers = optimizers
        param_groups = [g for opt in optimizers for g in opt.param_groups]
        super().__init__(param_groups, defaults={})
        if torch_compile_kwargs is None:
            self.step_fns: list[Callable] = [opt.step for opt in optimizers]
        else:
            self.step_fns: list[Callable] = [
                torch.compile(opt.step, **torch_compile_kwargs) for opt in optimizers
            ]

    def zero_grad(self, *args, **kwargs) -> None:
        for opt in self.optimizers:
            opt.zero_grad(*args, **kwargs)

    def step(self, closure=None) -> None:
        for step_fn in self.step_fns:
            if closure is None:
                step_fn()
            else:
                step_fn(closure)

    def state_dict(self):
        return {"optimizers": [opt.state_dict() for opt in self.optimizers]}

    def load_state_dict(self, state_dict):
        for opt, sd in zip(self.optimizers, state_dict["optimizers"]):
            opt.load_state_dict(sd)
        self.param_groups = [g for opt in self.optimizers for g in opt.param_groups]
Collaborator

This is upstreamed in physicsnemo and needs to be removed here. Look in physicsnemo.optim now, please :)

Comment on lines +99 to +103
"""
Per-timestep interface-only loss.

Uses absolute_expansion for predictable behavior on normalized coords.
"""
Collaborator

Please expand this docstring with more explanation about this loss construction and its action in the roll-out example?
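For context, the band selection such a docstring should describe amounts to restricting the loss to partially-filled nodes near the moving flow front. A minimal pure-Python sketch with illustrative names and thresholds, standing in for the PR's tensor implementation (which additionally supports absolute_expansion and an interface axis):

```python
def interface_band_mask(vof, vof_lo=0.05, vof_hi=0.95):
    """Boolean mask selecting interface (partially-filled) nodes.

    Nodes with VOF near 0 (empty) or near 1 (filled) are static and carry
    little training signal; only the band in between tracks the flow front.
    """
    return [vof_lo < v < vof_hi for v in vof]
```

Computing the loss only over this mask at each timestep concentrates the gradient signal on the advancing epoxy front rather than the bulk of the static mesh.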

Comment on lines +486 to +606
# ══════════════════════════════════════════════════════════════════════
# Pre-training summary (rank 0 only)
# ══════════════════════════════════════════════════════════════════════
if dist.rank == 0:
    # ── Model parameters ──────────────────────────────────────────────
    model_raw = (
        trainer.model.module
        if isinstance(trainer.model, DistributedDataParallel)
        else trainer.model
    )
    total_params = sum(p.numel() for p in model_raw.parameters())
    trainable_params = sum(
        p.numel() for p in model_raw.parameters() if p.requires_grad
    )
    muon_params = sum(
        p.numel() for p in model_raw.parameters() if p.ndim == 2
    )
    other_params = trainable_params - muon_params

    logger0.info("")
    logger0.info("=" * 72)
    logger0.info(" TRAINING CONFIGURATION")
    logger0.info("=" * 72)

    # ── Data ──────────────────────────────────────────────────────────
    logger0.info("")
    logger0.info(" ┌─ Data ────────────────────────────────────────────────┐")
    logger0.info(f" │ Train dir: {cfg.training.raw_data_dir}")
    logger0.info(f" │ Validation dir: {cfg.training.raw_data_dir_validation}")
    logger0.info(f" │ Train samples: {cfg.training.num_samples}")
    logger0.info(f" │ Validation samples: {cfg.training.num_validation_samples}")
    logger0.info(f" │ Time steps (T): {cfg.training.num_time_steps}")
    logger0.info(f" │ Rollout steps: {trainer.rollout_steps}")
    logger0.info(f" │ Dataloader workers: {cfg.training.num_dataloader_workers}")
    logger0.info(" └────────────────────────────────────────────────────────┘")

    # ── Model ─────────────────────────────────────────────────────────
    logger0.info("")
    logger0.info(" ┌─ Model ───────────────────────────────────────────────┐")
    logger0.info(f" │ Architecture: {model_raw.__class__.__name__}")
    logger0.info(f" │ Total parameters: {total_params:,}")
    logger0.info(f" │ Trainable: {trainable_params:,}")
    logger0.info(f" │ Muon (2D): {muon_params:,}")
    logger0.info(f" │ AdamW (other): {other_params:,}")
    if hasattr(model_raw, "rollout_steps"):
        logger0.info(f" │ Rollout steps: {model_raw.rollout_steps}")
    if hasattr(model_raw, "num_fourier_frequencies"):
        logger0.info(f" │ Fourier freqs: {model_raw.num_fourier_frequencies}")
    if hasattr(cfg, "model"):
        model_cfg = cfg.model
        for key in [
            "functional_dim", "out_dim", "geometry_dim",
            "slice_num", "n_layers",
        ]:
            val = getattr(model_cfg, key, None)
            if val is not None:
                logger0.info(f" │ {key + ':' :<20} {val}")
    logger0.info(" └────────────────────────────────────────────────────────┘")

    # ── Optimization ──────────────────────────────────────────────────
    scheduler_T0 = getattr(cfg.training, "scheduler_T0", 50)
    scheduler_T_mult = getattr(cfg.training, "scheduler_T_mult", 2)

    logger0.info("")
    logger0.info(" ┌─ Optimization ────────────────────────────────────────┐")
    logger0.info(f" │ Epochs: {cfg.training.epochs}")
    logger0.info(f" │ Start LR: {cfg.training.start_lr}")
    logger0.info(f" │ End LR (eta_min): {cfg.training.end_lr}")
    logger0.info(" │ Scheduler: CosineAnnealingWarmRestarts")
    logger0.info(f" │ T_0: {scheduler_T0}")
    logger0.info(f" │ T_mult: {scheduler_T_mult}")
    logger0.info(f" │ Weight decay: {getattr(cfg.training, 'weight_decay', 1e-4)}")
    logger0.info(" │ Grad clip max_norm: 25.0")
    logger0.info(f" │ AMP enabled: {cfg.training.amp}")
    logger0.info(" └────────────────────────────────────────────────────────┘")

    # ── Interface loss ────────────────────────────────────────────────
    c = trainer.criterion
    logger0.info("")
    logger0.info(" ┌─ Interface Loss ──────────────────────────────────────┐")
    logger0.info(f" │ VOF thresholds: ({c.vof_lo}, {c.vof_hi})")
    logger0.info(f" │ Band fraction: {c.band_fraction}")
    logger0.info(f" │ Absolute expansion: {c.absolute_expansion}")
    logger0.info(f" │ Interface axis: {c.interface_axis} (-1 = auto)")
    logger0.info(" └────────────────────────────────────────────────────────┘")

    # ── Infrastructure ────────────────────────────────────────────────
    logger0.info("")
    logger0.info(" ┌─ Infrastructure ──────────────────────────────────────┐")
    logger0.info(f" │ World size: {dist.world_size}")
    logger0.info(f" │ Device: {dist.device}")
    logger0.info(f" │ Checkpoint dir: {cfg.training.ckpt_path}")
    logger0.info(f" │ TensorBoard dir: {cfg.training.tensorboard_log_dir}")
    logger0.info(f" │ Save every: {cfg.training.save_chckpoint_freq} epochs")
    logger0.info(f" │ Validate every: {cfg.training.validation_freq} epochs")
    if trainer.epoch_init > 0:
        logger0.info(f" │ Resumed from epoch: {trainer.epoch_init}")
    logger0.info(" └────────────────────────────────────────────────────────┘")

    # ── Per-layer parameter breakdown (compact) ───────────────────────
    logger0.info("")
    logger0.info(" ┌─ Layer Parameter Breakdown ───────────────────────────┐")
    logger0.info(f" │ {'Layer':<40} {'Params':>10} │")
    logger0.info(f" │ {'─' * 40} {'─' * 10} │")
    for name, param in model_raw.named_parameters():
        if param.requires_grad:
            logger0.info(
                f" │ {name:<40} {param.numel():>10,} │"
            )
    logger0.info(" └────────────────────────────────────────────────────────┘")

    logger0.info("")
    logger0.info(f" Total parameters: {total_params:>12,}")
    logger0.info(f" Trainable parameters: {trainable_params:>12,}")
    logger0.info(f" Model size: {total_params * 4 / 1024**2:>11.2f} MB (fp32)")

    logger0.info("")
    logger0.info("=" * 72)
    logger0.info(" STARTING TRAINING")
    logger0.info("=" * 72)
    logger0.info("")
Collaborator

This is a lot of boilerplate printout. You could consider a cleanup with something like tabulate but it's not mandatory.

)


def print_header(title: str, width: int = 80):
Collaborator

There are plenty of libraries to simply work like this, FYI, that can reduce clutter in physicsnemo.

Comment on lines +160 to +169
def _to_tensor(value, dtype=torch.float32) -> torch.Tensor:
    """Safely convert a value to a torch tensor."""
    if isinstance(value, torch.Tensor):
        return value.to(dtype=dtype)
    return torch.as_tensor(value, dtype=dtype)


def _stats_to_device(stats: dict, device: torch.device, dtype=torch.float32) -> dict:
    """Convert stats dict to tensors and move to device."""
    return {k: _to_tensor(v, dtype=dtype).to(device) for k, v in stats.items()}
Collaborator

Duplicated in train.py, no?

@ghasemiAb
Author

Hi @coreyjadams @ram-cherukuri, I have addressed the comments you made. Please take another look and, if everything is okay, approve the PR.

@RishikeshRanade self-requested a review April 7, 2026 17:56
Collaborator

Are we allowed to publish this result?

Collaborator

We will share the PR with Arvind so they can call it out if they are not ok with sharing. May be now is a good time to share the link to the readme to make sure they are ok with everything.

- feature_stats: {"feature_mean": [1], "feature_std": [1]}
"""

NUM_FEATURES = 1 # Scalar field (epoxy_vof)
Collaborator

Should this be part of config?

Collaborator

Please change the name of the example to underfill_dispensing
