
Running simulations that require multiple nodes #299

@titoiride

Description


I see that the signature for the Evaluator includes the number of processes and/or GPUs, but it seems it is not possible to pass the number of nodes required to launch a given simulation.
Looking at libEnsemble, this appears to be supported; are there any plans to implement it?
My problem is that my scan requires spawning MPI jobs across many nodes. If I pass extra_args="--nodes 2 --ntasks-per-node 16 --cpu-bind=none --exclusive" to a TemplateEvaluator, libEnsemble correctly sees that I'm requesting two nodes; however, it then fails with the following backtrace:

[0]  2026-04-09 12:32:33,173 libensemble.manager (ERROR): ---- Received error message from worker 1 ----
[0]  2026-04-09 12:32:33,173 libensemble.manager (ERROR): Message: libensemble.resources.mpi_resources.MPIResourcesException: Not enough nodes to honor arguments. Requested 2. Only 1 available
[0]  2026-04-09 12:32:33,173 libensemble.manager (ERROR): Traceback (most recent call last):
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/worker.py", line 418, in run
    response = self._handle(Work)
               ^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/worker.py", line 361, in _handle
    calc_out, persis_info, calc_status = self._handle_calc(Work, calc_in)
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/worker.py", line 279, in _handle_calc
    out = calc(calc_in, Work)
          ^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/utils/runners.py", line 54, in run
    out = self._result(calc_in, Work["persis_info"], Work["libE_info"])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/utils/runners.py", line 46, in _result
    return self.f(*args)
           ^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/optimas/sim_functions.py", line 68, in run_template_simulation
    task = Executor.executor.submit(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/executors/mpi_executor.py", line 341, in submit
    mpi_specs = mpi_runner_obj.get_mpi_specs(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/executors/mpi_runner.py", line 338, in get_mpi_specs
    nprocs, nnodes, ppn = mpi_resources.get_resources(resources, nprocs, nnodes, ppn, hyperthreads)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/resources/mpi_resources.py", line 203, in get_resources
    rassert(
  File "/global/cfs/cdirs/m558/terzani/sw/perlmutter/gpu/venvs/hipace-gpu/lib/python3.11/site-packages/libensemble/resources/mpi_resources.py", line 24, in rassert
    raise MPIResourcesException(*args)
libensemble.resources.mpi_resources.MPIResourcesException: Not enough nodes to honor arguments. Requested 2. Only 1 available
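To narrow down whether the allocation is visible at all, one can check what Slurm exports to the process. A minimal diagnostic sketch, assuming salloc sets SLURM_JOB_NUM_NODES (note that libEnsemble's own node detection is a separate mechanism and may read different variables, e.g. the node list):

```python
import os

def slurm_nodes_visible():
    """Return the node count Slurm advertises to this process, or None
    if the variable is not set (assumption: salloc exports
    SLURM_JOB_NUM_NODES in the allocation's environment)."""
    val = os.environ.get("SLURM_JOB_NUM_NODES")
    return int(val) if val is not None else None

print("Slurm reports nodes:", slurm_nodes_visible())
```

If this prints 4 inside the allocation but libEnsemble still reports only one node available, the mismatch is in how the resources are being partitioned among workers rather than in the allocation itself.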

I'm running from an allocation of 4 nodes on Perlmutter requested interactively via salloc, but somehow the script doesn't see that.
I wonder whether hacking --nodes 2 into extra_args is the correct way of doing this. Since the Evaluator already supports the number of processes and GPUs, would it be feasible to implement direct support for the number of nodes?
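For reference, the workaround described above amounts to assembling an srun flag string and handing it to the evaluator. A sketch (the helper name is hypothetical; only extra_args and the flag values come from the report above, and the TemplateEvaluator call is illustrative, not a verified signature):

```python
# Hypothetical helper to assemble the srun flags used in the report.
def make_srun_extra_args(nodes: int, ntasks_per_node: int) -> str:
    """Build the string passed verbatim as extra_args to a TemplateEvaluator."""
    return (
        f"--nodes {nodes} --ntasks-per-node {ntasks_per_node} "
        "--cpu-bind=none --exclusive"
    )

# Illustrative usage, mirroring the report:
# ev = TemplateEvaluator(..., extra_args=make_srun_extra_args(2, 16))
```

Direct support would presumably replace this string-building with an explicit n_nodes-style parameter, analogous to the existing process/GPU counts.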
