
{2025.06}[2024a] PyTorch 2.9.1 #1389

Draft
bedroge wants to merge 5 commits into EESSI:main from bedroge:pytorch291


@bedroge
Collaborator

bedroge commented Feb 16, 2026

No description provided.

bedroge added the 2025.06-software.eessi.io label (2025.06 version of software.eessi.io) Feb 16, 2026
@bedroge
Collaborator Author

bedroge commented Feb 16, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4

@eessi-bot-aws-eu-south

eessi-bot-aws-eu-south bot commented Feb 16, 2026

New job on instance eessi-bot-mc-aws-eu-south for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/15

date job status comment
Feb 16 16:09:44 UTC 2026 submitted job id 15 awaits release by job manager
Feb 16 16:10:36 UTC 2026 released job awaits launch by Slurm scheduler
Feb 16 16:11:40 UTC 2026 running job 15 is running
Feb 16 16:12:41 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-15.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-17712582990.tar.zst
size: 0 MiB (22 bytes)
entries: 0
modules under 2025.06/software/linux/x86_64/amd/zen4/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen4/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen4
no other files in tarball
Feb 16 16:12:41 UTC 2026 test result
😢 FAILURE (click triangle for details)
Reason
EESSI test suite was not run, test step itself failed to execute.
Details
✅ job output file slurm-15.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
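For reference, a minimal sketch of how the check list in the comment above could be reproduced locally on a Slurm output file. The patterns are copied from the comment; how the bot itself evaluates them is an assumption here, and `check_job_output` and the sample text are made up for illustration.

```python
import re

# Patterns copied from the bot comment. Treating them as regular expressions
# and splitting them into "must be absent" / "must be present" is an assumption.
MUST_BE_ABSENT = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]
MUST_BE_PRESENT = ["No missing installations", r"\.tar\..* created!"]


def check_job_output(text):
    """Return (passed, results); results maps each pattern to True if its check is satisfied."""
    results = {}
    for pattern in MUST_BE_ABSENT:
        results[pattern] = re.search(pattern, text) is None
    for pattern in MUST_BE_PRESENT:
        results[pattern] = re.search(pattern, text) is not None
    return all(results.values()), results


# Hypothetical job output: an error was logged, but a tarball was still created.
sample = "ERROR: build failed\n/tmp/eessi-example.tar.zst created!\n"
ok, details = check_job_output(sample)
print("overall:", "SUCCESS" if ok else "FAILURE")
for pattern, satisfied in details.items():
    print("ok  " if satisfied else "FAIL", pattern)
```

With this sample the `ERROR:` and `No missing installations` checks fail while the others pass, which is the same combination reported for job 15.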

@eessi-bot-aws

eessi-bot-aws bot commented Feb 16, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/131514

date job status comment
Feb 16 16:09:44 UTC 2026 submitted job id 131514 awaits release by job manager
Feb 16 16:09:52 UTC 2026 released job awaits launch by Slurm scheduler
Feb 16 16:10:54 UTC 2026 running job 131514 is running
Feb 16 16:11:56 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-131514.out
✅ no message matching FATAL:
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-17712582190.tar.zst
size: 0 MiB (22 bytes)
entries: 0
modules under 2025.06/software/linux/x86_64/amd/zen4/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen4/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen4
no other files in tarball
Feb 16 16:11:56 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen4+default
P: latency: 1.45 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen4+default
P: latency: 3.55 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen4+default
P: latency: 0.15 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen4+default
P: bandwidth: 14495.07 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-131514.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
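To compare the OSU numbers across architectures it can help to pull them out of the ReFrame summary programmatically. A rough sketch, assuming the summary has exactly the line shapes shown in these comments; `parse_reframe_summary` is a hypothetical helper, not part of ReFrame or the bot.

```python
import re

# Result lines look like "[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=... ".
RESULT_LINE = re.compile(
    r"^\[ (?P<status>\w+) \] \(\s*\d+/\d+\) (?P<test>\S+) "
    r"%benchmark_info=(?P<benchmark>\S+)"
)
# Performance lines look like "P: latency: 1.45 us (r:0, l:None, u:None)".
PERF_LINE = re.compile(r"^P: (?P<metric>\w+): (?P<value>[0-9.]+) (?P<unit>\S+)")


def parse_reframe_summary(text):
    """Return one dict per test case, with the performance value attached when present."""
    rows = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = RESULT_LINE.match(line)
        if match:
            current = match.groupdict()
            current.update(metric=None, value=None, unit=None)
            rows.append(current)
            continue
        match = PERF_LINE.match(line)
        if match and current is not None:
            current["metric"] = match["metric"]
            current["value"] = float(match["value"])
            current["unit"] = match["unit"]
    return rows


summary = """\
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen4+default
P: latency: 1.45 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen4+default
P: latency: 3.55 us (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
"""
rows = parse_reframe_summary(summary)
for row in rows:
    print(row["status"], row["benchmark"], row["metric"], row["value"], row["unit"])
```

Skipped cases (the `[ SKIP ]` lines in the Grace job) come back with `metric`, `value` and `unit` left as `None`, since no `P:` line follows them.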

@bedroge
Collaborator Author

bedroge commented Feb 16, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4

@eessi-bot-aws

eessi-bot-aws bot commented Feb 16, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/131515

date job status comment
Feb 16 16:36:49 UTC 2026 submitted job id 131515 awaits release by job manager
Feb 16 16:37:00 UTC 2026 released job awaits launch by Slurm scheduler
Feb 16 16:38:03 UTC 2026 running job 131515 is running
Feb 17 16:38:16 UTC 2026 finished
🤷 UNKNOWN (click triangle for detailed information)
  • Job results file _bot_job131515.result does not exist in job directory or reading it failed.
  • No artefacts were found/reported.
Feb 17 16:38:16 UTC 2026 test result
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job131515.test does not exist in job directory or reading it failed.

@bedroge
Collaborator Author

bedroge commented Feb 17, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen4

@eessi-bot-aws

eessi-bot-aws bot commented Feb 17, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen4
Building for: x86_64/amd/zen4
Job dir: /project/def-users/SHARED/jobs/2026.02/pr_1389/131832

date job status comment
Feb 17 18:45:52 UTC 2026 submitted job id 131832 awaits release by job manager
Feb 17 18:46:22 UTC 2026 released job awaits launch by Slurm scheduler
Feb 17 18:52:25 UTC 2026 running job 131832 is running
Feb 19 05:59:11 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-131832.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-17714803560.tar.zst
size: 5 MiB (5327720 bytes)
entries: 1120
modules under 2025.06/software/linux/x86_64/amd/zen4/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/x86_64/amd/zen4/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260217_185237UTC
tlparse/0.4.0-GCCcore-13.3.0/20260217_185339UTC
other under 2025.06/software/linux/x86_64/amd/zen4
no other files in tarball
Feb 19 05:59:12 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen4+default
P: latency: 1.4 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen4+default
P: latency: 3.18 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen4+default
P: latency: 0.18 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen4+default
P: bandwidth: 14180.38 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-131832.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
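The status table only lists wall-clock timestamps, so the actual run time has to be worked out by hand. A small sketch, assuming the timestamp format used in these bot comments; `parse_bot_timestamp` is a hypothetical helper, and the month table avoids relying on the locale.

```python
from datetime import datetime, timezone

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}


def parse_bot_timestamp(stamp):
    """Parse a timestamp such as 'Feb 17 18:52:25 UTC 2026' into an aware datetime."""
    month, day, clock, zone, year = stamp.split()
    if zone != "UTC":
        raise ValueError("only UTC timestamps are handled in this sketch")
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day),
                    hour, minute, second, tzinfo=timezone.utc)


# "running" and "finished" rows of job 131832
started = parse_bot_timestamp("Feb 17 18:52:25 UTC 2026")
finished = parse_bot_timestamp("Feb 19 05:59:11 UTC 2026")
print(finished - started)  # 1 day, 11:06:46
```

So job 131832 ran for a bit over 35 hours before reporting the failure.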

@bedroge
Collaborator Author

bedroge commented Feb 19, 2026

WARNING: 143 test failures, 0 test errors (out of 262630):
        distributed/test_c10d_functional_native (1 failed, 2 passed, 29 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
        inductor/test_aot_inductor_arrayref (107 failed, 11 passed, 169 skipped, 0 errors)
        inductor/test_compile_subprocess (6 failed, 756 passed, 90 skipped, 0 errors)
        inductor/test_cpu_select_algorithm (1 failed, 89 passed, 1620 skipped, 0 errors)
        inductor/test_minifier (3 failed, 5 passed, 6 skipped, 0 errors)
        inductor/test_provenance_tracing (1 failed, 4 passed, 6 skipped, 0 errors)
        inductor/test_torchbind (5 failed, 10 passed, 1 skipped, 0 errors)
        inductor/test_torchinductor (6 failed, 820 passed, 87 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (6 failed, 611 passed, 231 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (6 failed, 756 passed, 149 skipped, 0 errors)
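As a sanity check on the summary above, the per-suite counts can be parsed and added up. A short sketch, assuming every suite line has the `(N failed, N passed, N skipped, N errors)` shape shown here; `parse_suites` is a hypothetical helper.

```python
import re

SUITE_LINE = re.compile(
    r"^\s*(?P<suite>\S+) \((?P<failed>\d+) failed, (?P<passed>\d+) passed, "
    r"(?P<skipped>\d+) skipped, (?P<errors>\d+) errors\)\s*$"
)


def parse_suites(text):
    """Map each test suite name to its failed/passed/skipped/errors counts."""
    suites = {}
    for line in text.splitlines():
        match = SUITE_LINE.match(line)
        if match:
            suites[match["suite"]] = {
                key: int(match[key])
                for key in ("failed", "passed", "skipped", "errors")
            }
    return suites


summary = """\
        distributed/test_c10d_functional_native (1 failed, 2 passed, 29 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
        inductor/test_aot_inductor_arrayref (107 failed, 11 passed, 169 skipped, 0 errors)
        inductor/test_compile_subprocess (6 failed, 756 passed, 90 skipped, 0 errors)
        inductor/test_cpu_select_algorithm (1 failed, 89 passed, 1620 skipped, 0 errors)
        inductor/test_minifier (3 failed, 5 passed, 6 skipped, 0 errors)
        inductor/test_provenance_tracing (1 failed, 4 passed, 6 skipped, 0 errors)
        inductor/test_torchbind (5 failed, 10 passed, 1 skipped, 0 errors)
        inductor/test_torchinductor (6 failed, 820 passed, 87 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (6 failed, 611 passed, 231 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (6 failed, 756 passed, 149 skipped, 0 errors)
"""
suites = parse_suites(summary)
total_failed = sum(counts["failed"] for counts in suites.values())
worst = max(suites, key=lambda name: suites[name]["failed"])
print(total_failed, worst)  # 143 inductor/test_aot_inductor_arrayref
```

The per-suite failures add up to the 143 in the warning line, and 107 of them come from `inductor/test_aot_inductor_arrayref` alone.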

The errors are quite similar to the ones observed in #1314; many of them look like this:

E       RuntimeError: Error in dlopen: /tmp/R8KYZy/cargdsatqw56h7ghmssrcrbgbyjsjff7hdhzslb7qz3dsz3pbati.wrapper/data/aotinductor/model/cargdsatqw56h7ghmssrcrbgbyjsjff7hdhzslb7qz3dsz3pbati.wrapper.so: cannot enable executable stack as shared object requires: Invalid argument
$ grep "cannot enable executable stack" /project/def-users/SHARED/build-logs/jobs/131832/easybuild-fsgm_e4f.log | wc -l
1716
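The dlopen message points at the stack permissions recorded in the generated `.so`. A rough sketch of how one could inspect a shared object for this, assuming a 64-bit little-endian ELF file and reading the `PT_GNU_STACK` program header directly. `requests_executable_stack` is a hypothetical helper; this is a diagnostic aid, not the fix that went into the hooks file.

```python
import struct
from typing import Optional

PT_GNU_STACK = 0x6474E551  # program header type that records stack permissions
PF_X = 0x1                 # "executable" bit in p_flags


def requests_executable_stack(elf_bytes: bytes) -> Optional[bool]:
    """True/False if PT_GNU_STACK is present (with/without PF_X), None if it is absent.

    Only 64-bit little-endian ELF is handled. An absent header is generally
    treated by the loader as the architecture default, which may be executable.
    """
    if elf_bytes[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if elf_bytes[4] != 2 or elf_bytes[5] != 1:
        raise ValueError("only 64-bit little-endian ELF is handled in this sketch")
    (e_phoff,) = struct.unpack_from("<Q", elf_bytes, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", elf_bytes, 0x36)
    for index in range(e_phnum):
        offset = e_phoff + index * e_phentsize
        p_type, p_flags = struct.unpack_from("<II", elf_bytes, offset)
        if p_type == PT_GNU_STACK:
            return bool(p_flags & PF_X)
    return None


def make_minimal_elf(stack_flags: int) -> bytes:
    """Build a header-only ELF image with a single PT_GNU_STACK entry (for demonstration)."""
    ident = b"\x7fELF\x02\x01\x01" + b"\0" * 9
    header = struct.pack("<16sHHIQQQIHHHHHH", ident, 3, 62, 1,
                         0, 64, 0, 0, 64, 56, 1, 0, 0, 0)
    phdr = struct.pack("<IIQQQQQQ", PT_GNU_STACK, stack_flags, 0, 0, 0, 0, 0, 16)
    return header + phdr


print(requests_executable_stack(make_minimal_elf(0x7)))  # RWX stack requested
print(requests_executable_stack(make_minimal_elf(0x6)))  # RW stack only
```

On a real file one would pass the contents read with `open(path, "rb").read()`; the synthetic images here only exist so the sketch runs without a compiled library at hand.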

@bedroge
Collaborator Author

bedroge commented Mar 11, 2026

An updated hooks file with a fix for PyTorch has been ingested (EESSI/software-layer-scripts#172), so let's try again.

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-aws-eu-south for:arch=x86_64/amd/zen5
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/intel/skylake_avx512
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1
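The build commands in this thread all follow the same `bot: build key:value ...` shape. A minimal sketch of splitting such a line into its parts, based only on the commands as they appear here; `parse_bot_command` is a hypothetical helper and not the bot's own parser.

```python
def parse_bot_command(line):
    """Split 'bot: <action> key:value ...' into the action and a dict of arguments."""
    prefix, _, rest = line.partition(":")
    if prefix.strip() != "bot":
        raise ValueError("not a bot command")
    action, *arguments = rest.split()
    parsed = {}
    for argument in arguments:
        # Split on the first colon only, so values such as
        # "arch=aarch64/neoverse_v1" stay intact.
        key, separator, value = argument.partition(":")
        if not separator:
            raise ValueError(f"malformed argument: {argument!r}")
        parsed[key] = value
    return action, parsed


action, arguments = parse_bot_command(
    "bot: build repo:eessi.io-2025.06-software "
    "instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1"
)
print(action)
print(arguments)
```

For the command above this gives the action `build` with `repo`, `instance` and `for` as keys, where `for` holds `arch=aarch64/neoverse_v1`.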

@eessi-bot-aws-eu-south

eessi-bot-aws-eu-south bot commented Mar 11, 2026

New job on instance eessi-bot-aws-eu-south for repository eessi.io-2025.06-software
Building on: amd-zen5
Building for: x86_64/amd/zen5
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/104

date job status comment
Mar 11 13:00:05 UTC 2026 submitted job id 104 awaits release by job manager
Mar 11 13:00:57 UTC 2026 released job awaits launch by Slurm scheduler
Mar 11 13:02:00 UTC 2026 running job 104 is running
Mar 12 11:02:15 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-104.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen5-17733132100.tar.zst
size: 165 MiB (173155391 bytes)
entries: 22819
modules under 2025.06/software/linux/x86_64/amd/zen5/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/x86_64/amd/zen5/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/x86_64/amd/zen5/reprod
PyTorch/2.9.1-foss-2024a/20260312_105947UTC
setuptools/80.9.0-GCCcore-13.3.0/20260311_130254UTC
tlparse/0.4.0-GCCcore-13.3.0/20260311_130336UTC
other under 2025.06/software/linux/x86_64/amd/zen5
no other files in tarball
Mar 12 11:02:15 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen5+default
P: latency: 1.24 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen5+default
P: latency: 2.84 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen5+default
P: latency: 0.15 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen5+default
P: bandwidth: 46332.3 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-104.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-aws

eessi-bot-aws bot commented Mar 11, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: intel-skylake_avx512
Building for: x86_64/intel/skylake_avx512
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138591

date job status comment
Mar 11 13:00:07 UTC 2026 submitted job id 138591 awaits release by job manager
Mar 11 13:01:13 UTC 2026 released job awaits launch by Slurm scheduler
Mar 11 13:08:21 UTC 2026 running job 138591 is running
Mar 12 05:05:40 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-138591.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-skylake_avx512-17732918120.tar.zst
size: 164 MiB (172308254 bytes)
entries: 22819
modules under 2025.06/software/linux/x86_64/intel/skylake_avx512/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/x86_64/intel/skylake_avx512/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/x86_64/intel/skylake_avx512/reprod
PyTorch/2.9.1-foss-2024a/20260312_050304UTC
setuptools/80.9.0-GCCcore-13.3.0/20260311_130842UTC
tlparse/0.4.0-GCCcore-13.3.0/20260311_131000UTC
other under 2025.06/software/linux/x86_64/intel/skylake_avx512
no other files in tarball
Mar 12 05:05:40 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-skylake+default
P: latency: 1.41 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-skylake+default
P: latency: 1.67 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-skylake+default
P: latency: 0.28 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-skylake+default
P: bandwidth: 10855.78 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138591.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-aws

eessi-bot-aws bot commented Mar 11, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138592

date job status comment
Mar 11 13:00:13 UTC 2026 submitted job id 138592 awaits release by job manager
Mar 11 13:01:11 UTC 2026 released job awaits launch by Slurm scheduler
Mar 11 13:06:16 UTC 2026 running job 138592 is running
Mar 11 13:28:10 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138592.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17732356160.tar.zst
size: 4 MiB (5237091 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260311_130702UTC
tlparse/0.4.0-GCCcore-13.3.0/20260311_130801UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 11 13:28:10 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.59 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.47 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.26 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 21953.88 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138592.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
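The tarball sizes in these reports differ by one MiB between otherwise identical partial builds (4 MiB vs 5 MiB). That is consistent with the MiB figure being the byte count divided by 1024 * 1024 and rounded down, which is an assumption inferred from the numbers in this thread rather than something the bot states.

```python
MIB = 1024 * 1024


def whole_mib(size_bytes):
    """Size in whole MiB, rounded down (assumed to match the bot's reporting)."""
    return size_bytes // MIB


# Byte counts taken from the artefact listings in this thread.
for size_bytes in (22, 5237091, 5243490, 173155391):
    print(size_bytes, "bytes ->", whole_mib(size_bytes), "MiB")
```

5237091 bytes is just under 5 MiB (5242880 bytes), so it is shown as 4 MiB, while 5243490 bytes is just over and is shown as 5 MiB.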

@bedroge
Collaborator Author

bedroge commented Mar 11, 2026

The neoverse_v1 build ran out of memory:

virtual memory exhausted: Cannot allocate memory
ninja: build stopped: subcommand failed.
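A common way to deal with this kind of failure is to cap the number of parallel compile jobs by the memory available on the build node. A rough sketch of that calculation; the core count, memory size and per-job memory figure below are made-up numbers, `capped_parallelism` is a hypothetical helper, and this is not necessarily what ends up in the hooks file.

```python
def capped_parallelism(cores, mem_total_gib, mem_per_job_gib):
    """Largest number of parallel jobs that fits in memory: at least 1, at most `cores`."""
    if mem_per_job_gib <= 0:
        raise ValueError("mem_per_job_gib must be positive")
    by_memory = int(mem_total_gib // mem_per_job_gib)
    return max(1, min(cores, by_memory))


# Hypothetical node: 16 cores, 32 GiB of memory, roughly 4 GiB per compile job.
print(capped_parallelism(cores=16, mem_total_gib=32, mem_per_job_gib=4))  # 8
```

The per-job figure is the uncertain part; it would have to be estimated from the peak memory use of the heaviest compilation units.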

@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

I've modified bot/build.sh for now, so we can easily test changes in the hooks file. I guess we may need more...

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138897

date job status comment
Mar 12 15:51:59 UTC 2026 submitted job id 138897 awaits release by job manager
Mar 12 15:52:11 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 15:53:14 UTC 2026 running job 138897 is running
Mar 12 16:11:43 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138897.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17733318390.tar.zst
size: 4 MiB (5236098 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_155255UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_155350UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 12 16:11:43 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.75 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.35 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.26 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 28395.84 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138897.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

bedroge marked this pull request as draft March 12, 2026 15:52
@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc bot commented Mar 12, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.03/pr_1389/14557869

date job status comment
Mar 12 15:52:21 UTC 2026 submitted job id 14557869 awaits release by job manager
Mar 12 15:52:59 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 15:54:04 UTC 2026 running job 14557869 is running
Mar 12 23:01:17 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14557869.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17733559290.tar.gz
size: 5 MiB (6220484 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_155533UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_155902UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
no other files in tarball
Mar 12 23:01:17 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /6d7a17a9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 2/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /e9b09ad8 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 3/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /a102bba0 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 4/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /d58e51e9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ OK ] ( 5/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 2.51 us (r:0, l:None, u:None)
[ OK ] ( 6/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /0c56f933 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 3.42 us (r:0, l:None, u:None)
[ OK ] ( 7/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 6.14 us (r:0, l:None, u:None)
[ OK ] ( 8/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /ca426177 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 11.14 us (r:0, l:None, u:None)
[ OK ] ( 9/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (10/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /af5b485c @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.31 us (r:0, l:None, u:None)
[ OK ] (11/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18411.29 MB/s (r:0, l:None, u:None)
[ OK ] (12/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /ebc0c2c2 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18768.48 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 8/12 test case(s) from 12 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-14557869.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/generic
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_n1

@eessi-bot-aws

eessi-bot-aws bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: generic
Building for: aarch64/generic
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138898

date job status comment
Mar 12 19:20:11 UTC 2026 submitted job id 138898 awaits release by job manager
Mar 12 19:21:02 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 19:26:07 UTC 2026 running job 138898 is running
Mar 12 19:57:25 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138898.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-generic-17733453230.tar.zst
size: 4 MiB (5223731 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/generic/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/generic/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/generic/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_192649UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_192804UTC
other under 2025.06/software/linux/aarch64/generic
no other files in tarball
Mar 12 19:57:25 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-generic+default
P: latency: 1.97 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-generic+default
P: latency: 5.49 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-generic+default
P: latency: 0.29 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-generic+default
P: bandwidth: 15264.28 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138898.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-aws

eessi-bot-aws bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_n1
Building for: aarch64/neoverse_n1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138899

date job status comment
Mar 12 19:20:17 UTC 2026 submitted job id 138899 awaits release by job manager
Mar 12 19:21:04 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 19:26:09 UTC 2026 running job 138899 is running
Mar 12 19:56:23 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138899.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_n1-17733452760.tar.zst
size: 5 MiB (5243490 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_n1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_n1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_n1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_192646UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_192759UTC
other under 2025.06/software/linux/aarch64/neoverse_n1
no other files in tarball
Mar 12 19:56:23 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 1.98 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 6.3 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 0.28 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_n1+default
P: bandwidth: 16378.5 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138899.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 12, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws bot commented Mar 12, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/138900

date job status comment
Mar 12 22:13:03 UTC 2026 submitted job id 138900 awaits release by job manager
Mar 12 22:13:39 UTC 2026 released job awaits launch by Slurm scheduler
Mar 12 22:18:42 UTC 2026 running job 138900 is running
Mar 13 08:07:11 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-138900.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17733891460.tar.zst
size: 4 MiB (5236034 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260312_221833UTC
tlparse/0.4.0-GCCcore-13.3.0/20260312_221927UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 13 08:07:11 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.59 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.41 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 26832.18 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-138900.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 13, 2026

No more memory issues for the neoverse v1 build, but too many failing tests:

WARNING: 73 test failures, 0 test errors (out of 261937):
Failed tests (suites/files):
        dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
        inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
        inductor/test_cpu_repro (5 failed, 210 passed, 526 skipped, 0 errors)
        inductor/test_cpu_select_algorithm (58 failed, 31 passed, 1621 skipped, 0 errors)
        inductor/test_fused_attention (2 failed, 45 passed, 1 skipped, 0 errors)
        test_decomp (2 failed, 8280 passed, 738 skipped, 0 errors)
        test_linalg (3 failed, 1124 passed, 118 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)

@Flamefire do you perhaps have any clue why these are failing on Neoverse V1? (could send the full log to you if that's useful)

@bedroge
Collaborator Author

bedroge commented Mar 13, 2026

Meanwhile, let's also check how it goes on generic and n1:

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/generic
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_n1

@eessi-bot-aws

eessi-bot-aws bot commented Mar 13, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: generic
Building for: aarch64/generic
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/139201

date job status comment
Mar 13 09:00:09 UTC 2026 submitted job id 139201 awaits release by job manager
Mar 13 09:00:28 UTC 2026 released job awaits launch by Slurm scheduler
Mar 13 09:10:43 UTC 2026 running job 139201 is running
Mar 13 22:28:24 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-139201.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-generic-17734408060.tar.zst
size: 4 MiB (5223188 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/generic/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/generic/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/generic/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260313_091044UTC
tlparse/0.4.0-GCCcore-13.3.0/20260313_091155UTC
other under 2025.06/software/linux/aarch64/generic
no other files in tarball
Mar 13 22:28:24 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-generic+default
P: latency: 1.94 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-generic+default
P: latency: 6.14 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-generic+default
P: latency: 0.28 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-generic+default
P: bandwidth: 15847.75 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-139201.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-aws

eessi-bot-aws bot commented Mar 13, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_n1
Building for: aarch64/neoverse_n1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/139202

date job status comment
Mar 13 09:00:15 UTC 2026 submitted job id 139202 awaits release by job manager
Mar 13 09:00:32 UTC 2026 released job awaits launch by Slurm scheduler
Mar 13 09:10:47 UTC 2026 running job 139202 is running
Mar 13 22:24:14 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-139202.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_n1-17734405200.tar.zst
size: 5 MiB (5243202 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/neoverse_n1/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_n1/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_n1/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260313_091044UTC
tlparse/0.4.0-GCCcore-13.3.0/20260313_091156UTC
other under 2025.06/software/linux/aarch64/neoverse_n1
no other files in tarball
Mar 13 22:24:14 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 1.93 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 5.37 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_n1+default
P: latency: 0.28 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_n1+default
P: bandwidth: 15766.46 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-139202.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@Flamefire
Contributor

@Flamefire do you perhaps have any clue why these are failing on Neoverse V1? (could send the full log to you if that's useful)

@bedroge Yes, tar it up and I'll take a look.

    inductor/test_cpu_repro (5 failed, 210 passed, 526 skipped, 0 errors)
   inductor/test_cpu_select_algorithm (58 failed, 31 passed, 1621 skipped, 0 errors)

Those were tricky. Maybe I'll recognize the issue from something I've seen before.

@bedroge
Collaborator Author

bedroge commented Mar 13, 2026

@bedroge Yes, tar it up and I'll take a look.

Thanks a lot! I'm attaching the log.

easybuild-0mdjours.log.gz

@Flamefire
Contributor

Ok thanks, I did check what's going on:

dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)

Known, can be ignored

inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
test_decomp (2 failed, 8280 passed, 738 skipped, 0 errors)

Small tolerance issue

inductor/test_cpu_repro (5 failed, 210 passed, 526 skipped, 0 errors)

Known issue on ARM; 3 of the 5 failures are now skipped/xfailed upstream.

inductor/test_cpu_select_algorithm (58 failed, 31 passed, 1621 skipped, 0 errors)

Fails when MKLDNN is missing.

inductor/test_fused_attention (2 failed, 45 passed, 1 skipped, 0 errors)

Internal compiler error

test_linalg (3 failed, 1124 passed, 118 skipped, 0 errors)

Caused by OpenBLAS; fixed since OpenBLAS was upgraded to 0.3.30.
See pytorch/pytorch#142131

test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)

Weird timing issue.

I can open a PR for EasyBuild to skip the affected tests in inductor/test_cpu_repro, inductor/test_cpu_select_algorithm and test_linalg.

That brings the 73 failures down to 7, which would work.
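
As a quick check of that arithmetic, here is a minimal sketch that parses the per-suite summary quoted earlier in this thread and subtracts the three suites proposed for skipping. It assumes that every failing test in those suites ends up skipped. The names `failures_per_suite` and `SUITES_TO_SKIP` are made up for this illustration and are not part of EasyBuild or PyTorch.

```python
import re

# Per-suite summary as reported for the Neoverse V1 build in this thread.
SUMMARY = """\
dynamo/test_error_messages (1 failed, 40 passed, 0 skipped, 0 errors)
inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
inductor/test_cpu_repro (5 failed, 210 passed, 526 skipped, 0 errors)
inductor/test_cpu_select_algorithm (58 failed, 31 passed, 1621 skipped, 0 errors)
inductor/test_fused_attention (2 failed, 45 passed, 1 skipped, 0 errors)
test_decomp (2 failed, 8280 passed, 738 skipped, 0 errors)
test_linalg (3 failed, 1124 passed, 118 skipped, 0 errors)
test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)
"""

# Matches "<suite> (<N> failed, ..." and captures the suite name and N.
LINE_RE = re.compile(r"^\s*(\S+) \((\d+) failed,")

# Assumption: all failing tests in these suites get skipped by the patches.
SUITES_TO_SKIP = {
    "inductor/test_cpu_repro",
    "inductor/test_cpu_select_algorithm",
    "test_linalg",
}


def failures_per_suite(summary):
    """Return a dict mapping suite name to its number of failed tests."""
    counts = {}
    for line in summary.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


counts = failures_per_suite(SUMMARY)
total = sum(counts.values())
remaining = sum(n for suite, n in counts.items() if suite not in SUITES_TO_SKIP)
print(f"total failures: {total}, remaining after skips: {remaining}")
```

The same function can be pointed at the summary of any other build in this thread to see how far a given set of skips would bring the count down.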

@Flamefire
Contributor

Added patches to my still open PR that fixes some other issues in that easyconfig: easybuilders/easybuild-easyconfigs#25492

Test report coming up, I hope all is still green

@ocaisa
Member

ocaisa commented Mar 14, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=aarch64/neoverse_v1

@eessi-bot-aws

eessi-bot-aws bot commented Mar 14, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: neoverse_v1
Building for: aarch64/neoverse_v1
Job dir: /project/def-users/SHARED/jobs/2026.03/pr_1389/139504

date job status comment
Mar 14 07:25:04 UTC 2026 submitted job id 139504 awaits release by job manager
Mar 14 07:25:11 UTC 2026 released job awaits launch by Slurm scheduler
Mar 14 07:31:13 UTC 2026 running job 139504 is running
Mar 14 16:34:41 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-139504.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-neoverse_v1-17735059990.tar.zst
size: 134 MiB (140528916 bytes)
entries: 22902
modules under 2025.06/software/linux/aarch64/neoverse_v1/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/neoverse_v1/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/neoverse_v1/reprod
PyTorch/2.9.1-foss-2024a/20260314_163250UTC
setuptools/80.9.0-GCCcore-13.3.0/20260314_073106UTC
tlparse/0.4.0-GCCcore-13.3.0/20260314_073202UTC
other under 2025.06/software/linux/aarch64/neoverse_v1
no other files in tarball
Mar 14 16:34:41 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 1.63 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 5.39 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-neoverse_v1+default
P: latency: 0.26 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-neoverse_v1+default
P: bandwidth: 29034.94 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-139504.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member

ocaisa commented Mar 14, 2026

@bedroge Looks like we have a winner

@bedroge
Collaborator Author

bedroge commented Mar 14, 2026

@bedroge Looks like we have a winner

Awesome, thanks a lot @Flamefire!

@bedroge
Collaborator Author

bedroge commented Mar 14, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace

@eessi-bot-jsc

eessi-bot-jsc bot commented Mar 14, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace
Building for: aarch64/nvidia/grace
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.03/pr_1389/14562516

date job status comment
Mar 14 18:42:37 UTC 2026 submitted job id 14562516 awaits release by job manager
Mar 14 18:43:35 UTC 2026 released job awaits launch by Slurm scheduler
Mar 14 18:44:39 UTC 2026 running job 14562516 is running
Mar 15 01:27:47 UTC 2026 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-14562516.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-17735377580.tar.gz
size: 158 MiB (165988645 bytes)
entries: 22902
modules under 2025.06/software/linux/aarch64/nvidia/grace/modules/all
PyTorch/2.9.1-foss-2024a.lua
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/software
PyTorch/2.9.1-foss-2024a
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/reprod
PyTorch/2.9.1-foss-2024a/20260315_012058UTC
setuptools/80.9.0-GCCcore-13.3.0/20260314_184556UTC
tlparse/0.4.0-GCCcore-13.3.0/20260314_184956UTC
other under 2025.06/software/linux/aarch64/nvidia/grace
no other files in tarball
Mar 15 01:27:47 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] ( 1/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /6d7a17a9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 2/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=gpu /e9b09ad8 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 3/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /a102bba0 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ SKIP ] ( 4/12) EESSI_OSU_pt2pt_GPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /d58e51e9 @BotBuildTests:aarch64-nvidia-grace+default [Skipping GPU test : only 1 GPU available for this test case]
[ OK ] ( 5/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 2.54 us (r:0, l:None, u:None)
[ OK ] ( 6/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /0c56f933 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 3.43 us (r:0, l:None, u:None)
[ OK ] ( 7/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 6.09 us (r:0, l:None, u:None)
[ OK ] ( 8/12) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node %device_type=cpu /ca426177 @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 11.1 us (r:0, l:None, u:None)
[ OK ] ( 9/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (10/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /af5b485c @BotBuildTests:aarch64-nvidia-grace+default
P: latency: 0.31 us (r:0, l:None, u:None)
[ OK ] (11/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18598.69 MB/s (r:0, l:None, u:None)
[ OK ] (12/12) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2024a-CUDA-12.6.0 %scale=1_node /ebc0c2c2 @BotBuildTests:aarch64-nvidia-grace+default
P: bandwidth: 18653.33 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 8/12 test case(s) from 12 check(s) (0 failure(s), 4 skipped, 0 aborted)
Details
✅ job output file slurm-14562516.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Mar 14, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.03/pr_1389/1034652

date job status comment
Mar 14 18:42:38 UTC 2026 submitted job id 1034652 awaits release by job manager
Mar 14 18:43:32 UTC 2026 released job awaits launch by Slurm scheduler
Mar 14 18:44:35 UTC 2026 running job 1034652 is running
Mar 14 20:23:50 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-1034652.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17735195040.tar.zst
size: 5 MiB (5466852 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/a64fx/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260314_184846UTC
tlparse/0.4.0-GCCcore-13.3.0/20260314_185235UTC
other under 2025.06/software/linux/aarch64/a64fx
no other files in tarball
Mar 14 20:23:50 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ SKIP ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:a64fx+default
P: latency: 0.89 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:a64fx+default
P: bandwidth: 7784.23 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
Details
✅ job output file slurm-1034652.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator Author

bedroge commented Mar 15, 2026

The a64fx build also ran out of memory, trying again with an updated hooks file...

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-deucalion for:arch=aarch64/a64fx

@eessi-bot-deucalion

eessi-bot-deucalion bot commented Mar 15, 2026

New job on instance eessi-bot-deucalion for repository eessi.io-2025.06-software
Building on: a64fx
Building for: aarch64/a64fx
Job dir: /home/eessibot/new-bot/jobs/2026.03/pr_1389/1034945

date job status comment
Mar 15 07:46:10 UTC 2026 submitted job id 1034945 awaits release by job manager
Mar 15 07:46:57 UTC 2026 released job awaits launch by Slurm scheduler
Mar 15 07:48:00 UTC 2026 running job 1034945 is running
Mar 17 04:03:55 UTC 2026 finished
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-1034945.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-a64fx-17737198570.tar.zst
size: 5 MiB (5462507 bytes)
entries: 1120
modules under 2025.06/software/linux/aarch64/a64fx/modules/all
setuptools/80.9.0-GCCcore-13.3.0.lua
tlparse/0.4.0-GCCcore-13.3.0.lua
software under 2025.06/software/linux/aarch64/a64fx/software
setuptools/80.9.0-GCCcore-13.3.0
tlparse/0.4.0-GCCcore-13.3.0
reprod directories under 2025.06/software/linux/aarch64/a64fx/reprod
setuptools/80.9.0-GCCcore-13.3.0/20260315_075210UTC
tlparse/0.4.0-GCCcore-13.3.0/20260315_075550UTC
other under 2025.06/software/linux/aarch64/a64fx
no other files in tarball
Mar 17 04:03:55 UTC 2026 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ SKIP ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ SKIP ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:a64fx+default [Skipping test: nodes in this partition only have 30720 MiB memory available (per node) according to the current ReFrame configuration, but 49152 MiB is needed]
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:a64fx+default
P: latency: 0.88 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:a64fx+default
P: bandwidth: 8049.99 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 2/4 test case(s) from 4 check(s) (0 failure(s), 2 skipped, 0 aborted)
Details
✅ job output file slurm-1034945.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa
Member

ocaisa commented Mar 16, 2026

@bedroge I suspect part of the memory problem is related to easybuilders/easybuild-easyblocks#4096 and we also almost certainly want easybuilders/easybuild-easyconfigs#21309 for (some?) ARM CPUs

@migueldiascosta
Contributor

migueldiascosta commented Mar 16, 2026

@bedroge
Collaborator Author

bedroge commented Mar 17, 2026

The build without ACL (#1389 (comment)) failed because of:

== 2026-03-17 03:55:37,772 build_log.py:454 WARNING 19 test failures, 0 test errors (out of 256500):
Failed tests (suites/files):
        distributed/tensor/test_convolution_ops (1 failed, 0 passed, 2 skipped, 0 errors)
        inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
        inductor/test_compile_subprocess (2 failed, 344 passed, 61 skipped, 0 errors)
        inductor/test_torchinductor (2 failed, 385 passed, 52 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (2 failed, 263 passed, 108 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (2 failed, 347 passed, 120 skipped, 0 errors)
        test_linalg (8 failed, 1116 passed, 121 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)

@Flamefire
Contributor

The build without ACL (#1389 (comment)) failed because of:

== 2026-03-17 03:55:37,772 build_log.py:454 WARNING 19 test failures, 0 test errors (out of 256500):
Failed tests (suites/files):
        distributed/tensor/test_convolution_ops (1 failed, 0 passed, 2 skipped, 0 errors)
        inductor/test_binary_folding (1 failed, 2 passed, 0 skipped, 0 errors)
        inductor/test_compile_subprocess (2 failed, 344 passed, 61 skipped, 0 errors)
        inductor/test_torchinductor (2 failed, 385 passed, 52 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (2 failed, 263 passed, 108 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (2 failed, 347 passed, 120 skipped, 0 errors)
        test_linalg (8 failed, 1116 passed, 121 skipped, 0 errors)
        test_multiprocessing_spawn (1 failed, 27 passed, 3 skipped, 0 errors)

I had ignored the last one because of a weird timing issue: it basically starts n processes serially, each doing a sleep, and asserts that the elapsed time is at least n*sleeptime. That fails with not (7.4>=4*5), which I can't explain. There is a skip for Python >= 3.13.8.
I should have fixed test_binary_folding (negligible accuracy difference).
The failures in test_linalg have increased for some reason; the others are new.
The 2 failing in inductor/test_torchinductor* are likely all the same tests, so the same issue.

I can take a look at the log again or just increase allowed failures to 20
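
To make the timing check described above concrete, here is a rough sketch of that kind of assertion. It is not the PyTorch test itself: the helper `run_sleepers_serially` is hypothetical, it uses plain `subprocess` children rather than `torch.multiprocessing`, and it assumes each child is waited for before the next one starts. The values n=4 and a 5-second sleep come from the failing assertion quoted above; the sketch uses a much shorter sleep so that it runs quickly.

```python
import subprocess
import sys
import time


def run_sleepers_serially(n, sleep_time):
    """Run n child processes one after another; each sleeps for sleep_time seconds.

    Returns the wall-clock time the whole sequence took, as seen by the parent.
    """
    start = time.monotonic()
    for _ in range(n):
        # Each child is a fresh Python interpreter that only sleeps and exits.
        subprocess.run(
            [sys.executable, "-c", f"import time; time.sleep({sleep_time})"],
            check=True,
        )
    return time.monotonic() - start


n, sleep_time = 4, 0.05
elapsed = run_sleepers_serially(n, sleep_time)

# With strictly serial children the elapsed time cannot be below n * sleep_time.
print(f"elapsed={elapsed:.2f}s, lower bound={n * sleep_time:.2f}s")
```

Under these assumptions the lower bound always holds, so an elapsed time of 7.4 s against a bound of 4*5 s suggests that the children in the real test did not run strictly one after another, or did not sleep for the full time. That is only a guess based on the description in this thread.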

@boegel
Contributor

boegel commented Mar 17, 2026

Generally speaking I would say that this is a very impressive result for the test suite on A64FX...

We should probably take a closer look at the test_linalg failures, but the rest doesn't seem to be a blocker, I would say...

@migueldiascosta
Contributor

@ocaisa

Confirmed that easybuilders/easybuild-easyblocks#4096 is enough: on a64fx with ACL as a dependency, training on CIFAR-100 (the benchmark where we originally found that ACL made a big difference) is ~4.75x faster than without the ACL dependency.

@Flamefire
Contributor

on a64fx with ACL as a dependency, training on CIFAR-100 (the benchmark where we originally found that ACL made a big difference) is ~4.75x faster than without the ACL dependency

Then we should add it as an architecture-specific dependency.

Labels

2025.06-software.eessi.io 2025.06 version of software.eessi.io

5 participants