The user reported that GPU detection was only finding 1 GPU instead of the expected 8 GPUs on tensor01.dartmouth.edu. After investigation, the root cause was identified and fixed.
The issue was in the tensor01 configuration file (tensor01_config.yml) at line 26:
environment_variables:
OMP_NUM_THREADS: "1"
CUDA_VISIBLE_DEVICES: "0" # ❌ This was limiting detection to GPU 0 onlyThe CUDA_VISIBLE_DEVICES: "0" setting was explicitly restricting CUDA to only see GPU 0, which caused PyTorch to report only 1 GPU instead of all 8 available GPUs.
Before:
environment_variables:
OMP_NUM_THREADS: "1"
CUDA_VISIBLE_DEVICES: "0"After:
environment_variables:
OMP_NUM_THREADS: "1"
# CUDA_VISIBLE_DEVICES: "0" # Commented out to allow detection of all 8 GPUsThe fix removes the CUDA_VISIBLE_DEVICES restriction, allowing PyTorch to detect all available GPUs on tensor01. The configuration now allows dynamic GPU detection as requested by the user:
"we should detect how many GPUs are available; don't hard code the number of GPUs"
Additionally implemented the requested test skipping functionality:
- Added
is_dartmouth_network()function to detect VPN/on-campus access - Automatically skips tensor01/ndoli tests when not on Dartmouth network
- Prevents GitHub Actions failures while preserving local test functionality
def is_dartmouth_network():
"""Check if we're on Dartmouth network (on campus or VPN)."""
try:
# Check hostname for .dartmouth.edu
hostname = socket.getfqdn()
if '.dartmouth.edu' in hostname:
return True
# Try to resolve tensor01.dartmouth.edu
socket.gethostbyname('tensor01.dartmouth.edu')
return True
except:
return Falsedef pytest_collection_modifyitems(config, items):
"""Automatically skip Dartmouth network tests when not on network."""
if not is_dartmouth_network():
skip_dartmouth = pytest.mark.skip(
reason="Dartmouth network tests skipped (requires VPN or on-campus access)"
)
for item in items:
if "dartmouth_network" in item.keywords:
item.add_marker(skip_dartmouth)
# Also skip specific tensor01 and ndoli tests by name
if any(keyword in item.name.lower() for keyword in ["tensor01", "ndoli"]):
item.add_marker(skip_dartmouth)✅ Fixed: GPU detection configuration corrected to allow all 8 GPUs to be detected
✅ Implemented: Automatic test skipping for non-Dartmouth networks
✅ Verified: Network detection working correctly on Dartmouth VPN
- GPU Detection Verification: Re-run GPU detection tests to confirm all 8 GPUs are now detected
- VENV2 Debugging: Investigate the "result_raw.pkl not found - VENV2 execution may have failed" error
- Performance Testing: Once VENV2 issues are resolved, test actual GPU parallelization performance
tensor01_config.yml- Removed CUDA_VISIBLE_DEVICES restrictiontests/real_world/conftest.py- Added Dartmouth network detection and automatic test skippingGPU_DETECTION_FIX.md- This documentation file
The fix addresses the user's core requirement to detect all available GPUs dynamically rather than hard-coding GPU counts, while also implementing the requested CI/CD compatibility through automatic test skipping.