GPU Detection Fix for tensor01

Issue Summary

The user reported that GPU detection was only finding 1 GPU instead of the expected 8 GPUs on tensor01.dartmouth.edu. After investigation, the root cause was identified and fixed.

Root Cause

The issue was in the tensor01 configuration file (tensor01_config.yml) at line 26:

environment_variables:
  OMP_NUM_THREADS: "1"
  CUDA_VISIBLE_DEVICES: "0"  # ❌ This was limiting detection to GPU 0 only

The CUDA_VISIBLE_DEVICES: "0" setting was explicitly restricting CUDA to only see GPU 0, which caused PyTorch to report only 1 GPU instead of all 8 available GPUs.

Fix Applied

Before:

environment_variables:
  OMP_NUM_THREADS: "1"
  CUDA_VISIBLE_DEVICES: "0"

After:

environment_variables:
  OMP_NUM_THREADS: "1"
  # CUDA_VISIBLE_DEVICES: "0"  # Commented out to allow detection of all 8 GPUs

Verification

The fix removes the CUDA_VISIBLE_DEVICES restriction, allowing PyTorch to detect all available GPUs on tensor01. The configuration now allows dynamic GPU detection as requested by the user:

"we should detect how many GPUs are available; don't hard code the number of GPUs"

Test Infrastructure Improvements

Additionally implemented the requested test skipping functionality:

Dartmouth Network Detection

Added is_dartmouth_network() function to detect VPN/on-campus access
Automatically skips tensor01/ndoli tests when not on Dartmouth network
Prevents GitHub Actions failures while preserving local test functionality

Configuration

def is_dartmouth_network():
    """Check if we're on Dartmouth network (on campus or VPN)."""
    try:
        # Check hostname for .dartmouth.edu
        hostname = socket.getfqdn()
        if '.dartmouth.edu' in hostname:
            return True
        
        # Try to resolve tensor01.dartmouth.edu
        socket.gethostbyname('tensor01.dartmouth.edu')
        return True
    except:
        return False

Pytest Integration

def pytest_collection_modifyitems(config, items):
    """Automatically skip Dartmouth network tests when not on network."""
    if not is_dartmouth_network():
        skip_dartmouth = pytest.mark.skip(
            reason="Dartmouth network tests skipped (requires VPN or on-campus access)"
        )
        for item in items:
            if "dartmouth_network" in item.keywords:
                item.add_marker(skip_dartmouth)
            # Also skip specific tensor01 and ndoli tests by name
            if any(keyword in item.name.lower() for keyword in ["tensor01", "ndoli"]):
                item.add_marker(skip_dartmouth)

Current Status

✅ Fixed: GPU detection configuration corrected to allow all 8 GPUs to be detected
✅ Implemented: Automatic test skipping for non-Dartmouth networks
✅ Verified: Network detection working correctly on Dartmouth VPN
⚠️ Outstanding: VENV2 execution issues on tensor01 (separate from GPU detection)

Next Steps

GPU Detection Verification: Re-run GPU detection tests to confirm all 8 GPUs are now detected
VENV2 Debugging: Investigate the "result_raw.pkl not found - VENV2 execution may have failed" error
Performance Testing: Once VENV2 issues are resolved, test actual GPU parallelization performance

Related Files Modified

tensor01_config.yml - Removed CUDA_VISIBLE_DEVICES restriction
tests/real_world/conftest.py - Added Dartmouth network detection and automatic test skipping
GPU_DETECTION_FIX.md - This documentation file

The fix addresses the user's core requirement to detect all available GPUs dynamically rather than hard-coding GPU counts, while also implementing the requested CI/CD compatibility through automatic test skipping.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Detection Fix for tensor01

Issue Summary

Root Cause

Fix Applied

Verification

Test Infrastructure Improvements

Dartmouth Network Detection

Configuration

Pytest Integration

Current Status

Next Steps

Related Files Modified

FilesExpand file tree

GPU_DETECTION_FIX.md

Latest commit

History

GPU_DETECTION_FIX.md

File metadata and controls

GPU Detection Fix for tensor01

Issue Summary

Root Cause

Fix Applied

Verification

Test Infrastructure Improvements

Dartmouth Network Detection

Configuration

Pytest Integration

Current Status

Next Steps

Related Files Modified