Skip to content

Latest commit

 

History

History
102 lines (77 loc) · 3.72 KB

File metadata and controls

102 lines (77 loc) · 3.72 KB

GPU Detection Fix for tensor01

Issue Summary

The user reported that GPU detection was only finding 1 GPU instead of the expected 8 GPUs on tensor01.dartmouth.edu. After investigation, the root cause was identified and fixed.

Root Cause

The issue was in the tensor01 configuration file (tensor01_config.yml) at line 26:

environment_variables:
  OMP_NUM_THREADS: "1"
  CUDA_VISIBLE_DEVICES: "0"  # ❌ This was limiting detection to GPU 0 only

The CUDA_VISIBLE_DEVICES: "0" setting was explicitly restricting CUDA to only see GPU 0, which caused PyTorch to report only 1 GPU instead of all 8 available GPUs.

Fix Applied

Before:

environment_variables:
  OMP_NUM_THREADS: "1"
  CUDA_VISIBLE_DEVICES: "0"

After:

environment_variables:
  OMP_NUM_THREADS: "1"
  # CUDA_VISIBLE_DEVICES: "0"  # Commented out to allow detection of all 8 GPUs

Verification

The fix removes the CUDA_VISIBLE_DEVICES restriction, allowing PyTorch to detect all available GPUs on tensor01. The configuration now allows dynamic GPU detection as requested by the user:

"we should detect how many GPUs are available; don't hard code the number of GPUs"

Test Infrastructure Improvements

Additionally implemented the requested test skipping functionality:

Dartmouth Network Detection

  • Added is_dartmouth_network() function to detect VPN/on-campus access
  • Automatically skips tensor01/ndoli tests when not on Dartmouth network
  • Prevents GitHub Actions failures while preserving local test functionality

Configuration

def is_dartmouth_network():
    """Check if we're on Dartmouth network (on campus or VPN)."""
    try:
        # Check hostname for .dartmouth.edu
        hostname = socket.getfqdn()
        if '.dartmouth.edu' in hostname:
            return True
        
        # Try to resolve tensor01.dartmouth.edu
        socket.gethostbyname('tensor01.dartmouth.edu')
        return True
    except:
        return False

Pytest Integration

def pytest_collection_modifyitems(config, items):
    """Automatically skip Dartmouth network tests when not on network."""
    if not is_dartmouth_network():
        skip_dartmouth = pytest.mark.skip(
            reason="Dartmouth network tests skipped (requires VPN or on-campus access)"
        )
        for item in items:
            if "dartmouth_network" in item.keywords:
                item.add_marker(skip_dartmouth)
            # Also skip specific tensor01 and ndoli tests by name
            if any(keyword in item.name.lower() for keyword in ["tensor01", "ndoli"]):
                item.add_marker(skip_dartmouth)

Current Status

Fixed: GPU detection configuration corrected to allow all 8 GPUs to be detected
Implemented: Automatic test skipping for non-Dartmouth networks
Verified: Network detection working correctly on Dartmouth VPN
⚠️ Outstanding: VENV2 execution issues on tensor01 (separate from GPU detection)

Next Steps

  1. GPU Detection Verification: Re-run GPU detection tests to confirm all 8 GPUs are now detected
  2. VENV2 Debugging: Investigate the "result_raw.pkl not found - VENV2 execution may have failed" error
  3. Performance Testing: Once VENV2 issues are resolved, test actual GPU parallelization performance

Related Files Modified

  • tensor01_config.yml - Removed CUDA_VISIBLE_DEVICES restriction
  • tests/real_world/conftest.py - Added Dartmouth network detection and automatic test skipping
  • GPU_DETECTION_FIX.md - This documentation file

The fix addresses the user's core requirement to detect all available GPUs dynamically rather than hard-coding GPU counts, while also implementing the requested CI/CD compatibility through automatic test skipping.