
Stage 4: Scene Understanding using Vision Language Model #3

@Methasit-Pun


Implement Stage 4 of the grasp pipeline: Scene Understanding using a Vision Language Model.

Requirements:

  • Extract object-object spatial relations using a vision-language model.
  • Generate human-readable summaries of the scene.
  • Use spatial relation templates: left_of, right_of, above, below, near, far_from, in_front_of, behind, touching, overlapping.
  • Integrate the following interface:
import numpy as np
from typing import Dict, List

def __init__(self, model_name: str = "placeholder"):
    self.model_name = model_name
    # Spatial relation templates
    self.spatial_relations = [
        "left_of", "right_of", "above", "below", "near", "far_from",
        "in_front_of", "behind", "touching", "overlapping"
    ]

def understand_scene(self, image: np.ndarray, labels: List[str], boxes: List[List[float]]) -> Dict:
    """
    Extract a scene graph of spatial relations rather than a full-text description.

    Args:
        image: RGB image as a NumPy array
        labels: List of object class labels, one per detected object
        boxes: List of bounding boxes [x, y, w, h], one per label

    Returns:
        scene_description: Structured scene understanding with spatial relations
    """
  • The output should be a structured scene graph with spatial relations between detected objects.
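As a rough illustration of what a structured scene graph could look like, the sketch below derives relations from the bounding boxes alone. This is a geometric heuristic, not a vision-language model, and might serve as a baseline or fallback. Everything beyond the relation names is an assumption: the function names `pair_relations` and `build_scene_graph`, the `objects`/`relations` dictionary layout, the distance thresholds, and the reading of `[x, y, w, h]` as top-left corner plus size in image coordinates with y growing downward. `in_front_of` and `behind` need depth information and are left out.

```python
import math
from itertools import permutations
from typing import Dict, List


def pair_relations(a: List[float], b: List[float],
                   near_factor: float = 1.5, far_factor: float = 3.0,
                   tol: float = 1e-6) -> List[str]:
    """Relations that hold for box a relative to box b (2D heuristic only)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    rels = []
    if ax + aw <= bx:
        rels.append("left_of")
    if bx + bw <= ax:
        rels.append("right_of")
    if ay + ah <= by:          # assumed: y grows downward in image space
        rels.append("above")
    if by + bh <= ay:
        rels.append("below")
    # Extent of the intersection along each axis (negative means a gap).
    ix = min(ax + aw, bx + bw) - max(ax, bx)
    iy = min(ay + ah, by + bh) - max(ay, by)
    if ix > tol and iy > tol:
        rels.append("overlapping")
    elif ix >= -tol and iy >= -tol:
        rels.append("touching")  # shared edge or corner, no shared area
    # Centre distance, scaled by the mean side length of both boxes.
    dist = math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))
    scale = (aw + ah + bw + bh) / 4
    if dist <= near_factor * scale:
        rels.append("near")
    elif dist > far_factor * scale:
        rels.append("far_from")
    return rels


def build_scene_graph(labels: List[str], boxes: List[List[float]]) -> Dict:
    """Assemble a hypothetical scene-graph dict from labels and boxes."""
    objects = [{"id": i, "label": lab, "box": box}
               for i, (lab, box) in enumerate(zip(labels, boxes))]
    relations = [
        {"subject": i, "relation": rel, "object": j}
        for i, j in permutations(range(len(boxes)), 2)
        for rel in pair_relations(boxes[i], boxes[j])
    ]
    return {"objects": objects, "relations": relations}


graph = build_scene_graph(["cup", "plate"], [[0, 0, 10, 10], [20, 0, 10, 10]])
```

A VLM-backed `understand_scene` could return the same layout, which would let the two sources of relations be compared or merged.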

This feature supports context-aware grasping by exposing the spatial relationships between objects in the scene.
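The human-readable summary requirement could be met by rendering relation triples through fixed phrases, one per relation template. The sketch below assumes a scene graph stored as a dict with `objects` and `relations` keys; that layout, the `RELATION_PHRASES` table, and the function name `summarize_scene` are illustrative assumptions, since the issue does not fix an output schema.

```python
from typing import Dict

# Assumed phrasing for each relation template listed in the issue.
RELATION_PHRASES = {
    "left_of": "to the left of",
    "right_of": "to the right of",
    "above": "above",
    "below": "below",
    "near": "near",
    "far_from": "far from",
    "in_front_of": "in front of",
    "behind": "behind",
    "touching": "touching",
    "overlapping": "overlapping",
}


def summarize_scene(graph: Dict) -> str:
    """Render each known relation triple as one short English sentence."""
    names = {obj["id"]: obj["label"] for obj in graph["objects"]}
    sentences = []
    for rel in graph["relations"]:
        phrase = RELATION_PHRASES.get(rel["relation"])
        if phrase is None:
            continue  # skip relations outside the template list
        subject = names[rel["subject"]]
        target = names[rel["object"]]
        sentences.append(f"The {subject} is {phrase} the {target}.")
    return " ".join(sentences) if sentences else "No spatial relations detected."


# Hypothetical scene graph with two objects and one relation.
example_graph = {
    "objects": [{"id": 0, "label": "cup"}, {"id": 1, "label": "plate"}],
    "relations": [{"subject": 0, "relation": "left_of", "object": 1}],
}
summary = summarize_scene(example_graph)
```

Keeping the summary as a pure function of the graph means the structured output stays the source of truth, and the text can be regenerated or reworded without querying the model again.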


Labels: enhancement (New feature or request)
