Stage 4: Scene Understanding using Vision Language Model

Implement Stage 4 of the grasp pipeline: Scene Understanding using a Vision Language Model.

Requirements:
- Extract object-object spatial relations using a vision-language model.
- Generate human-readable summaries of the scene.
- Use spatial relation templates: left_of, right_of, above, below, near, far_from, in_front_of, behind, touching, overlapping.
- Integrate the following interface:

```python
from typing import Dict, List

import numpy as np


class SceneUnderstanding:  # class name is a placeholder; the issue does not specify one
    def __init__(self, model_name: str = "placeholder"):
        self.model_name = model_name
        # Spatial relation templates
        self.spatial_relations = [
            "left_of", "right_of", "above", "below", "near", "far_from",
            "in_front_of", "behind", "touching", "overlapping",
        ]

    def understand_scene(self, image: np.ndarray, labels: List[str], boxes: List[List[float]]) -> Dict:
        """
        Extract a scene graph of spatial relations rather than a full text description.

        Args:
            image: RGB image
            labels: List of object class labels
            boxes: List of bounding boxes [x, y, w, h]

        Returns:
            scene_description: Structured scene understanding with spatial relations
        """
        raise NotImplementedError  # to be implemented as part of this issue
```
- The output should be a structured scene graph with spatial relations between detected objects.
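
The issue does not fix a scene graph schema, so the following is only a minimal sketch of one possible shape. It derives relations from bounding-box geometry alone, which could serve as a baseline or a sanity check for the VLM output; it is not the VLM-based extraction itself. The names `pairwise_relations` and `build_scene_graph`, the thresholds `near_frac` and `touch_tol`, the `id`/`label`/`box` and `subject`/`predicate`/`object` keys, and the assumption that `[x, y]` is the top-left corner with y growing downward are all hypothetical. `in_front_of` and `behind` are not produced here because they need depth information that 2D boxes do not carry.

```python
from math import hypot
from typing import Dict, List


def pairwise_relations(a: List[float], b: List[float],
                       near_frac: float = 0.5, touch_tol: float = 1.0) -> List[str]:
    """Relations of box a with respect to box b. Boxes are [x, y, w, h] (top-left origin assumed)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    acx, acy = ax + aw / 2, ay + ah / 2
    bcx, bcy = bx + bw / 2, by + bh / 2
    rels: List[str] = []

    # Ordering by box centers; image y is assumed to grow downward.
    if acx < bcx:
        rels.append("left_of")
    elif acx > bcx:
        rels.append("right_of")
    if acy < bcy:
        rels.append("above")
    elif acy > bcy:
        rels.append("below")

    # Signed gap along each axis; a negative gap means the intervals overlap on that axis.
    gap_x = max(ax, bx) - min(ax + aw, bx + bw)
    gap_y = max(ay, by) - min(ay + ah, by + bh)
    if gap_x < 0 and gap_y < 0:
        rels.append("overlapping")
    elif max(gap_x, gap_y) <= touch_tol:
        rels.append("touching")

    # Edge-to-edge distance compared against the largest box side (hypothetical threshold).
    dist = hypot(max(gap_x, 0.0), max(gap_y, 0.0))
    scale = max(aw, ah, bw, bh)
    rels.append("near" if dist <= near_frac * scale else "far_from")
    return rels


def build_scene_graph(labels: List[str], boxes: List[List[float]]) -> Dict:
    """Build a directed scene graph with one relation entry per ordered object pair and predicate."""
    objects = [{"id": f"{label}_{i}", "label": label, "box": box}
               for i, (label, box) in enumerate(zip(labels, boxes))]
    relations = []
    for src in objects:
        for dst in objects:
            if src is dst:
                continue
            for predicate in pairwise_relations(src["box"], dst["box"]):
                relations.append({"subject": src["id"], "predicate": predicate, "object": dst["id"]})
    return {"objects": objects, "relations": relations}


if __name__ == "__main__":
    graph = build_scene_graph(["cup", "plate"], [[0, 0, 10, 10], [20, 0, 10, 10]])
    for rel in graph["relations"]:
        print(rel)
```

Using per-object ids such as `cup_0` rather than raw labels keeps relations unambiguous when the detector returns several objects of the same class.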

This feature will enable context-aware grasping by making the spatial relationships between objects in the scene explicit.
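
For the human-readable summary requirement, one option is to render the structured relations through fixed phrase templates instead of asking the model for free text. The sketch below assumes a hypothetical graph layout with `objects` (each having `id` and `label`) and `relations` (each having `subject`, `predicate`, `object`); the function name `summarize_scene` and the phrase table are illustrative and not part of the issue's interface.

```python
from typing import Dict

# Phrase for each relation template listed in the requirements (wording is illustrative).
PHRASES = {
    "left_of": "is to the left of",
    "right_of": "is to the right of",
    "above": "is above",
    "below": "is below",
    "near": "is near",
    "far_from": "is far from",
    "in_front_of": "is in front of",
    "behind": "is behind",
    "touching": "is touching",
    "overlapping": "overlaps",
}


def summarize_scene(graph: Dict) -> str:
    """Turn a scene graph into a short English summary, one sentence per relation."""
    labels = {obj["id"]: obj["label"] for obj in graph["objects"]}
    names = ", ".join(labels.values())
    sentences = [f"The scene contains {len(labels)} object(s): {names}."]
    for rel in graph["relations"]:
        phrase = PHRASES.get(rel["predicate"])
        if phrase is None:
            continue  # skip predicates outside the template list
        sentences.append(f"The {labels[rel['subject']]} {phrase} the {labels[rel['object']]}.")
    return " ".join(sentences)


if __name__ == "__main__":
    example = {
        "objects": [{"id": "cup_0", "label": "cup"}, {"id": "plate_1", "label": "plate"}],
        "relations": [{"subject": "cup_0", "predicate": "left_of", "object": "plate_1"}],
    }
    print(summarize_scene(example))
```

Template-based rendering keeps the summary consistent with the scene graph, since every sentence maps back to exactly one relation entry.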

Stage 4: Scene Understanding using Vision Language Model #3
