[Bug]: Missing resizing for MedSigLIP patch encoder #26

@CSMYang

Description

Describe the bug

MedSigLIP's official demo performs explicit input image resizing to 448×448 outside of the processor, and this step is missing from the current implementation. The processor function does not resize the inputs; it only adds padding and constructs a vision-language sequence. It also raises no error for mis-sized inputs as long as the image dimensions are a multiple of 16 (the patch size), so the problem passes silently.

Check their official demonstration:
https://huggingface.co/google/medsiglip-448

Specifically, the demo contains the following lines (imports reconstructed from context; `tf_resize` is assumed to be an alias for `tf.image.resize`):

import numpy as np
import tensorflow as tf
from PIL import Image

tf_resize = tf.image.resize  # alias assumed from the demo context

def resize(image):
    return Image.fromarray(
        tf_resize(
            images=image, size=[448, 448], method='bilinear', antialias=False
        ).numpy().astype(np.uint8)
    )


resized_imgs = [resize(img) for img in imgs]

texts = [
    "a photo of an arm with no rash",
    "a photo of an arm with a rash",
    "a photo of a leg with no rash",
    "a photo of a leg with a rash"
]

inputs = processor(text=texts, images=resized_imgs, padding="max_length", return_tensors="pt").to(device)
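For reference, a minimal sketch of the missing pre-resize step without a TensorFlow dependency, using PIL's bilinear resampling. This is an assumption, not the library's method: bilinear resampling here approximates, but is not pixel-identical to, the demo's `tf.image.resize(..., method='bilinear', antialias=False)`; the function name `resize_for_medsiglip` is hypothetical.

```python
from PIL import Image
import numpy as np

TARGET_SIZE = (448, 448)  # MedSigLIP's expected input resolution (width, height)

def resize_for_medsiglip(image: Image.Image) -> Image.Image:
    """Resize an image to 448x448 with bilinear interpolation.

    Hypothetical helper approximating the official demo's TensorFlow
    resize; exact pixel values may differ slightly.
    """
    return image.resize(TARGET_SIZE, resample=Image.BILINEAR)

# Example: a 512x384 dummy image is brought to 448x448 before being
# handed to the processor, which itself performs no resizing.
img = Image.fromarray(np.zeros((384, 512, 3), dtype=np.uint8))
resized = resize_for_medsiglip(img)
```

A step like this could either be documented as a required preprocessing step or folded into the processor itself so that non-448×448 inputs are handled instead of passing through silently.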

Metadata

Labels: bug (Something isn't working)