This is a command-line interface (CLI) for the VisionCaptioner tool. It lets you batch process images and videos without opening the GUI window. You can use it to generate text captions (with Qwen-VL or Google Gemma 4) or segmentation masks (with SAM3), or to extract frames from videos based on a text prompt.
Run the script using Python from your terminal or command prompt:
python cli.py --folder "/path/to/images" --mode caption

[Auto-Configuration from GUI]
By default, the CLI attempts to read settings.json, which is generated by the GUI.
If you have previously configured your settings in the GUI (for Captioning, Masking, or Video Extraction), you do not need to repeat them on the command line. Pointing to the folder is usually enough.
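For orientation, settings.json is a plain JSON file of saved options. The exact keys depend on your GUI version; every key name below is purely illustrative, not the tool's actual schema:

```json
{
  "mode": "caption",
  "model": "Qwen2-VL-7B",
  "quant": "Int8",
  "prompt": "Describe this image."
}
```

Any flag you pass explicitly on the command line takes precedence over the saved value.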
These arguments apply to all modes.
--folder (Required)
Path to the folder containing your media files.
- For 'caption' and 'mask' modes, this should be a folder of images/videos.
- For 'video' mode, this can be a folder of videos OR a path to a single video file.
--mode
Operation mode.
Options: caption, mask, video
Default: caption
--output
Optional. The folder where extracted frames and masks from 'video' mode will be saved.
If omitted, files are saved to the input folder.
--skip-existing
If present, the tool skips files that already have results.
- In Caption mode: Skips images with existing .txt files.
- In Mask mode: Skips images with existing *-masklabel.png files.
- In Video mode: This is not currently used.

Command: --mode caption (Default)
These arguments control the captioning model for text generation. Both Qwen-VL and Google Gemma 4 model families are supported — the correct backend is selected automatically based on the model folder.
--model
The folder name of the model (inside the /models directory) or the
full path to the model.
--quant
Quantization level to control VRAM usage.
Options: None, FP16, Int8, NF4
Default: None
--res
Maximum image resolution (side length). Used by Qwen models.
Ignored for Gemma 4 (which uses --vision-tokens instead).
Examples: 336, 512, 1024
Default: 512
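A side-length cap like --res implies scaling the image down so its longer side does not exceed the limit while preserving aspect ratio. The helper below is an illustrative sketch, not the tool's actual code (the real implementation may round or snap dimensions differently):

```python
def capped_size(width, height, max_side=512):
    """Return (w, h) scaled so the longer side is at most max_side.

    Illustrative only -- mirrors the documented intent of --res.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the cap, no resize
    scale = max_side / longest
    return round(width * scale), round(height * scale)
```

For example, a 1024x768 image with --res 512 would be processed at roughly 512x384.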
--vision-tokens
Soft visual token budget per image for Gemma 4 models.
Higher = more detail, slower, more VRAM. Ignored for Qwen.
Options: 70, 140, 280, 560, 1120
Default: 280 (the model default)
--batch-size
Number of images processed at once.
Default: 4
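Conceptually, --batch-size just controls how the input images are chunked for each forward pass. A minimal sketch (hypothetical helper, not the tool's own code):

```python
def batches(items, batch_size=4):
    """Yield successive batches of images for one model forward pass."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Larger batches are faster per image but use more VRAM; the last batch may be smaller than batch_size.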
--frame-count
Number of frames to extract from video files for captioning.
Default: 8
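One plausible way to pick --frame-count frames is to spread them evenly across the video's length. This sketch is an assumption about the sampling strategy; the tool may select frames differently:

```python
def sample_frame_indices(total_frames, frame_count=8):
    """Pick frame_count indices spread evenly across a video.

    Hypothetical sketch of frame selection for video captioning.
    """
    if total_frames <= frame_count:
        return list(range(total_frames))
    step = total_frames / frame_count
    return [int(i * step) for i in range(frame_count)]
```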
--max-tokens
Maximum number of tokens to generate.
Default: 1024
--prompt
The main instruction for the AI.
Default: "Describe this image."
--suffix
Text appended to the end of the system prompt (e.g., negative constraints).
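Per the description above, --suffix is appended after the main instruction. A hypothetical sketch of that assembly (the tool's exact concatenation, spacing, and separators may differ):

```python
def build_instruction(prompt, suffix=None):
    """Append --suffix to the end of the instruction, as documented.

    Illustrative helper -- not the tool's actual prompt template.
    """
    if not suffix:
        return prompt
    return f"{prompt} {suffix}"
```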
--trigger
A specific word or phrase to prepend to the start of every output caption.

Example A: Use GUI settings but override the folder
python cli.py --folder "C:\Images\Dataset"

Example B: Int8 quantization with a custom prompt
python cli.py --folder "C:/Images/Dataset" --quant Int8 --prompt "Describe the lighting."

Example C: Resume a stopped job (skip existing)
python cli.py --folder "C:/Images/Dataset" --skip-existing

Example D: Caption with Gemma 4 at high detail
python cli.py --folder "C:/Images/Dataset" --model Gemma-4-E2B-it --vision-tokens 560

Command: --mode mask
These arguments control the SAM3 model for segmentation. Masks are saved as filename-masklabel.png.
--mask-prompt (Required)
The text prompt for the object you want to mask (e.g., "person", "car", "face").
--mask-expand
Percentage to expand the generated mask outwards.
Useful to ensure the entire object is covered.
Range: 0.0 to 50.0
Default: 3.0
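To make the expansion concrete, the sketch below dilates a binary mask by a pixel radius derived from the percentage. This is one plausible interpretation (square/Chebyshev dilation, radius as a percentage of the longer side); the real tool may use a different kernel or percentage basis:

```python
def expand_mask(mask, percent):
    """Dilate a binary mask (list of lists of 0/1) outwards.

    Hypothetical sketch of --mask-expand: the radius is taken as
    `percent` of the longer side, using a square dilation kernel.
    """
    h, w = len(mask), len(mask[0])
    radius = round(max(h, w) * percent / 100)
    if radius == 0:
        return [row[:] for row in mask]  # no expansion requested
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # mark every pixel within `radius` of a mask pixel
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out
```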
--mask-res
The resolution to downscale images to during processing (saves VRAM).
Default: 1024
--crop-to-mask
If present, this crops the original image and generated mask to the mask's bounding box.
The original, uncropped image is saved as a backup in a subfolder named 'uncropped',
and the cropped image overwrites the original file.

Example A: Basic masking of a subject
python cli.py --folder "C:\Images\Portraits" --mode mask --mask-prompt "person"

Example B: Tighter mask with no expansion
python cli.py --folder "C:\Images\Cars" --mode mask --mask-prompt "car" --mask-expand 0

Command: --mode video
This mode processes video files using SAM3 to find and extract frames containing a specified object. For each matching frame, it saves both the image and its segmentation mask, making it ideal for creating image datasets from video footage.
--mask-prompt (Required)
The text prompt for the object to detect in each video frame (e.g., "person").
--video-step
The interval for frame scanning. A value of 30 means every 30th frame is processed.
Default: 30
--video-start
The frame number to start processing from.
Default: 0
--video-end
The frame number to stop processing at.
Default: -1 (which means the end of the video).
--video-conf
The confidence threshold (0.0 to 1.0) for mask detection.
Lower values detect more objects, but with more false positives.
Default: 0.25

Note: The arguments --mask-res, --mask-expand, and --crop-to-mask from mask mode are also used here to control how each extracted frame is processed.
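The scanning window defined by --video-start, --video-end, and --video-step can be sketched as a simple stride over frame indices. This helper is illustrative (for instance, whether --video-end is inclusive is an assumption here):

```python
def frames_to_scan(total_frames, step=30, start=0, end=-1):
    """List the frame indices the scanner would visit.

    Hypothetical sketch of --video-step/--video-start/--video-end;
    -1 for `end` means the last frame of the video.
    """
    stop = total_frames if end == -1 else min(end, total_frames)
    return list(range(start, stop, step))
```

So a 100-frame clip scanned with the default step of 30 would visit frames 0, 30, 60, and 90; each visited frame is kept only if the detection confidence meets --video-conf.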
Example A: Extract all frames with a person from a single video
python cli.py --folder "C:\Videos\interview.mp4" --mode video --mask-prompt "person"

Example B: Batch process a folder of videos, saving to a new dataset folder
python cli.py --folder "C:\Videos" --output "C:\Datasets\Cars" --mode video --mask-prompt "car"

Example C: Extract every 10th frame from a specific time range and crop the output
python cli.py --folder "clip.mp4" --mode video --mask-prompt "cat" --video-start 1000 --video-end 5000 --video-step 10 --crop-to-mask