This is a command-line interface (CLI) for the VisionCaptioner tool. It lets you batch process images and videos without opening the GUI window. You can use it to generate text captions (with Qwen-VL or Google Gemma 4) or segmentation masks (with SAM3), or to extract frames from videos based on a text prompt.
Run the script using Python from your terminal or command prompt:
python cli.py --folder "/path/to/images" --mode caption

[Auto-Configuration from GUI]
By default, the CLI attempts to read settings.json, which is generated by the GUI.
If you have previously configured your settings in the GUI (for Captioning, Masking, or Video Extraction), you do not need to repeat them on the command line. Pointing to the folder is usually enough.
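For orientation, settings.json is a plain JSON file of saved options. The exact keys depend on your GUI version; every key name below is purely illustrative, not the tool's actual schema:

```json
{
  "mode": "caption",
  "model": "Qwen2-VL-7B",
  "quant": "Int8",
  "prompt": "Describe this image."
}
```

Any flag you pass explicitly on the command line takes precedence over the saved value.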
These arguments apply to all modes.
--folder (Required)
Path to the folder containing your media files.
- For 'caption' and 'mask' modes, this should be a folder of images/videos.
- For 'video' mode, this can be a folder of videos OR a path to a single video file.
--mode
Operation mode.
Options: caption, mask, video
Default: caption
--output
Optional. The folder where extracted frames and masks from 'video' mode will be saved.
If omitted, files are saved to the input folder.
--skip-existing
If present, the tool skips files that already have results.
- In Caption mode: Skips images with existing .txt files.
- In Mask mode: Skips images with existing *-masklabel.png files.
- In Video mode: This is not currently used.

Command: --mode caption (Default)
These arguments control the captioning model for text generation. Both Qwen-VL and Google Gemma 4 model families are supported — the correct backend is selected automatically based on the model folder.
--model
The folder name of the model (inside the /models directory) or the
full path to the model.
--quant
Quantization level to control VRAM usage.
Options: None, FP16, Int8, NF4
Default: None
--res
Maximum image resolution (side length). Used by Qwen models.
Ignored for Gemma 4 (which uses --vision-tokens instead).
Examples: 336, 512, 1024
Default: 512
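A side-length cap like --res implies scaling the image down so its longer side does not exceed the limit while preserving aspect ratio. The helper below is an illustrative sketch, not the tool's actual code (the real implementation may round or snap dimensions differently):

```python
def capped_size(width, height, max_side=512):
    """Return (w, h) scaled so the longer side is at most max_side.

    Illustrative only -- mirrors the documented intent of --res.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the cap, no resize
    scale = max_side / longest
    return round(width * scale), round(height * scale)
```

For example, a 1024x768 image with --res 512 would be processed at roughly 512x384.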
--vision-tokens
Soft visual token budget per image for Gemma 4 models.
Higher = more detail, slower, more VRAM. Ignored for Qwen.
Options: 70, 140, 280, 560, 1120
Default: 280 (the model default)
--batch-size
Number of images processed at once.
Default: 4
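Conceptually, --batch-size just controls how the input images are chunked for each forward pass. A minimal sketch (hypothetical helper, not the tool's own code):

```python
def batches(items, batch_size=4):
    """Yield successive batches of images for one model forward pass."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Larger batches are faster per image but use more VRAM; the last batch may be smaller than batch_size.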
--frame-count
Number of frames to extract from video files for captioning.
Default: 8
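One plausible way to pick --frame-count frames is to spread them evenly across the video's length. This sketch is an assumption about the sampling strategy; the tool may select frames differently:

```python
def sample_frame_indices(total_frames, frame_count=8):
    """Pick frame_count indices spread evenly across a video.

    Hypothetical sketch of frame selection for video captioning.
    """
    if total_frames <= frame_count:
        return list(range(total_frames))
    step = total_frames / frame_count
    return [int(i * step) for i in range(frame_count)]
```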
--max-tokens
Maximum number of tokens to generate.
Default: 1024
--prompt
The main instruction for the AI.
Default: "Describe this image."
--suffix
Text appended to the end of the system prompt (e.g., negative constraints).
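Per the description above, --suffix is appended after the main instruction. A hypothetical sketch of that assembly (the tool's exact concatenation, spacing, and separators may differ):

```python
def build_instruction(prompt, suffix=None):
    """Append --suffix to the end of the instruction, as documented.

    Illustrative helper -- not the tool's actual prompt template.
    """
    if not suffix:
        return prompt
    return f"{prompt} {suffix}"
```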
--trigger
A specific word or phrase to prepend to the start of every output caption.

Example A: Use GUI settings but override the folder
python cli.py --folder "C:\Images\Dataset"

Example B: Int8 quantization with a custom prompt
python cli.py --folder "C:/Images/Dataset" --quant Int8 --prompt "Describe the lighting."

Example C: Resume a stopped job (skip existing)
python cli.py --folder "C:/Images/Dataset" --skip-existing

Example D: Caption with Gemma 4 at high detail
python cli.py --folder "C:/Images/Dataset" --model Gemma-4-E2B-it --vision-tokens 560

Command: --mode mask
These arguments control the SAM3 model for segmentation. Masks are saved as filename-masklabel.png.
--mask-prompt (Required)
The text prompt for the object you want to mask (e.g., "person", "car", "face").
--mask-expand
Percentage to expand the generated mask outwards.
Useful to ensure the entire object is covered.
Range: 0.0 to 50.0
Default: 3.0
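To make the expansion concrete, the sketch below dilates a binary mask by a pixel radius derived from the percentage. This is one plausible interpretation (square/Chebyshev dilation, radius as a percentage of the longer side); the real tool may use a different kernel or percentage basis:

```python
def expand_mask(mask, percent):
    """Dilate a binary mask (list of lists of 0/1) outwards.

    Hypothetical sketch of --mask-expand: the radius is taken as
    `percent` of the longer side, using a square dilation kernel.
    """
    h, w = len(mask), len(mask[0])
    radius = round(max(h, w) * percent / 100)
    if radius == 0:
        return [row[:] for row in mask]  # no expansion requested
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # mark every pixel within `radius` of a mask pixel
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            out[ny][nx] = 1
    return out
```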
--mask-res
The resolution to downscale images to during processing (saves VRAM).
Default: 1024
--crop-to-mask
If present, this crops the original image and generated mask to the mask's bounding box.
The original, uncropped image is saved as a backup in a subfolder named 'uncropped',
and the cropped image overwrites the original file.

Example A: Basic masking of a subject
python cli.py --folder "C:\Images\Portraits" --mode mask --mask-prompt "person"

Example B: Tighter mask with no expansion
python cli.py --folder "C:\Images\Cars" --mode mask --mask-prompt "car" --mask-expand 0

Command: --mode video
This mode processes video files using SAM3 to find and extract frames containing a specified object. For each matching frame, it saves both the image and its segmentation mask, making it ideal for creating image datasets from video footage.
--mask-prompt (Required)
The text prompt for the object to detect in each video frame (e.g., "person").
--video-step
The interval for frame scanning. A value of 30 means every 30th frame is processed.
Default: 30
--video-start
The frame number to start processing from.
Default: 0
--video-end
The frame number to stop processing at.
Default: -1 (which means the end of the video).
--video-conf
The confidence threshold (0.0 to 1.0) for mask detection.
Lower values detect more objects, but with more false positives.
Default: 0.25

Note: The arguments --mask-res, --mask-expand, and --crop-to-mask from mask mode are also used here to control how each extracted frame is processed.
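The scanning window defined by --video-start, --video-end, and --video-step can be sketched as a simple stride over frame indices. This helper is illustrative (for instance, whether --video-end is inclusive is an assumption here):

```python
def frames_to_scan(total_frames, step=30, start=0, end=-1):
    """List the frame indices the scanner would visit.

    Hypothetical sketch of --video-step/--video-start/--video-end;
    -1 for `end` means the last frame of the video.
    """
    stop = total_frames if end == -1 else min(end, total_frames)
    return list(range(start, stop, step))
```

So a 100-frame clip scanned with the default step of 30 would visit frames 0, 30, 60, and 90; each visited frame is kept only if the detection confidence meets --video-conf.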
Example A: Extract all frames with a person from a single video
python cli.py --folder "C:\Videos\interview.mp4" --mode video --mask-prompt "person"

Example B: Batch process a folder of videos, saving to a new dataset folder
python cli.py --folder "C:\Videos" --output "C:\Datasets\Cars" --mode video --mask-prompt "car"

Example C: Extract every 10th frame from a specific time range and crop the output
python cli.py --folder "clip.mp4" --mode video --mask-prompt "cat" --video-start 1000 --video-end 5000 --video-step 10 --crop-to-mask