
MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents


This repository contains the end-to-end evaluation framework for MMDeepResearch-Bench (MMDR-Bench).


✨ Key Features

🔬 Innovative Metrics for Grounded Research Quality

  • FLAE (Formula-LLM Adaptive Evaluation): Measures report quality (readability, insightfulness, structure).
  • TRACE (Trustworthy Retrieval-Aligned Citation Evaluation): Verifies citation support and claim-URL alignment.
    • VEF (Visual Evidence Fidelity): A strict gatekeeper enforcing alignment between textual claims and visual evidence (PASS/FAIL).
  • MOSAIC (Multimodal Support-Aligned Integrity Check): Validates consistency between generated text and visual artifacts (Charts, Diagrams, Photos).
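As a toy illustration of how a PASS/FAIL gate like VEF can interact with a continuous quality score (the gating rule below is an assumption for illustration, not the repository's actual logic):

```python
def gate_with_vef(mm_score: float, vef_pass: bool) -> float:
    """Illustrative gate: a VEF FAIL forfeits the multimodal credit,
    while a PASS leaves the continuous score untouched."""
    return mm_score if vef_pass else 0.0
```

In this sketch a report whose visual evidence fails verification scores 0.0 on the multimodal dimension regardless of how well its text reads.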

🛠️ Engineering & Usability

  • Smart Resume: Skips already-completed tasks to reduce time and API cost.
  • Graceful Stop: Safe shutdown via CLI (stop, exit) or Ctrl+C, ensuring partial results are flushed.
  • Precision Debugging: Run a single case with --quiz_first or --quiz_index.
  • Multi-Provider Support: Google Gemini, Azure OpenAI, OpenRouter.
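The smart-resume behavior can be sketched as a scan of the results JSONL for already-recorded task IDs; the `task_id` field name and file layout here are assumptions for illustration, not the repository's code:

```python
import json
from pathlib import Path


def completed_task_ids(results_path):
    """Collect IDs of tasks already recorded in a results JSONL file."""
    done = set()
    path = Path(results_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line)["task_id"])
    return done


def pending_tasks(all_tasks, results_path):
    """Return only the tasks not yet present in the results file."""
    done = completed_task_ids(results_path)
    return [t for t in all_tasks if t["task_id"] not in done]
```

Re-running with the same run ID would then process only the remaining tasks, saving both time and API cost.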

📦 Installation

1) Clone

git clone https://github.com/YourUsername/MMDR.git
cd MMDR

2) Install dependencies

pip install -r requirements.txt

⚙️ Configuration

1) Create .env

cp env.txt .env

2) Edit .env

Example (adjust to your providers/models):

# --- Roles ---
MMDR_REPORT_PROVIDER=gemini       # gemini | azure | openrouter
MMDR_JUDGE_PROVIDER=azure         # recommended: strong reasoning model

# --- Models ---
MMDR_REPORT_MODEL=gemini-1.5-pro
MMDR_JUDGE_MODEL=gpt-4o

# --- API Keys / Endpoints ---
GEMINI_API_KEY=AIza...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://...
OPENROUTER_API_KEY=...
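A minimal sketch of how these variables might be read and validated at startup; the function name and the exact required-key mapping are illustrative, not taken from the repository:

```python
import os

# Which environment variables each provider needs before a run can start.
REQUIRED_KEYS = {
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "openrouter": ["OPENROUTER_API_KEY"],
}


def load_role_config(role):
    """Read provider/model for a role ('REPORT' or 'JUDGE') and check its keys."""
    provider = os.environ.get(f"MMDR_{role}_PROVIDER", "gemini")
    model = os.environ.get(f"MMDR_{role}_MODEL", "")
    missing = [k for k in REQUIRED_KEYS.get(provider, []) if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing env vars for provider '{provider}': {missing}")
    return provider, model
```

Failing fast on missing keys at startup avoids discovering a misconfigured provider halfway through a costly batch run.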

🚀 Usage

1) Quick verification (recommended first run)

Run only the first question to confirm that API access and file paths are configured correctly:

python run_pipeline.py --quiz_first

2) Full batch run

Process all tasks in quiz.jsonl:

python run_pipeline.py --run_id experiment_v1

3) Targeted debugging

Re-run a single item by 1-based index:

python run_pipeline.py --quiz_index 5 --run_id debug_q5

4) Parallel mode

python run_pipeline.py --max_workers 4
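Taken together, the CLI flags above can be sketched with `argparse` plus a `ThreadPoolExecutor` honoring `--max_workers`. This is a simplified sketch of the flag handling, not the actual `run_pipeline.py`:

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    p = argparse.ArgumentParser(description="MMDR-Bench pipeline (sketch)")
    p.add_argument("--run_id", default="default_run")
    p.add_argument("--quiz_first", action="store_true")
    p.add_argument("--quiz_index", type=int, default=None)  # 1-based index
    p.add_argument("--max_workers", type=int, default=1)
    return p.parse_args(argv)


def select_tasks(tasks, args):
    """Narrow the task list according to the debugging flags."""
    if args.quiz_first:
        return tasks[:1]
    if args.quiz_index is not None:
        return [tasks[args.quiz_index - 1]]  # convert 1-based to 0-based
    return tasks


def run_batch(tasks, worker, max_workers):
    """Run tasks concurrently, preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))
```

Threads (rather than processes) suit this workload because each task is dominated by network-bound API calls.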

🎮 Runtime Controls

| Command | Action |
| --- | --- |
| `stop` (or `exit`) + Enter | Safely stop after current tasks finish; saves outputs |
| Ctrl+C | Triggers the same graceful shutdown behavior |
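One common way to implement this kind of graceful stop is a shared `threading.Event` checked between tasks, which a console listener (or a `KeyboardInterrupt` handler) sets. A hypothetical sketch, not the repository's implementation:

```python
import threading

stop_event = threading.Event()  # set by the console listener or a SIGINT handler


def listen_for_stop():
    """Hypothetical console listener: typing 'stop' or 'exit' requests shutdown."""
    while not stop_event.is_set():
        try:
            command = input()
        except EOFError:
            break
        if command.strip().lower() in ("stop", "exit"):
            stop_event.set()


def run_tasks(tasks, process_one, flush_results):
    """Finish the in-flight task, then flush whatever has completed."""
    results = []
    for task in tasks:
        if stop_event.is_set():
            break  # stop requested: do not start new tasks
        results.append(process_one(task))
    flush_results(results)  # partial results are always written out
    return results
```

Because the event is only checked between tasks, the task in flight runs to completion, matching the "safely stop after current tasks finish" behavior described above.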

📂 Output Structure

Outputs are written to reports_runs/<RUN_ID>/:

reports_runs/experiment_v1/
├── reports/                  # Markdown research reports
│   ├── Q1.md
│   └── ...
├── results/
│   └── experiment_v1.jsonl   # detailed logs (scores/errors/timings)
├── summary/
│   ├── experiment_v1.json    # machine-readable aggregated metrics
│   └── experiment_v1.txt     # human-readable summary
└── mm/                       # multimodal intermediate artifacts
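Creating this layout at the start of a run takes only a few lines; a sketch assuming the directory names shown above:

```python
from pathlib import Path


def init_run_dirs(run_id, base="reports_runs"):
    """Create the per-run output layout: reports/, results/, summary/, mm/."""
    root = Path(base) / run_id
    for sub in ("reports", "results", "summary", "mm"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

`exist_ok=True` makes the call idempotent, which plays well with the smart-resume behavior: re-running the same run ID reuses the existing directories.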

📊 Metrics Explanation

The pipeline outputs three aggregate scores and one final combined score:

| Aggregate | Full Name | Sub-metrics (Leaderboard) |
| --- | --- | --- |
| GEN | General Quality (FLAE) | Read. (Readability), Insh. (Insightfulness), Stru. (Structure), Coherence |
| EVI | Evidence Quality (TRACE) | Con. (Concordance), Cov. (Coverage), Fid. (Fidelity), Diversity |
| MM | Multimodal Quality (MOSAIC) | Sem. (Semantic), Vef. (Faithfulness), Acc. (Data Accuracy), VQA (VQA Score) |
| FINAL_MMDR | Weighted combination of the above | -- |

All sub-metrics are available in the output JSON file under aggregates.{research|all}.submetrics:

submetrics.general   ->  general.R, general.I, general.S, general.C, ...
submetrics.evidence  ->  evidence.E_con, evidence.E_cov, evidence.E_fid, evidence.E_div, ...
submetrics.mm        ->  mm.avg_metric_by_dim.semantic, .faithful, .data_accuracy, .vqa_score, ...
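Since FINAL_MMDR is a weighted combination of the GEN, EVI, and MM aggregates, it can be sketched as below. The weights here are placeholders for illustration; the actual weights are defined by the benchmark's scoring scripts:

```python
def final_mmdr(gen, evi, mm, weights=(0.4, 0.3, 0.3)):
    """Weighted combination of the three aggregates.
    The (0.4, 0.3, 0.3) split is a placeholder, not the official weighting."""
    w_gen, w_evi, w_mm = weights
    return w_gen * gen + w_evi * evi + w_mm * mm
```

With weights summing to 1, the combined score stays on the same scale as its inputs.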

For detailed computation logic, see:

  • scoring_general.py -- GEN (FLAE)
  • scoring_evidence.py -- EVI (TRACE)
  • mm_router5_aggregate.py -- MM (MOSAIC)
  • accuracy.py -- VEF (verification gating)

🧾 Citation

If you find this codebase or the MMDR-Bench dataset useful in your research, please cite:

@misc{huang2026mmdeepresearchbenchbenchmarkmultimodaldeep,
      title={MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents}, 
      author={Peizhou Huang and Zixuan Zhong and Zhongwei Wan and Donghao Zhou and Samiul Alam and Xin Wang and Zexin Li and Zhihao Dou and Li Zhu and Jing Xiong and Chaofan Tao and Yan Xu and Dimitrios Dimitriadis and Tuo Zhang and Mi Zhang},
      year={2026},
      eprint={2601.12346},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.12346}, 
}

📬 Contact and Community Results

If you run MMDR-Bench and obtain interesting results, please submit them through our Google Form:

Submit results and feedback via Google Form

We welcome reports on:

  • new model results
  • reproduction logs
  • implementation issues
  • suggestions for future benchmark extensions

📜 License

This project is released under the Apache-2.0 License. See LICENSE.