
MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents


This repository contains the end-to-end evaluation framework for MMDeepResearch-Bench (MMDR-Bench).


✨ Key Features

🔬 Innovative Metrics for Grounded Research Quality

  • FLAE (Formula-LLM Adaptive Evaluation): Measures report quality (readability, insightfulness, structure).
  • TRACE (Trustworthy Retrieval-Aligned Citation Evaluation): Verifies citation support and claim-URL alignment.
    • VEF (Visual Evidence Fidelity): A strict gatekeeper enforcing alignment between textual claims and visual evidence (PASS/FAIL).
  • MOSAIC (Multimodal Support-Aligned Integrity Check): Validates consistency between generated text and visual artifacts (Charts, Diagrams, Photos).
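As a toy illustration of how a PASS/FAIL gate like VEF can interact with a continuous quality score (the gating rule below is an assumption for illustration, not the repository's actual logic):

```python
def gate_with_vef(mm_score: float, vef_pass: bool) -> float:
    """Illustrative gate: a VEF FAIL forfeits the multimodal credit,
    while a PASS leaves the continuous score untouched."""
    return mm_score if vef_pass else 0.0
```

In this sketch a report whose visual evidence fails verification scores 0.0 on the multimodal dimension regardless of how well its text reads.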

🛠️ Engineering & Usability

  • Smart Resume: Skips already-completed tasks to reduce time and API cost.
  • Graceful Stop: Safe shutdown via CLI (stop, exit) or Ctrl+C, ensuring partial results are flushed.
  • Precision Debugging: Run a single case with --quiz_first or --quiz_index.
  • Multi-Provider Support: Google Gemini, Azure OpenAI, OpenRouter.
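The smart-resume behavior can be sketched as a scan of the results JSONL for already-recorded task IDs; the `task_id` field name and file layout here are assumptions for illustration, not the repository's code:

```python
import json
from pathlib import Path


def completed_task_ids(results_path):
    """Collect IDs of tasks already recorded in a results JSONL file."""
    done = set()
    path = Path(results_path)
    if path.exists():
        for line in path.read_text().splitlines():
            if line.strip():
                done.add(json.loads(line)["task_id"])
    return done


def pending_tasks(all_tasks, results_path):
    """Return only the tasks not yet present in the results file."""
    done = completed_task_ids(results_path)
    return [t for t in all_tasks if t["task_id"] not in done]
```

Re-running with the same run ID would then process only the remaining tasks, saving both time and API cost.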

📦 Installation

1) Clone

git clone https://github.com/YourUsername/MMDR.git
cd MMDR

2) Install dependencies

pip install -r requirements.txt

⚙️ Configuration

1) Create .env

cp env.txt .env

2) Edit .env

Example (adjust to your providers/models):

# --- Roles ---
MMDR_REPORT_PROVIDER=gemini       # gemini | azure | openrouter
MMDR_JUDGE_PROVIDER=azure         # recommended: strong reasoning model

# --- Models ---
MMDR_REPORT_MODEL=gemini-1.5-pro
MMDR_JUDGE_MODEL=gpt-4o

# --- API Keys / Endpoints ---
GEMINI_API_KEY=AIza...
AZURE_OPENAI_API_KEY=...
AZURE_OPENAI_ENDPOINT=https://...
OPENROUTER_API_KEY=...
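A minimal sketch of how these variables might be read and validated at startup; the function name and the exact required-key mapping are illustrative, not taken from the repository:

```python
import os

# Which environment variables each provider needs before a run can start.
REQUIRED_KEYS = {
    "gemini": ["GEMINI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "openrouter": ["OPENROUTER_API_KEY"],
}


def load_role_config(role):
    """Read provider/model for a role ('REPORT' or 'JUDGE') and check its keys."""
    provider = os.environ.get(f"MMDR_{role}_PROVIDER", "gemini")
    model = os.environ.get(f"MMDR_{role}_MODEL", "")
    missing = [k for k in REQUIRED_KEYS.get(provider, []) if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing env vars for provider '{provider}': {missing}")
    return provider, model
```

Failing fast on missing keys at startup avoids discovering a misconfigured provider halfway through a costly batch run.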

🚀 Usage

1) Quick verification (recommended first run)

Run only the first question to confirm that API access and file paths are configured correctly:

python run_pipeline.py --quiz_first

2) Full batch run

Process all tasks in quiz.jsonl:

python run_pipeline.py --run_id experiment_v1

3) Targeted debugging

Re-run a single item by 1-based index:

python run_pipeline.py --quiz_index 5 --run_id debug_q5

4) Parallel mode

python run_pipeline.py --max_workers 4
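Taken together, the CLI flags above can be sketched with `argparse` plus a `ThreadPoolExecutor` honoring `--max_workers`. This is a simplified sketch of the flag handling, not the actual `run_pipeline.py`:

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv=None):
    p = argparse.ArgumentParser(description="MMDR-Bench pipeline (sketch)")
    p.add_argument("--run_id", default="default_run")
    p.add_argument("--quiz_first", action="store_true")
    p.add_argument("--quiz_index", type=int, default=None)  # 1-based index
    p.add_argument("--max_workers", type=int, default=1)
    return p.parse_args(argv)


def select_tasks(tasks, args):
    """Narrow the task list according to the debugging flags."""
    if args.quiz_first:
        return tasks[:1]
    if args.quiz_index is not None:
        return [tasks[args.quiz_index - 1]]  # convert 1-based to 0-based
    return tasks


def run_batch(tasks, worker, max_workers):
    """Run tasks concurrently, preserving input order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))
```

Threads (rather than processes) suit this workload because each task is dominated by network-bound API calls.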

🎮 Runtime Controls

| Command | Action |
| --- | --- |
| `stop` (or `exit`) + Enter | Safely stop after current tasks finish; saves outputs |
| Ctrl+C | Triggers the same graceful shutdown behavior |
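One common way to implement this kind of graceful stop is a shared `threading.Event` checked between tasks, which a console listener (or a `KeyboardInterrupt` handler) sets. A hypothetical sketch, not the repository's implementation:

```python
import threading

stop_event = threading.Event()  # set by the console listener or a SIGINT handler


def listen_for_stop():
    """Hypothetical console listener: typing 'stop' or 'exit' requests shutdown."""
    while not stop_event.is_set():
        try:
            command = input()
        except EOFError:
            break
        if command.strip().lower() in ("stop", "exit"):
            stop_event.set()


def run_tasks(tasks, process_one, flush_results):
    """Finish the in-flight task, then flush whatever has completed."""
    results = []
    for task in tasks:
        if stop_event.is_set():
            break  # stop requested: do not start new tasks
        results.append(process_one(task))
    flush_results(results)  # partial results are always written out
    return results
```

Because the event is only checked between tasks, the task in flight runs to completion, matching the "safely stop after current tasks finish" behavior described above.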

📂 Output Structure

Outputs are written to reports_runs/<RUN_ID>/:

reports_runs/experiment_v1/
├── reports/                  # Markdown research reports
│   ├── Q1.md
│   └── ...
├── results/
│   └── experiment_v1.jsonl   # detailed logs (scores/errors/timings)
├── summary/
│   ├── experiment_v1.json    # machine-readable aggregated metrics
│   └── experiment_v1.txt     # human-readable summary
└── mm/                       # multimodal intermediate artifacts
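Creating this layout at the start of a run takes only a few lines; a sketch assuming the directory names shown above:

```python
from pathlib import Path


def init_run_dirs(run_id, base="reports_runs"):
    """Create the per-run output layout: reports/, results/, summary/, mm/."""
    root = Path(base) / run_id
    for sub in ("reports", "results", "summary", "mm"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

`exist_ok=True` makes the call idempotent, which plays well with the smart-resume behavior: re-running the same run ID reuses the existing directories.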

📊 Metrics Explanation

The pipeline outputs three aggregate scores and one final combined score:

| Aggregate | Full Name | Sub-metrics (Leaderboard) |
| --- | --- | --- |
| GEN | General Quality (FLAE) | Read. (Readability), Insh. (Insightfulness), Stru. (Structure), Coherence |
| EVI | Evidence Quality (TRACE) | Con. (Concordance), Cov. (Coverage), Fid. (Fidelity), Diversity |
| MM | Multimodal Quality (MOSAIC) | Sem. (Semantic), Vef. (Faithfulness), Acc. (Data Accuracy), VQA (VQA Score) |
| FINAL_MMDR | Weighted combination of the above | -- |

All sub-metrics are available in the output JSON file under aggregates.{research|all}.submetrics:

submetrics.general   ->  general.R, general.I, general.S, general.C, ...
submetrics.evidence  ->  evidence.E_con, evidence.E_cov, evidence.E_fid, evidence.E_div, ...
submetrics.mm        ->  mm.avg_metric_by_dim.semantic, .faithful, .data_accuracy, .vqa_score, ...
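Since FINAL_MMDR is a weighted combination of the GEN, EVI, and MM aggregates, it can be sketched as below. The weights here are placeholders for illustration; the actual weights are defined by the benchmark's scoring scripts:

```python
def final_mmdr(gen, evi, mm, weights=(0.4, 0.3, 0.3)):
    """Weighted combination of the three aggregates.
    The (0.4, 0.3, 0.3) split is a placeholder, not the official weighting."""
    w_gen, w_evi, w_mm = weights
    return w_gen * gen + w_evi * evi + w_mm * mm
```

With weights summing to 1, the combined score stays on the same scale as its inputs.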

For detailed computation logic, see:

  • scoring_general.py -- GEN (FLAE)
  • scoring_evidence.py -- EVI (TRACE)
  • mm_router5_aggregate.py -- MM (MOSAIC)
  • accuracy.py -- VEF (verification gating)

🧾 Citation

If you find this codebase or the MMDR-Bench dataset useful in your research, please cite:

@misc{huang2026mmdeepresearchbenchbenchmarkmultimodaldeep,
      title={MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents}, 
      author={Peizhou Huang and Zixuan Zhong and Zhongwei Wan and Donghao Zhou and Samiul Alam and Xin Wang and Zexin Li and Zhihao Dou and Li Zhu and Jing Xiong and Chaofan Tao and Yan Xu and Dimitrios Dimitriadis and Tuo Zhang and Mi Zhang},
      year={2026},
      eprint={2601.12346},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.12346}, 
}

📬 Contact and Community Results

If you run MMDR-Bench and obtain interesting results, please submit them through our Google Form:

Submit results and feedback via Google Form

We welcome reports on:

  • new model results
  • reproduction logs
  • implementation issues
  • suggestions for future benchmark extensions

📜 License

This project is released under the Apache-2.0 License. See LICENSE.