Applied-Machine-Learning-Lab/WWW2026_LAGMiD


LAGMiD

Implementation code for the paper "Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning".

LAGMiD addresses miscitation detection on the scholarly web by combining three components in a unified framework:

  1. LLM-based evidence-chain reasoning over text-rich citation graphs.
  2. LLM-to-GNN knowledge distillation with hop-level alignment.
  3. Iterative collaborative learning driven by predictive uncertainty.

Overview

Given a citation edge, LAGMiD first extracts a multi-hop evidence chain from the citation graph. The teacher LLM then performs stepwise verification and chain-of-thought reasoning over this evidence chain. The resulting hop-level reasoning states are distilled into a student GNN, whose edge representations are trained for scalable miscitation detection. During training, uncertainty-based routing selectively sends difficult cases to the teacher, which improves efficiency while preserving semantic supervision.
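The uncertainty-based routing step can be illustrated with a minimal sketch. It assumes the student produces a miscitation probability per edge and that uncertainty is measured as binary entropy. Both the entropy measure and the 0.9 threshold are assumptions made for illustration; the function names are hypothetical and are not taken from trainer.py.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def route_to_teacher(scores: dict, threshold: float = 0.9) -> list:
    """Return the edge ids whose prediction entropy exceeds the threshold.

    `scores` maps edge_id -> student miscitation probability.
    The threshold value is an illustrative assumption.
    """
    return [eid for eid, p in scores.items() if binary_entropy(p) > threshold]


# Hypothetical student outputs for four citation edges.
student_scores = {"E1": 0.97, "E2": 0.55, "E3": 0.08, "E4": 0.40}
uncertain = route_to_teacher(student_scores)
print(uncertain)  # edges that would be sent to the teacher LLM
```

Under this sketch, confident predictions (close to 0 or 1) stay with the student GNN, and only the edges near the decision boundary incur the cost of a teacher call.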

Repository Structure

lagmid/
  config.py            # configuration loading and validation
  data.py              # citation graph dataset
  evidence.py          # evidence-chain extraction
  prompts.py           # teacher prompts for verify / CoT / judge
  teacher.py           # causal LLM teacher
  text_backends.py     # SciBERT-based node/edge encoder
  model.py             # student GNN and edge classifier
  trainer.py           # training, distillation, collaboration
  metrics.py           # evaluation metrics
  utils.py             # I/O and utility helpers
scripts/
  train.py             # training entry point
  predict.py           # inference entry point
configs/
  lagmid.yaml          # paper-style configuration

Data Format

The input directory must contain two JSONL files.

nodes.jsonl

Each line is one publication node:

{"node_id":"P1","title":"...","abstract":"..."}

edges.jsonl

Each line is one directed citation edge:

{"edge_id":"E1","source":"P4","target":"P1","claim_text":"...","label":0,"split":"train"}

Field definitions:

  • label = 1: miscitation
  • label = 0: valid citation
  • split: train, val, or test
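Before training, it can help to check that both files follow the format above. The following is a minimal sketch using only the standard library. The helper names (read_jsonl, validate) are hypothetical and are not part of lagmid/data.py, and the checks reflect only the fields listed in this README.

```python
import json

NODE_FIELDS = {"node_id", "title", "abstract"}
EDGE_FIELDS = {"edge_id", "source", "target", "claim_text", "label", "split"}
VALID_LABELS = {0, 1}
VALID_SPLITS = {"train", "val", "test"}


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def validate(nodes, edges):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    node_ids = set()
    for i, node in enumerate(nodes):
        missing = NODE_FIELDS - set(node)
        if missing:
            problems.append(f"node #{i}: missing {sorted(missing)}")
        else:
            node_ids.add(node["node_id"])
    for edge in edges:
        eid = edge.get("edge_id", "?")
        missing = EDGE_FIELDS - set(edge)
        if missing:
            problems.append(f"edge {eid}: missing {sorted(missing)}")
            continue
        for end in ("source", "target"):
            if edge[end] not in node_ids:
                problems.append(f"edge {eid}: unknown {end} {edge[end]!r}")
        if edge["label"] not in VALID_LABELS:
            problems.append(f"edge {eid}: label must be 0 or 1")
        if edge["split"] not in VALID_SPLITS:
            problems.append(f"edge {eid}: split must be train, val, or test")
    return problems


# In-memory demo; with real data use read_jsonl("<data_dir>/nodes.jsonl") etc.
nodes = [
    {"node_id": "P1", "title": "...", "abstract": "..."},
    {"node_id": "P4", "title": "...", "abstract": "..."},
]
edges = [
    {"edge_id": "E1", "source": "P4", "target": "P1",
     "claim_text": "...", "label": 0, "split": "train"},
    {"edge_id": "E2", "source": "P4", "target": "P9",
     "claim_text": "...", "label": 2, "split": "dev"},
]
problems = validate(nodes, edges)
for line in problems:
    print(line)
```

In the demo, edge E1 passes and edge E2 is reported three times: for an unknown target node, an invalid label, and an invalid split.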

Environment

Install dependencies:

pip install -r requirements.txt

Training

Edit configs/lagmid.yaml first, especially:

  • data.data_dir
  • encoder.hf_model_name
  • teacher.hf_model_name
  • encoder.device and teacher.device
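A quick pre-flight check can confirm that these settings are filled in. The sketch below assumes the dotted names correspond to nested mappings in the YAML file and that the file has already been parsed into a Python dict (for example with PyYAML's yaml.safe_load). The helper names and all example values are hypothetical placeholders, not the repository's defaults.

```python
REQUIRED_KEYS = [
    "data.data_dir",
    "encoder.hf_model_name",
    "teacher.hf_model_name",
    "encoder.device",
    "teacher.device",
]


def get_dotted(config, dotted_key):
    """Look up 'a.b' in a nested dict; return None if any part is absent."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def missing_keys(config):
    """Return the required keys that are absent or set to an empty string."""
    return [k for k in REQUIRED_KEYS if get_dotted(config, k) in (None, "")]


# Hypothetical parsed config; the nesting and the values are assumptions.
config = {
    "data": {"data_dir": "data/my_corpus"},
    "encoder": {"hf_model_name": "allenai/scibert_scivocab_uncased",
                "device": "cuda:0"},
    "teacher": {"hf_model_name": "", "device": "cuda:0"},
}
unset = missing_keys(config)
print(unset)  # keys that still need to be edited before training
```

Here the teacher model name is left empty on purpose, so it is the only key reported.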

Then run:

python scripts/train.py --config configs/lagmid.yaml

Training outputs are written to train.project_dir, including:

  • teacher_cache.json
  • best_model.pt
  • history.json
  • metrics.json

Inference

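Run inference with a trained checkpoint:

python scripts/predict.py --config configs/lagmid.yaml --checkpoint runs/lagmid/best_model.pt

The script prints one miscitation score for each citation edge. The checkpoint path in this example assumes train.project_dir is runs/lagmid; adjust it to match your configuration.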

Citation

If you find this code useful in your research, please consider citing the following paper:

@article{wu2026detecting,
  title={Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning},
  author={Wu, Huidong and Xiang, Haojia and Gao, Jingtong and Zhao, Xiangyu and Wu, Dengsheng and Li, Jianping},
  journal={arXiv preprint arXiv:2603.12290},
  year={2026}
}
