Implementation code for the paper "Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning".
LAGMiD addresses miscitation detection on the scholarly web by combining three components in a unified framework:
- LLM-based evidence-chain reasoning over text-rich citation graphs.
- LLM-to-GNN knowledge distillation with hop-level alignment.
- Iterative collaborative learning driven by predictive uncertainty.
Given a citation edge, LAGMiD first extracts a multi-hop evidence chain from the citation graph. The teacher LLM then performs stepwise verification and chain-of-thought reasoning over this evidence chain. The resulting hop-level reasoning states are distilled into a student GNN, whose edge representations are trained for scalable miscitation detection. During training, uncertainty-based routing selectively sends difficult cases to the teacher, which improves efficiency while preserving semantic supervision.
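The uncertainty-based routing step can be illustrated with a minimal sketch. It assumes the student outputs one miscitation probability per edge, that uncertainty is measured as the binary entropy of that probability, and that a fixed budget of edges is sent to the teacher per round. The names `binary_entropy`, `route_to_teacher`, and `budget` are hypothetical and are not taken from this repository; the actual criterion is defined in trainer.py and the paper.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in nats) of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def route_to_teacher(student_probs: dict[str, float], budget: int) -> list[str]:
    """Return the edge ids with the highest predictive entropy, up to `budget`.

    Assumption: a fixed per-round budget; a threshold on entropy would be an
    equally plausible routing rule.
    """
    ranked = sorted(
        student_probs,
        key=lambda edge_id: binary_entropy(student_probs[edge_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical student probabilities for four citation edges.
probs = {"E1": 0.97, "E2": 0.52, "E3": 0.10, "E4": 0.41}
hard_edges = route_to_teacher(probs, budget=2)
print(hard_edges)  # ['E2', 'E4']
```

Edges whose probability lies close to 0.5 are ranked first, so confident predictions stay with the student and never trigger an LLM call.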
lagmid/
    config.py          # configuration loading and validation
    data.py            # citation graph dataset
    evidence.py        # evidence-chain extraction
    prompts.py         # teacher prompts for verify / CoT / judge
    teacher.py         # causal LLM teacher
    text_backends.py   # SciBERT-based node/edge encoder
    model.py           # student GNN and edge classifier
    trainer.py         # training, distillation, collaboration
    metrics.py         # evaluation metrics
    utils.py           # I/O and utility helpers
scripts/
    train.py           # training entry point
    predict.py         # inference entry point
configs/
    lagmid.yaml        # paper-style configuration
The input directory must contain two JSONL files: one for publication nodes and one for citation edges.
In the node file, each line is one publication node:
{"node_id":"P1","title":"...","abstract":"..."}
In the edge file, each line is one directed citation edge:
{"edge_id":"E1","source":"P4","target":"P1","claim_text":"...","label":0,"split":"train"}
Field definitions:
- label = 1: miscitation
- label = 0: valid citation
- split: train, val, or test
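Before training, it can help to check that the input files follow this schema. The sketch below parses JSONL lines and validates the fields listed above. It is not the loader in data.py: the helper names `parse_jsonl` and `check_edges` are hypothetical, and treating every field as required is an assumption.

```python
import json

NODE_FIELDS = {"node_id", "title", "abstract"}
EDGE_FIELDS = {"edge_id", "source", "target", "claim_text", "label", "split"}
SPLITS = {"train", "val", "test"}


def parse_jsonl(lines, required):
    """Parse JSONL lines into dicts; raise if a required field is missing."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = required - set(record)
        if missing:
            raise ValueError(f"line {lineno}: missing fields {sorted(missing)}")
        records.append(record)
    return records


def check_edges(edges, node_ids):
    """Check label values, split names, and that both endpoints are known nodes."""
    for edge in edges:
        if edge["label"] not in (0, 1):
            raise ValueError(f"{edge['edge_id']}: label must be 0 or 1")
        if edge["split"] not in SPLITS:
            raise ValueError(f"{edge['edge_id']}: unknown split {edge['split']!r}")
        for endpoint in ("source", "target"):
            if edge[endpoint] not in node_ids:
                raise ValueError(f"{edge['edge_id']}: unknown {endpoint} node")


# In-memory stand-ins for the two files; real use would pass open file handles.
node_lines = [
    '{"node_id":"P1","title":"...","abstract":"..."}',
    '{"node_id":"P4","title":"...","abstract":"..."}',
]
edge_lines = [
    '{"edge_id":"E1","source":"P4","target":"P1","claim_text":"...","label":0,"split":"train"}',
]

nodes = parse_jsonl(node_lines, NODE_FIELDS)
edges = parse_jsonl(edge_lines, EDGE_FIELDS)
check_edges(edges, {n["node_id"] for n in nodes})
print(len(nodes), "nodes,", len(edges), "edges")
```

Because `parse_jsonl` accepts any iterable of strings, the same call works on an open file object.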
Install dependencies:
pip install -r requirements.txt
Edit configs/lagmid.yaml first, especially:
- data.data_dir
- encoder.hf_model_name
- teacher.hf_model_name
- encoder.device and teacher.device
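A quick pre-flight check can confirm that these settings are present before a long run starts. The sketch below assumes the dotted names map to nested keys in the YAML file and that the file has already been parsed into a Python dict; the helpers `get_dotted` and `missing_keys` are hypothetical, and the values shown are placeholders rather than recommended settings.

```python
REQUIRED_KEYS = [
    "data.data_dir",
    "encoder.hf_model_name",
    "teacher.hf_model_name",
    "encoder.device",
    "teacher.device",
]


def get_dotted(cfg: dict, dotted_key: str):
    """Look up a nested value such as cfg['data']['data_dir'] via 'data.data_dir'."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(dotted_key)
        node = node[part]
    return node


def missing_keys(cfg: dict) -> list[str]:
    """Return the required dotted keys that are absent or set to an empty value."""
    missing = []
    for key in REQUIRED_KEYS:
        try:
            value = get_dotted(cfg, key)
        except KeyError:
            missing.append(key)
            continue
        if value in (None, ""):
            missing.append(key)
    return missing


# Placeholder config, standing in for the parsed contents of configs/lagmid.yaml.
cfg = {
    "data": {"data_dir": "path/to/data"},
    "encoder": {"hf_model_name": "<encoder-model-name>", "device": "cpu"},
    "teacher": {"hf_model_name": "<teacher-model-name>", "device": "cpu"},
}
print(missing_keys(cfg))  # []
```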
Then run:
python scripts/train.py --config configs/lagmid.yaml
Training outputs are written to train.project_dir, including:
- teacher_cache.json
- best_model.pt
- history.json
- metrics.json
To run inference with a trained checkpoint:
python scripts/predict.py --config configs/lagmid.yaml --checkpoint runs/lagmid/best_model.pt
The script prints one miscitation score for each citation edge.
If you find this code useful in your research, please consider citing the following paper:
@article{wu2026detecting,
title={Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning},
author={Wu, Huidong and Xiang, Haojia and Gao, Jingtong and Zhao, Xiangyu and Wu, Dengsheng and Li, Jianping},
journal={arXiv preprint arXiv:2603.12290},
year={2026}
}