LAGMiD

Implementation code for the paper "Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning".

LAGMiD addresses miscitation detection on the scholarly web by combining three components in a unified framework:

LLM-based evidence-chain reasoning over text-rich citation graphs.
LLM-to-GNN knowledge distillation with hop-level alignment.
Iterative collaborative learning driven by predictive uncertainty.

Overview

Given a citation edge, LAGMiD first extracts a multi-hop evidence chain from the citation graph. The teacher LLM then performs stepwise verification and chain-of-thought reasoning over this evidence chain. The resulting hop-level reasoning states are distilled into a student GNN, whose edge representations are trained for scalable miscitation detection. During training, uncertainty-based routing selectively sends difficult cases to the teacher, which improves efficiency while preserving semantic supervision.
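The routing step described above can be illustrated with a minimal sketch. This README does not state which uncertainty measure or threshold LAGMiD uses, so the binary entropy of the student's predicted miscitation probability and the cutoff of 0.9 bits are assumptions made for illustration. The names `binary_entropy` and `route_to_teacher` are hypothetical and are not taken from `trainer.py`.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def route_to_teacher(student_probs: dict, threshold: float = 0.9) -> list:
    """Return the edge ids whose student prediction is uncertain.

    Assumption: an edge counts as "difficult" when the entropy of the
    student's miscitation probability reaches the threshold.
    """
    return [
        edge_id
        for edge_id, prob in student_probs.items()
        if binary_entropy(prob) >= threshold
    ]


# Hypothetical student outputs: edge_id -> predicted miscitation probability.
probs = {"E1": 0.02, "E2": 0.55, "E3": 0.97, "E4": 0.40}
uncertain = route_to_teacher(probs)
print(uncertain)  # ['E2', 'E4']
```

Confident predictions (E1, E3) stay with the student GNN, while only the near-0.5 cases would be passed to the teacher LLM, which is where the efficiency gain comes from.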

Repository Structure

lagmid/
  config.py            # configuration loading and validation
  data.py              # citation graph dataset
  evidence.py          # evidence-chain extraction
  prompts.py           # teacher prompts for verify / CoT / judge
  teacher.py           # causal LLM teacher
  text_backends.py     # SciBERT-based node/edge encoder
  model.py             # student GNN and edge classifier
  trainer.py           # training, distillation, collaboration
  metrics.py           # evaluation metrics
  utils.py             # I/O and utility helpers
scripts/
  train.py             # training entry point
  predict.py           # inference entry point
configs/
  lagmid.yaml          # paper-style configuration

Data Format

The input directory must contain two JSONL files.

`nodes.jsonl`

Each line is one publication node:

{"node_id":"P1","title":"...","abstract":"..."}

`edges.jsonl`

Each line is one directed citation edge:

{"edge_id":"E1","source":"P4","target":"P1","claim_text":"...","label":0,"split":"train"}

Field definitions:

label = 1: miscitation
label = 0: valid citation
split: train, val, or test
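Before training, it can help to check that the two files agree with the format above. The following is a minimal sketch using only the standard library; `read_jsonl` and `check_dataset` are hypothetical helper names and do not reflect the API of `lagmid/data.py`. The sample records written to the temporary directory are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path

VALID_SPLITS = {"train", "val", "test"}


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_dataset(data_dir):
    """Return a list of problems found in nodes.jsonl and edges.jsonl."""
    data_dir = Path(data_dir)
    nodes = read_jsonl(data_dir / "nodes.jsonl")
    edges = read_jsonl(data_dir / "edges.jsonl")
    node_ids = {n["node_id"] for n in nodes}
    problems = []
    for e in edges:
        eid = e.get("edge_id", "<missing edge_id>")
        # Both endpoints must refer to a known publication node.
        for endpoint in ("source", "target"):
            if e.get(endpoint) not in node_ids:
                problems.append(f"{eid}: unknown {endpoint} {e.get(endpoint)!r}")
        if e.get("label") not in (0, 1):
            problems.append(f"{eid}: label must be 0 or 1")
        if e.get("split") not in VALID_SPLITS:
            problems.append(f"{eid}: split must be train, val, or test")
    return problems


# Demonstration on made-up records written to a temporary directory.
sample_nodes = [
    {"node_id": "P1", "title": "Cited paper", "abstract": "..."},
    {"node_id": "P4", "title": "Citing paper", "abstract": "..."},
]
sample_edges = [
    {"edge_id": "E1", "source": "P4", "target": "P1",
     "claim_text": "...", "label": 0, "split": "train"},
]
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "nodes.jsonl").write_text(
        "\n".join(json.dumps(n) for n in sample_nodes) + "\n", encoding="utf-8")
    (tmp / "edges.jsonl").write_text(
        "\n".join(json.dumps(e) for e in sample_edges) + "\n", encoding="utf-8")
    problems = check_dataset(tmp)

print(problems)  # [] when every edge is well-formed
```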

Environment

Install dependencies:

pip install -r requirements.txt

Training

Edit configs/lagmid.yaml first, especially:

data.data_dir
encoder.hf_model_name
teacher.hf_model_name
encoder.device and teacher.device

Then run:

python scripts/train.py --config configs/lagmid.yaml
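Before launching a long run, the fields listed above can be sanity-checked on the loaded configuration. The sketch below works on a plain nested dictionary, such as the result of parsing the YAML file, and reports dotted keys that are missing or empty. `missing_keys` is a hypothetical helper, not the validation logic in `lagmid/config.py`, and the values in the example dictionary are placeholders rather than recommended settings.

```python
REQUIRED_KEYS = [
    "data.data_dir",
    "encoder.hf_model_name",
    "encoder.device",
    "teacher.hf_model_name",
    "teacher.device",
]


def missing_keys(config: dict, required=REQUIRED_KEYS) -> list:
    """Return the dotted keys that are absent or empty in a nested dict."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                node = None
                break
            node = node[part]
        if node is None or node == "":
            missing.append(dotted)
    return missing


# Placeholder configuration; teacher.device is left out on purpose.
config = {
    "data": {"data_dir": "path/to/data"},
    "encoder": {"hf_model_name": "<encoder-model-id>", "device": "cuda:0"},
    "teacher": {"hf_model_name": "<teacher-model-id>"},
}
print(missing_keys(config))  # ['teacher.device']
```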

Training outputs are written to train.project_dir, including:

teacher_cache.json
best_model.pt
history.json
metrics.json

Inference

Run inference with a checkpoint produced by training (best_model.pt under train.project_dir):

python scripts/predict.py --config configs/lagmid.yaml --checkpoint runs/lagmid/best_model.pt

The script prints one miscitation score for each citation edge.

Citation

If you find this code useful in your research, please consider citing the following paper:

@article{wu2026detecting,
  title={Detecting Miscitation on the Scholarly Web through LLM-Augmented Text-Rich Graph Learning},
  author={Wu, Huidong and Xiang, Haojia and Gao, Jingtong and Zhao, Xiangyu and Wu, Dengsheng and Li, Jianping},
  journal={arXiv preprint arXiv:2603.12290},
  year={2026}
}
