Official code release for the PEFT-Arena paper:
PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective
Yangyi Huang, Ruotian Peng, Zeju Qiu, Jiale Kang, Yandong Wen, Bernhard Schölkopf, Weiyang Liu
Project page: https://spherelab.ai/PEFT-Arena
GitHub: https://github.com/Sphere-AI-Lab/PEFT-Arena
Project site source: docs/
PEFT-Arena studies parameter-efficient finetuning through the stability-plasticity dilemma: how much a post-trained model improves on the target domain, and how much of its pretrained general capability it retains.
This repository contains the official training, evaluation, and analysis code for the paper's main experimental workflows:
- supervised finetuning (SFT)
- reinforcement learning with verifiable rewards (RLVR / GRPO)
- evaluation on target-domain and general-retention benchmarks
- spectral retention-adaptation profiling and plotting
The benchmark covers two target domains:
- mathematical reasoning
- medical reasoning
and measures general capability retention on:
- BBH
- IFEval
- NQ
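The two axes above can be made concrete with a toy scoring sketch. The benchmark names and numbers below are illustrative only (not the paper's results), and the metric definitions are one simple choice, not necessarily the ones PEFT-Arena reports: plasticity as the target-domain gain, stability as the average retention ratio on general benchmarks.

```python
# Toy stability-plasticity accounting. All scores are made up;
# the metric definitions are illustrative, not the paper's.
def stability_plasticity(base_scores, ft_scores, target, general):
    """Plasticity: absolute gain on the target benchmark.
    Stability: mean finetuned/base score ratio on general benchmarks."""
    plasticity = ft_scores[target] - base_scores[target]
    stability = sum(ft_scores[g] / base_scores[g] for g in general) / len(general)
    return plasticity, stability

base = {"math500": 42.0, "bbh": 60.0, "ifeval": 55.0, "nq": 30.0}
ft   = {"math500": 68.0, "bbh": 54.0, "ifeval": 50.0, "nq": 27.0}
p, s = stability_plasticity(base, ft, "math500", ["bbh", "ifeval", "nq"])
# Large p with s near 1.0 would indicate adaptation with little forgetting.
```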
The paper reports experiments on Qwen2.5-7B and Llama3.2-3B-Instruct, comparing full finetuning and representative PEFT families including LoRA variants, OFT, IA3, VeRA, MiSS, and KeepLoRA.
The release covers:
- post-training with SFT and RLVR
- evaluation on target-domain and general-retention benchmarks
- spectral analysis and figure generation used in the paper
- checkpoints and data
- A unified benchmark for evaluating PEFT beyond downstream accuracy alone.
- A stability-plasticity view of SFT and RLVR post-training.
- Spectral analysis tools for studying retention and adaptation structure in weight updates.
- Reproducible training and evaluation entrypoints through a single CLI in `run.py`.
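The spectral profiling idea can be illustrated with a minimal numpy sketch: inspect the singular value spectrum of the weight update ΔW = W_ft − W_base. This is a toy stand-in for `tools/spectral_analysis.py`, not the repo's implementation; the matrices are random and the "finetuning" delta is a simulated rank-4 LoRA-style update.

```python
import numpy as np

# Toy spectral retention-adaptation profile: how much of the weight
# update's energy is concentrated in a few directions?
rng = np.random.default_rng(0)
W_base = rng.standard_normal((64, 64))

# Simulate a low-rank finetuning update (rank-4, LoRA-style delta).
B = rng.standard_normal((64, 4))
A = rng.standard_normal((4, 64))
W_ft = W_base + 0.05 * (B @ A)

# Singular value spectrum of the update.
s = np.linalg.svd(W_ft - W_base, compute_uv=False)
rank = int(np.sum(s > 1e-8 * s[0]))                 # numerical rank of dW
energy_top4 = float(np.sum(s[:4] ** 2) / np.sum(s ** 2))
# For a true low-rank method, nearly all update energy sits in the top
# directions; full finetuning typically spreads it across the spectrum.
```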
Key components:
- `run.py`: unified CLI for training, evaluation, and adapter merge
- `train/`: SFT and RL training wrappers plus PEFT-Arena-owned trainer code
- `eval/`: math, medical, and general evaluation pipelines
- `tools/`: checkpoint preparation, merge, spectral analysis, and plotting
- `third_party/math_eval` and `third_party/med_eval`: bundled target-domain evaluation code
- `third_party/opencompass` and `third_party/verl`: external dependencies used by general evaluation and RL training
- `docs/`: project website / GitHub Pages source
Run commands from the repository root.
You should start from a Python environment with a compatible CUDA / PyTorch stack. The release setup script installs the PEFT-Arena-side dependencies on top of that environment.
Typical setup:
```bash
bash setup_env.sh
```
This script:
- validates and backfills required `third_party/` components
- installs training and evaluation dependencies
- installs the patched `math_eval/latex2sympy` package with `antlr4-python3-runtime==4.9.3`
- installs OpenCompass, VeRL, `vllm`, `human-eval`, and `evalplus`
If you only want to fetch or validate the third-party trees:
```bash
bash setup_third_party.sh
```
Notes:
- `third_party/math_eval` and `third_party/med_eval` are included in this release.
- `third_party/human-eval` is not tracked in git; `setup_third_party.sh` will copy or fetch it when needed.
- OpenCompass benchmark data is not bundled. Some datasets download automatically on first use, while others, such as IFEval and `mmlu`, still require local dataset preparation under `third_party/opencompass/data`.
- Math: `math500`, `amc23`, `aime24`
- Medical: packaged `med_eval` benchmark set covering the medical QA and reasoning tasks used in the paper
- General: `bbh`, `ifeval_nq`
- Extended wrappers: `humaneval`, `hellaswag`, `winogrande`, `mmlu`, `arc`, `gsm8k`, and `xcopa`
- SFT
  - math data: filtered 50k samples from OpenR1-Math
  - medical data: 23k samples from MedThink
- RLVR
  - PEFT-Arena RL training with GRPO
  - the release code keeps the RL dataset and async rollout / agent-loop path aligned with the current `peft_arena` implementation
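The group-relative advantage at the heart of GRPO can be sketched in a few lines: each sampled completion's verifiable reward is normalized against the other completions for the same prompt, so no learned value model is needed. This is a toy illustration; the actual RL path in this repo lives in `third_party/verl`.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: z-score each completion's reward
    within its own prompt group. `eps` guards against zero std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# One prompt, four sampled completions, binary verifiable rewards
# (e.g. 1.0 if the math answer checks out, 0.0 otherwise).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get positive advantage, incorrect ones negative,
# and advantages within a group sum to zero.
```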
Train with SFT:
```bash
python run.py train sft \
    --model Qwen/Qwen2.5-7B \
    --adapter lora \
    --lora_rank 16 \
    --lora_alpha 32 \
    --data_train data/openr1-50k/train.parquet \
    --data_val data/openr1-50k/test.parquet \
    --output_dir checkpoints/sft/math/qwen2.5-7b/lora-r16
```
Train with RLVR (GRPO):
```bash
python run.py train rl \
    --model Qwen/Qwen2.5-7B \
    --adapter oft \
    --oft_block_size 32 \
    --data_train data/openr1-50k/train.parquet \
    --data_val data/openr1-50k/test.parquet \
    --output_dir checkpoints/rl/math/qwen2.5-7b/oft-b32
```
Evaluate one checkpoint on all supported domains:
```bash
python run.py eval \
    --checkpoint_path checkpoints/sft/math/qwen2.5-7b/lora-r16/global_step_780 \
    --domain all
```
Evaluate general-retention benchmarks only:
```bash
python run.py eval \
    --checkpoint_path checkpoints/sft/med/qwen2.5-7b/oft-b16/global_step_364 \
    --domain general \
    --benchmarks bbh,ifeval_nq,humaneval
```
Convenience wrappers remain available:
```bash
bash eval/eval_math.sh --checkpoint_path <ckpt>
bash eval/eval_med.sh --checkpoint_path <ckpt>
bash eval/eval_general.sh --checkpoint_path <ckpt>
```
Prepare a training checkpoint for evaluation:
```bash
python tools/prepare_eval_checkpoint.py \
    --checkpoint_path checkpoints/sft/med/qwen2.5-7b/oft-b16/global_step_364
```
Merge a PEFT adapter into a standalone Hugging Face checkpoint:
```bash
python run.py merge \
    --adapter_path checkpoints/sft/math/qwen2.5-7b/lora-r16/global_step_780 \
    --output_path checkpoints/sft/math/qwen2.5-7b/lora-r16/global_step_780_merged
```
Summarize evaluation results:
```bash
python eval/summarize_results.py --results_dir results
```
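What merging does for a LoRA-style adapter is easy to state mathematically: fold the scaled low-rank update into the frozen base weight, leaving a plain dense checkpoint. A hedged numpy sketch under the standard LoRA parameterization (this is not the repo's merge code; use `run.py merge` for real checkpoints):

```python
import numpy as np

def merge_lora(W, A, B, rank, alpha):
    """W: (out, in); B: (out, r); A: (r, in); LoRA scaling = alpha / r.
    Returns the dense merged weight W + (alpha / r) * B @ A."""
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
B = rng.standard_normal((16, 2))
A = rng.standard_normal((2, 16))
W_merged = merge_lora(W, A, B, rank=2, alpha=32)

# The merged delta is exactly the scaled low-rank update: rank <= r.
delta = W_merged - W
```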
```bash
python eval/extract_new_benchmark_summary.py \
    --input results/summary.csv \
    --output results/new_benchmark_summary.csv
```
Run spectral analysis for one base / finetuned pair:
```bash
python tools/spectral_analysis.py \
    --base_model Qwen/Qwen2.5-7B \
    --finetuned_model checkpoints/sft/math/qwen2.5-7b/lora-r8/global_step_780 \
    --output_dir analysis/math/sft/qwen2.5-7b/lora-r8/global_step_780 \
    --layers 18 \
    --modules down_proj
```
Plot the analysis outputs:
```bash
python tools/plot_spectral_analysis.py \
    --input_dirs analysis/math/sft/qwen2.5-7b/full/global_step_780 analysis/math/sft/qwen2.5-7b/oft-b32/global_step_780 analysis/math/sft/qwen2.5-7b/lora-r8/global_step_780 \
    --labels SFT-FullFT SFT-OFT-b32 SFT-LoRA-r8 \
    --output_dir analysis/plot_sft_spectrum \
    --plot_type curves \
    --layer_names "model.layers.18.mlp.down_proj.weight" \
    --log_scale
```
Composite and utility plots:
```bash
python tools/plot_spectral_composite.py \
    --output analysis/plot_composite/composite_layer18_mlp_down_proj.pdf
python tools/plot_spectral_method_grid.py \
    --output analysis/plot_composite/method_grid_layer18_mlp_down_proj.pdf
python tools/compare_model_norms.py \
    --base-model Qwen/Qwen2.5-7B \
    --sft-model checkpoints/sft/math/qwen2.5-7b/lora-r8/global_step_780 \
    --output analysis/model_norms/qwen2.5-7b_lora-r8
```
Repository structure:
```
peft_arena_release/
├── run.py
├── setup_env.sh
├── setup_third_party.sh
├── configs/
├── docs/
├── train/
├── eval/
├── tools/
├── tests/
└── third_party/
```
If you find PEFT-Arena useful in your research, please cite:
```bibtex
@misc{huang2026peftarena,
      title={PEFT-Arena: Understanding Parameter-Efficient Finetuning from a Stability-Plasticity Perspective},
      author={Yangyi Huang and Ruotian Peng and Zeju Qiu and Jiale Kang and Yandong Wen and Bernhard Sch\"olkopf and Weiyang Liu},
      year={2026},
}
```
This repository is released under the MIT License. See LICENSE.
