Figure 1: Pass@1 accuracy on AIME 2024–2025 versus the number of SFT examples for various 32B-parameter distilled reasoning models. Notably, QwQ-32B-Distill-32B from our work reaches 63.5% average accuracy using <1K examples, rivaling or exceeding many state-of-the-art large-data baselines. See Table 2 for more details.
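Pass@1 as reported here is the standard pass@k metric evaluated at k=1. A minimal sketch of the unbiased pass@k estimator from Chen et al. (2021) is below; the function name and the example numbers are illustrative, not taken from this paper's evaluation code.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: number of correct samples among them
    k: attempt budget
    """
    if n - c < k:
        return 1.0  # fewer wrong samples than attempts -> guaranteed success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per AIME problem, 9 correct.
# For k=1 the estimator reduces to c/n.
print(pass_at_k(n=16, c=9, k=1))  # 0.5625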
Figure 2: Pass@1 accuracy of student models versus their teacher model’s accuracy on AIME 2025. Each marker represents a specific teacher–student pairing (marker shape denotes the teacher model; color denotes student scale: 1.5B in blue, 7B in orange, 32B in green). Larger students exhibit a much stronger positive correlation between teacher and student performance.
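The correlation claim in Figure 2 can be quantified as a Pearson correlation between teacher and student accuracies at each student scale. A hedged sketch follows; the accuracy arrays are placeholders for illustration only, not the paper's actual measurements.

import numpy as np

# Placeholder accuracies (NOT actual paper data): each index pairs one
# teacher's AIME 2025 accuracy with its distilled student's accuracy.
teacher_acc = np.array([0.40, 0.55, 0.63, 0.70])
student_acc_32b = np.array([0.38, 0.52, 0.60, 0.68])   # 32B students (green)
student_acc_1_5b = np.array([0.20, 0.22, 0.19, 0.24])  # 1.5B students (blue)

# Pearson correlation between teacher and student accuracy per scale.
r_32b = np.corrcoef(teacher_acc, student_acc_32b)[0, 1]
r_1_5b = np.corrcoef(teacher_acc, student_acc_1_5b)[0, 1]
print(f"32B students:  r = {r_32b:.2f}")   # strong positive correlation
print(f"1.5B students: r = {r_1_5b:.2f}")  # much weaker correlation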
# Create and activate a Python 3.12 virtual environment
python3.12 -m venv env
source env/bin/activate
# Install the project's pinned dependencies
pip install -r requirements.txt