Research Log: Evaluation Dilemma & Proxy Metrics Alternative — Feb 18, 2026
1. Background
Today I discussed with Claude a key issue in the evaluation process:
under the current setup it is difficult to present and compare experimental results meaningfully.
The core problem is the baseline model, Qwen2.5-Coder-7B-Instruct, which achieves only a 1.6% pass rate on the full Verify set. At that rate, evaluating on the Verify 50 subset yields roughly 0.8 expected successes, so observing 0 successes is the most likely single outcome and tells us little.
Even after fine-tuning (e.g., on the Random 500 or TopQ 500 subsets), the expected number of successes on Verify 50 is only around 1.5–2. The confidence intervals of baseline and fine-tuned models overlap heavily, meaning:
- It is difficult to observe statistically significant differences
- Scaling beyond the Verify 50 subset to the full Verify set for more statistical power incurs uncontrolled computational cost
As a result, the current evaluation framework lacks discriminative power under small-sample conditions.
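The overlap can be checked directly with a binomial calculation. A minimal sketch below uses 95% Wilson score intervals for the 1.6% baseline and a hypothetical fine-tuned model at 2/50 (the success counts are illustrative, taken from the 1.5–2 expectation above, not measured results):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

n = 50
baseline_rate = 0.016                    # 1.6% pass rate on the full Verify set
expected_baseline = n * baseline_rate    # ~0.8 expected successes on Verify 50
p_zero = (1 - baseline_rate) ** n        # P(0 successes at baseline) ≈ 0.45

# Illustrative counts: ~1 success for baseline, 2 for a fine-tuned model
lo_base, hi_base = wilson_interval(1, n)
lo_ft, hi_ft = wilson_interval(2, n)
print(f"P(0 successes at baseline) = {p_zero:.2f}")
print(f"baseline   95% CI: [{lo_base:.3f}, {hi_base:.3f}]")
print(f"fine-tuned 95% CI: [{lo_ft:.3f}, {hi_ft:.3f}]")  # intervals overlap heavily
```

The intervals span almost the same range, which is exactly the lack of discriminative power described above.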
2. Alternative Approach: Proxy Metrics
To address this limitation, Claude proposed an alternative solution: adopting proxy metrics for model evaluation.
Key Ideas:
- Avoid end-to-end SWE-bench evaluation
- Use held-out trajectory perplexity loss as the primary metric:
  - No agent scaffold required
  - Does not depend on the full SWE-bench pipeline
  - Can be completed within a few hours on a single A100
This approach significantly reduces evaluation cost while improving statistical stability and comparability.
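The proxy metric itself is straightforward: mean per-token negative log-likelihood (and its exponential, perplexity) of the model on held-out trajectories. A sketch below uses Hugging Face `transformers`; the checkpoint name and the `heldout.jsonl` file format are assumptions for illustration, not part of the actual setup:

```python
# Proxy metric sketch: mean per-token NLL / perplexity on held-out trajectories.
import json
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed baseline checkpoint


def heldout_perplexity(model, tokenizer, texts, max_len=4096, device="cuda"):
    """Token-weighted mean NLL and perplexity over a list of trajectory texts."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_len).input_ids.to(device)
            # labels=input_ids → HF returns mean cross-entropy over shifted tokens
            loss = model(ids, labels=ids).loss
            n = ids.size(1) - 1  # number of predicted tokens after the shift
            total_nll += loss.item() * n
            total_tokens += n
    mean_nll = total_nll / total_tokens
    return mean_nll, math.exp(mean_nll)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.bfloat16, device_map="cuda")
    # heldout.jsonl: one {"text": ...} rendered trajectory per line (assumed format)
    texts = [json.loads(line)["text"] for line in open("heldout.jsonl")]
    nll, ppl = heldout_perplexity(model, tokenizer, texts)
    print(f"held-out NLL/token: {nll:.4f}  perplexity: {ppl:.2f}")
```

Because the metric averages over every token of every held-out trajectory rather than over 50 binary pass/fail outcomes, small differences between checkpoints become measurable.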
3. Current Progress
- Base evaluation code has been implemented
- The solution appears feasible
- Further validation and refinement will follow