Research Log: Evaluation Dilemma & Proxy Metrics Alternative — Feb 18, 2026


1. Background

Today I discussed with Claude a key issue encountered during evaluation:

Under the current setup, it is difficult to present and compare experimental results effectively.

The core problem lies in the baseline model, Qwen2.5-Coder-7B-Instruct, which achieves only a 1.6% pass rate on the full Verify set. At that rate, the expected number of successes on the 50-instance Verify 50 subset is about 0.8, so obtaining 0 successful results there is statistically expected.
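Under a simple binomial model (independent instances with a fixed per-instance pass rate — an assumption of this sketch, not something measured in the run), the chance of seeing zero successes at the baseline rate can be computed directly:

```python
from math import comb

def prob_k_successes(n: int, p: float, k: int) -> float:
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Baseline pass rate from the log; n = 50 is the Verify 50 subset size.
p_base = 0.016
n = 50

p_zero = prob_k_successes(n, p_base, 0)
print(f"P(0 successes at baseline) = {p_zero:.2f}")  # ≈ 0.45
```

So even a perfectly faithful baseline run comes up empty on Verify 50 almost half the time, which is why a zero there carries essentially no signal.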

Even after fine-tuning (e.g., on Random 500 or TopQ 500), the expected number of successes on Verify 50 is only around 1.5–2. The confidence intervals overlap heavily, meaning:

  • Statistically significant differences between runs are difficult to detect
  • Running the evaluation on the full Verify set incurs uncontrolled computational cost

As a result, the current evaluation framework lacks discriminative power under small-sample conditions.


2. Alternative Approach: Proxy Metrics

To address this limitation, Claude proposed an alternative solution: adopting proxy metrics for model evaluation.

Key Ideas:

  1. Avoid end-to-end SWE-bench evaluation
  2. Use Held-out Trajectory Perplexity Loss as the primary metric
  3. No Scaffold required
  4. Does not depend on the full SWE-bench pipeline
  5. Can be completed within a few hours on a single A100

This approach significantly reduces evaluation cost while improving statistical stability and comparability.
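As a concrete sketch of the proxy metric: given per-token negative log-likelihoods that the model assigns to each held-out trajectory, perplexity is just the exponentiated mean NLL. The NLL values below are toy numbers for illustration; real ones would come from scoring held-out agent trajectories with the fine-tuned model:

```python
import math

def trajectory_perplexity(token_nlls: list[float]) -> float:
    """Perplexity of one held-out trajectory, given per-token
    negative log-likelihoods (natural log) from the model."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def mean_heldout_perplexity(trajectories: list[list[float]]) -> float:
    """Average per-trajectory perplexity across the held-out set."""
    return sum(trajectory_perplexity(t) for t in trajectories) / len(trajectories)

# Toy per-token NLLs for two hypothetical held-out trajectories.
heldout = [[0.9, 1.2, 0.4], [1.1, 0.8, 0.7, 1.0]]
print(f"mean held-out perplexity: {mean_heldout_perplexity(heldout):.3f}")
```

Because this is a continuous score over every token rather than a 0/1 outcome per task, it should separate checkpoints far more reliably at small sample sizes than pass rate does.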


3. Current Progress

  • Base evaluation code has been implemented
  • The solution appears feasible
  • Further validation and refinement will follow