Research Log: Evaluation Dilemma & Proxy Metrics Alternative — Feb 18, 2026
1. Background
Today I discussed with Claude a key issue in the evaluation process:
under the current setup it is difficult to present and compare experimental results meaningfully.
The core problem is the baseline model, Qwen2.5-Coder-7B-Instruct, which achieves only a 1.6% pass rate on the full Verify set. At that rate, evaluating on the Verify 50 subset yields roughly 0.8 expected successes, so observing 0 successes is the most likely single outcome and tells us little.
Even after fine-tuning (e.g., on the Random 500 or TopQ 500 subsets), the expected number of successes on Verify 50 is only around 1.5–2. The confidence intervals of baseline and fine-tuned models overlap heavily, meaning:
- It is difficult to observe statistically significant differences
- Scaling beyond the Verify 50 subset to the full Verify set for more statistical power incurs uncontrolled computational cost
As a result, the current evaluation framework lacks discriminative power under small-sample conditions.
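The overlap can be checked directly with a binomial calculation. A minimal sketch below uses 95% Wilson score intervals for the 1.6% baseline and a hypothetical fine-tuned model at 2/50 (the success counts are illustrative, taken from the 1.5–2 expectation above, not measured results):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

n = 50
baseline_rate = 0.016                    # 1.6% pass rate on the full Verify set
expected_baseline = n * baseline_rate    # ~0.8 expected successes on Verify 50
p_zero = (1 - baseline_rate) ** n        # P(0 successes at baseline) ≈ 0.45

# Illustrative counts: ~1 success for baseline, 2 for a fine-tuned model
lo_base, hi_base = wilson_interval(1, n)
lo_ft, hi_ft = wilson_interval(2, n)
print(f"P(0 successes at baseline) = {p_zero:.2f}")
print(f"baseline   95% CI: [{lo_base:.3f}, {hi_base:.3f}]")
print(f"fine-tuned 95% CI: [{lo_ft:.3f}, {hi_ft:.3f}]")  # intervals overlap heavily
```

The intervals span almost the same range, which is exactly the lack of discriminative power described above.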
2. Alternative Approach: Proxy Metrics
To address this limitation, Claude proposed an alternative solution: adopting proxy metrics for model evaluation.
Key Ideas:
- Avoid end-to-end SWE-bench evaluation
- Use held-out trajectory perplexity loss as the primary metric:
  - No agent scaffold required
  - Does not depend on the full SWE-bench pipeline
  - Can be completed within a few hours on a single A100
This approach significantly reduces evaluation cost while improving statistical stability and comparability.
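The proxy metric itself is straightforward: mean per-token negative log-likelihood (and its exponential, perplexity) of the model on held-out trajectories. A sketch below uses Hugging Face `transformers`; the checkpoint name and the `heldout.jsonl` file format are assumptions for illustration, not part of the actual setup:

```python
# Proxy metric sketch: mean per-token NLL / perplexity on held-out trajectories.
import json
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed baseline checkpoint


def heldout_perplexity(model, tokenizer, texts, max_len=4096, device="cuda"):
    """Token-weighted mean NLL and perplexity over a list of trajectory texts."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt",
                            truncation=True, max_length=max_len).input_ids.to(device)
            # labels=input_ids → HF returns mean cross-entropy over shifted tokens
            loss = model(ids, labels=ids).loss
            n = ids.size(1) - 1  # number of predicted tokens after the shift
            total_nll += loss.item() * n
            total_tokens += n
    mean_nll = total_nll / total_tokens
    return mean_nll, math.exp(mean_nll)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL, torch_dtype=torch.bfloat16, device_map="cuda")
    # heldout.jsonl: one {"text": ...} rendered trajectory per line (assumed format)
    texts = [json.loads(line)["text"] for line in open("heldout.jsonl")]
    nll, ppl = heldout_perplexity(model, tokenizer, texts)
    print(f"held-out NLL/token: {nll:.4f}  perplexity: {ppl:.2f}")
```

Because the metric averages over every token of every held-out trajectory rather than over 50 binary pass/fail outcomes, small differences between checkpoints become measurable.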
3. Current Progress
- Base evaluation code has been implemented
- The solution appears feasible
- Further validation and refinement will follow