Trajectory Quality-Aware Data Selection — Experiment Design v3
Base Model: Qwen2.5-Coder-7B-Instruct
Fine-tuning Method: LoRA
Data Source: SWE-trajectory dataset
Date: February 2026
1. Scoring System Design
1.1 Design Principles
The scoring system answers one core question: “If a human expert were doing the same task, how would they evaluate this trajectory?”
A human expert judging a developer’s debugging process focuses on two questions: how smart was the approach (Efficiency), and how clean was the process (Style)? Whether the task was solved correctly (Correctness) and whether the trajectory is complete (Completeness) serve as pre-filtering gates and do not enter the continuous score.
1.2 Gate Conditions (Pre-filtering)
The following conditions do not participate in scoring — they only determine whether a trajectory enters the scoring pool:
| Gate | Condition | Rationale |
|---|---|---|
| Completeness Gate | Truncation Ratio ≥ 0.9 | Nearly constant in the dataset (med=1.0, std≈0), no discriminative power; only cleans a small amount of corrupted data |
| Correctness Gate | Outcome Success = 1 (Resolved) | Binary variables should not be mixed with continuous variables in a weighted average; used as a stratification condition to rank within the resolved pool |
| Format Gate | Trajectory can be parsed into thought-action-observation structure | Malformed trajectories cannot be reliably scored |
Why not include Correctness as a scoring dimension?
Outcome Success is binary (0/1); averaged with continuous dimensions, it dominates the composite. Making it a gate means every trajectory in the scoring pool is resolved, so scoring can focus on “how well the task was completed.”
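The three gates can be sketched as a single predicate. The field names here (`truncation_ratio`, `resolved`, `steps`) are illustrative assumptions about the trajectory schema, not the dataset’s actual keys:

```python
def passes_gates(traj: dict) -> bool:
    """Return True if a trajectory enters the scoring pool (section 1.2).

    Field names are hypothetical; map them to the real schema as needed.
    """
    # Completeness gate: drop noticeably truncated trajectories.
    if traj.get("truncation_ratio", 0.0) < 0.9:
        return False
    # Correctness gate: only resolved trajectories are scored.
    if traj.get("resolved") != 1:
        return False
    # Format gate: every step must parse into thought/action/observation.
    steps = traj.get("steps", [])
    if not steps:
        return False
    return all({"thought", "action", "observation"} <= step.keys() for step in steps)
```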
1.3 Continuous Scoring Dimensions (4 dimensions)
After filtering, 32,161 resolved trajectories remain. They are scored along 4 dimensions:
Efficiency — Is the path to the goal concise?
| Sub-dimension | Definition | Scoring Method | Distribution |
|---|---|---|---|
| B2: Error-Retry Cycles | Cost of repeated retries after errors | Count “action→error observation→similar action” cycles, normalize and invert | std=0.286, med=0.300, strongest discriminator |
| B3: Step Count Ratio | Reasonableness of step count | This trajectory’s steps / median steps across all resolved trajectories for the same task, clipped and inverted | std=0.063, med=0.800 |
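The two Efficiency sub-scores can be sketched as follows. The similarity threshold (0.8), error detection by substring, and the normalization cap are illustrative assumptions, not the production constants:

```python
from difflib import SequenceMatcher

def b2_error_retry(steps, max_cycles=10):
    """B2: 1 minus the normalized count of action -> error -> similar-action cycles."""
    cycles = 0
    for prev, curr in zip(steps, steps[1:]):
        error_obs = "error" in prev["observation"].lower()
        similar = SequenceMatcher(None, prev["action"], curr["action"]).ratio() > 0.8
        if error_obs and similar:
            cycles += 1
    return 1.0 - min(cycles, max_cycles) / max_cycles  # more retry cycles -> lower score

def b3_step_ratio(n_steps, median_steps, clip=2.0):
    """B3: clipped ratio of this trajectory's steps to the same-task median, inverted."""
    ratio = min(n_steps / median_steps, clip) / clip  # in (0, 1]
    return 1.0 - ratio  # fewer steps relative to the median -> higher score
```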
Style — How clean is the trajectory as training data?
| Sub-dimension | Definition | Scoring Method | Distribution |
|---|---|---|---|
| C2: Action Diversity | Whether tool usage is reasonably diverse | Entropy of action types, normalized to [0,1] | std=0.046, med=0.655 |
| C3: Observation Utilization | Whether observation information is effectively used | Proportion of filenames (basename) / error class names from observations that are referenced in subsequent actions | std=0.118, med=0.313 |
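The two Style sub-scores can be sketched in the same spirit. The `.py` filename pattern is a simplifying assumption (the real C3 extractor also covers error class names), and the step schema is hypothetical:

```python
import math
import re
from collections import Counter

def c2_action_diversity(action_types):
    """C2: Shannon entropy of action-type frequencies, normalized to [0, 1]."""
    counts = Counter(action_types)
    if len(counts) <= 1:
        return 0.0  # a single action type carries zero entropy
    n = len(action_types)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))  # divide by the maximum possible entropy

def c3_observation_utilization(steps):
    """C3: fraction of filenames surfaced in observations that later appear in actions."""
    seen, used = set(), set()
    for i, step in enumerate(steps):
        for path in re.findall(r"[\w/.-]+\.py", step["observation"]):
            seen.add(path.rsplit("/", 1)[-1])  # basename matching (see section 1.5)
        later_actions = " ".join(s["action"] for s in steps[i + 1:])
        used |= {name for name in seen if name in later_actions}
    return len(used) / len(seen) if seen else 1.0
```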
Excluded Dimensions and Reasons
| Dimension | Reason for Exclusion |
|---|---|
| B1: Redundant Commands | std=0.033, med=0.962; almost no discriminative power. Agents in the dataset rarely execute fully duplicate commands |
| C1: Observation Cleanliness | std=0.043, med=0.967; almost no discriminative power. The vast majority of observations are clean |
Paper framing: We designed 6 candidate sub-dimensions and found through variance analysis that B1 and C1 lack discriminative power on this dataset (std < 0.05), so they were excluded. This itself is a finding — agents in the SWE-trajectory dataset are already highly homogeneous in command redundancy and observation cleanliness.
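The variance screen described above amounts to a one-line check; the threshold (std < 0.05) is taken from the text:

```python
import statistics

def discriminative(scores, min_std=0.05):
    """Keep a sub-dimension only if its scores spread enough to rank trajectories."""
    return statistics.stdev(scores) >= min_std
```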
1.4 Score Aggregation
```
Efficiency = mean(B2, B3)                    # std=0.152, med=0.529
Style      = mean(C2, C3)                    # std=0.063, med=0.485
Composite  = 0.5 × Efficiency + 0.5 × Style  # std=0.083, med=0.507
```
1.5 C3 Implementation Details
The initial C3 implementation used full path matching (e.g., src/utils.py), causing matches to fail when agents referenced utils.py (without path prefix), resulting in a median of only 0.201. After switching to basename matching, the median improved to 0.313. The remaining low utilization reflects a common “read but don’t use” problem among agents, which is itself a noteworthy finding.
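The matching change can be illustrated directly; both functions are sketches of the behavior described above, not the actual implementation:

```python
def full_path_match(observed_path: str, action_text: str) -> bool:
    """Original C3 behavior: require the full observed path in the action."""
    return observed_path in action_text

def basename_match(observed_path: str, action_text: str) -> bool:
    """Fixed C3 behavior: match on the basename only."""
    basename = observed_path.rsplit("/", 1)[-1]
    return basename in action_text
```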
2. Experimental Group Design
2.1 Overview (13 groups + 1 baseline)
| # | Experiment | Sample Pool | Selection Method | Sample Size | Block |
|---|---|---|---|---|---|
| 0 | baseline | — | No fine-tuning | — | — |
| 1 | Random-500 | Full | Random | 500 | Block 1 |
| 2 | Random-1000 | Full | Random | 1000 | Block 1 |
| 3 | TopQ-500 | Resolved | Composite score top | 500 | Block 1 |
| 4 | TopQ-1000 | Resolved | Composite score top | 1000 | Block 1 |
| 5 | ResolvedOnly-500 | Resolved | Random | 500 | Block 1 |
| 6 | ResolvedOnly-1000 | Resolved | Random | 1000 | Block 1 |
| 7 | BottomQ-500 | Resolved | Composite score bottom | 500 | Block 1 |
| 8 | Ablation-NoEfficiency-500 | Resolved | Style only | 500 | Block 2 |
| 9 | Ablation-NoStyle-500 | Resolved | Efficiency only | 500 | Block 2 |
| 10 | Ablation-NoB2-500 | Resolved | Efficiency=B3 only | 500 | Block 3 |
| 11 | Ablation-NoB3-500 | Resolved | Efficiency=B2 only | 500 | Block 3 |
| 12 | Ablation-NoC2-500 | Resolved | Style=C3 only | 500 | Block 3 |
| 13 | Ablation-NoC3-500 | Resolved | Style=C2 only | 500 | Block 3 |
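The sample pools in the table above can be drawn with a single selector. The `"composite"` field name and the fixed seed are assumptions for illustration:

```python
import random

def select(trajectories, method: str, n: int, seed: int = 0):
    """Draw n trajectories by rank ("top"/"bottom") or at random ("random")."""
    if method == "random":
        return random.Random(seed).sample(trajectories, n)
    ranked = sorted(trajectories, key=lambda t: t["composite"], reverse=True)
    if method == "top":
        return ranked[:n]       # TopQ groups
    if method == "bottom":
        return ranked[-n:]      # BottomQ sanity-check group
    raise ValueError(f"unknown method: {method}")
```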
2.2 Research Questions by Block
Block 1: Data Volume vs. Strategy Comparison (7 groups)
| Comparison | Research Question |
|---|---|
| exp1 vs exp5 | Does the gate help? Full random vs. resolved-pool random |
| exp5 vs exp3 | Does scoring help? Resolved random vs. resolved ranked |
| exp1→exp2, exp3→exp4, exp5→exp6 | Data scaling effect: improvement from 500→1000 across three strategies |
| exp3 vs exp7 | Sanity check: Best vs. worst to validate the scoring system |
Block 2: High-level Dimension Ablation (2 groups)
| Comparison | Research Question |
|---|---|
| exp8 vs exp9 vs exp3 | Efficiency vs. Style — which matters more? Individual vs. combined |
Block 3: Sub-dimension Ablation (4 groups)
| Comparison | Research Question |
|---|---|
| exp10 vs exp11 vs exp3 | Within Efficiency: Error-Retry Cycles vs. Step Count Ratio |
| exp12 vs exp13 vs exp3 | Within Style: Action Diversity vs. Observation Utilization |
2.3 Reusable Prior Experiments
| Experiment | Reusable? | Reason |
|---|---|---|
| baseline | ✅ Yes | No fine-tuning |
| Random-500 (exp1) | ✅ Yes | Random sampling is independent of the scoring system |
| Random-1000 (exp2) | ✅ Yes | Same reason |
| All others | ❌ Retrain | Changes in the scoring formula lead to different selected samples |
Actual new training needed: 11 groups
3. Evaluation Protocol
3.1 Perplexity Evaluation (Cross-Entropy Loss)
Compute average cross-entropy loss on assistant tokens across three independent test sets:
| Test Set | Samples | Source |
|---|---|---|
| Gold | 200 | Trajectories with highest new composite score |
| Random | 200 | Randomly sampled |
| Low-Q | 200 | Trajectories with lowest new composite score |
Note: Test sets must also be reconstructed based on the new scoring to ensure Gold/Low-Q reflect the quality ranking under the new system.
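The metric itself reduces to a masked mean: average negative log-likelihood over assistant tokens only, with user/system tokens excluded. Token-level inputs here are illustrative; a real run would obtain log-probabilities from the model:

```python
def assistant_ce_loss(token_logprobs, is_assistant_mask):
    """Mean cross-entropy over the positions where the assistant-token mask is True."""
    losses = [-lp for lp, m in zip(token_logprobs, is_assistant_mask) if m]
    return sum(losses) / len(losses)
```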
3.2 Expected Validation
All models are expected to exhibit the loss gradient Gold < Random < Low-Q; this ordering validates the new scoring system’s soundness.
4. Paper Narrative Arc
Layer 1: Does fine-tuning help at all? baseline vs. any fine-tuned model
Layer 2: Does the gate help? Random-500 vs. ResolvedOnly-500
Layer 3: Does scoring help? ResolvedOnly-500 vs. TopQ-500
Layer 4: Data volume vs. quality? 500→1000 scaling curve (three strategies)
Layer 5: Which high-level dimension matters? EfficiencyOnly vs. StyleOnly vs. TopQ
Layer 6: Which sub-dimension matters? Sub-dimension ablation (B2/B3/C2/C3)
Validation: Is the scoring system valid? TopQ vs. BottomQ + test set quality gradient
5. Design Decision Log
| Decision | Choice | Alternative | Rationale |
|---|---|---|---|
| Truncation Ratio handling | Gate (filter) | As a scoring dimension | std≈0, no discriminative power |
| Outcome Success handling | Gate (stratification) | As a continuous scoring dimension | Binary variables should not be weighted-averaged with continuous variables |
| B1/C1 exclusion | Removed from composite | Rank normalization to force spread | Rank normalization amplifies noise; removal is more honest |
| C3 file matching | Basename matching | Full path matching | Agents often omit path prefixes; full matching causes systematic underscoring of C3 |
| Aggregation method | Hierarchical mean + equal weights | Weighted average / learned weights | Equal weights as default; weight differences can be revealed indirectly through ablation |
| Data scale | 500 / 1000 | 2000 / 5000 | GPU resource constraints |