Trajectory Quality-Aware Data Selection — Experiment Design v3


Base Model: Qwen2.5-Coder-7B-Instruct
Fine-tuning Method: LoRA
Data Source: SWE-trajectory dataset
Date: February 2026


1. Scoring System Design

1.1 Design Principles

The scoring system answers one core question: “If a human expert were doing the same task, how would they evaluate this trajectory?”

A human expert judging a developer’s debugging process focuses on two things: how smart the approach was (Efficiency) and how clean the process was (Style). Whether the task was “solved correctly” (Correctness) and whether the trajectory is “complete” (Completeness) serve as pre-filtering gates and do not participate in continuous scoring.

1.2 Gate Conditions (Pre-filtering)

The following conditions do not participate in scoring — they only determine whether a trajectory enters the scoring pool:

| Gate | Condition | Rationale |
| --- | --- | --- |
| Completeness Gate | Truncation Ratio ≥ 0.9 | Nearly constant in the dataset (med=1.0, std≈0), no discriminative power; only cleans a small amount of corrupted data |
| Correctness Gate | Outcome Success = 1 (Resolved) | Binary variables should not be mixed with continuous variables in a weighted average; used as a stratification condition to rank within the resolved pool |
| Format Gate | Trajectory can be parsed into thought-action-observation structure | Malformed trajectories cannot be reliably scored |

Why not include Correctness as a scoring dimension?
Outcome Success is binary (0/1). When combined with continuous dimensions via weighted average, it dominates the score. By making it a gate, all trajectories in the scoring pool are resolved, and scoring focuses on “how well the task was completed.”
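The three gates compose into a single pool filter. A minimal sketch, assuming a trajectory schema with `truncation_ratio`, `outcome_success`, and `steps` keys (these field names are assumptions, not the dataset's actual keys):

```python
def passes_gates(traj: dict) -> bool:
    """Return True only if a trajectory enters the scoring pool."""
    # Completeness gate: drop truncated or corrupted trajectories
    if traj.get("truncation_ratio", 0.0) < 0.9:
        return False
    # Correctness gate: keep only resolved trajectories
    if traj.get("outcome_success") != 1:
        return False
    # Format gate: every step must parse into thought-action-observation
    steps = traj.get("steps") or []
    if not steps:
        return False
    return all({"thought", "action", "observation"} <= set(step) for step in steps)

# Example usage (hypothetical records)
step = {"thought": "inspect file", "action": "cat a.py", "observation": "..."}
resolved = {"truncation_ratio": 1.0, "outcome_success": 1, "steps": [step]}
truncated = {"truncation_ratio": 0.5, "outcome_success": 1, "steps": [step]}
```

Ordering the gates cheapest-first also keeps the filter fast: the format check, which requires parsing, only runs on trajectories that already passed the two scalar checks.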

1.3 Continuous Scoring Dimensions (4 dimensions)

After filtering, 32,161 resolved trajectories remain. They are scored along 4 dimensions:

Efficiency — Is the path to the goal concise?

| Sub-dimension | Definition | Scoring Method | Distribution |
| --- | --- | --- | --- |
| B2: Error-Retry Cycles | Cost of repeated retries after errors | Count “action→error observation→similar action” cycles, normalize and invert | std=0.286, med=0.300, strongest discriminator |
| B3: Step Count Ratio | Reasonableness of step count | This trajectory’s steps / median steps across all resolved trajectories for the same task, clipped and inverted | std=0.063, med=0.800 |
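As one concrete example, the B3 sub-score might look like the following sketch. The clip bound of 2× the task median is an assumption — the design only specifies “clipped and inverted”:

```python
def b3_step_ratio(n_steps: int, task_median_steps: float, clip_hi: float = 2.0) -> float:
    """B3 sketch: step count relative to the task's median, clipped and
    inverted so shorter-than-typical trajectories score closer to 1."""
    ratio = n_steps / task_median_steps     # 1.0 == typical length for this task
    ratio = min(max(ratio, 0.0), clip_hi)   # clip extreme outliers
    return 1.0 - ratio / clip_hi            # invert into [0, 1]
```

Under these assumptions a trajectory at exactly the task median scores 0.5, and anything at or beyond twice the median scores 0.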

Style — How clean is the trajectory as training data?

| Sub-dimension | Definition | Scoring Method | Distribution |
| --- | --- | --- | --- |
| C2: Action Diversity | Whether tool usage is reasonably diverse | Entropy of action types, normalized to [0,1] | std=0.046, med=0.655 |
| C3: Observation Utilization | Whether observation information is effectively used | Proportion of filenames (basename) / error class names from observations that are referenced in subsequent actions | std=0.118, med=0.313 |
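The C2 entropy normalization can be sketched as follows. How action types are extracted from raw trajectories is an assumption; the normalization divides by the maximum possible entropy for the observed number of distinct types:

```python
import math
from collections import Counter

def c2_action_diversity(action_types: list[str]) -> float:
    """Normalized Shannon entropy of the action-type distribution, in [0, 1]."""
    counts = Counter(action_types)
    if len(counts) <= 1:
        return 0.0                            # a single action type: zero diversity
    n = len(action_types)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))   # normalize by max possible entropy
```

A trajectory that alternates evenly between two tools scores 1.0; one that repeats the same tool throughout scores 0.0.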

Excluded Dimensions and Reasons

| Dimension | Reason for Exclusion |
| --- | --- |
| B1: Redundant Commands | std=0.033, med=0.962; almost no discriminative power. Agents in the dataset rarely execute fully duplicate commands |
| C1: Observation Cleanliness | std=0.043, med=0.967; almost no discriminative power. The vast majority of observations are clean |

Paper framing: We designed 6 candidate sub-dimensions and found through variance analysis that B1 and C1 lack discriminative power on this dataset (std < 0.05), so they were excluded. This itself is a finding — agents in the SWE-trajectory dataset are already highly homogeneous in command redundancy and observation cleanliness.

1.4 Score Aggregation

Efficiency = mean(B2, B3)           # std=0.152, med=0.529
Style      = mean(C2, C3)           # std=0.063, med=0.485
Composite  = 0.5 × Efficiency + 0.5 × Style   # std=0.083, med=0.507
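The aggregation above is a plain hierarchical mean with equal weights at both levels; as a sketch:

```python
def composite_score(b2: float, b3: float, c2: float, c3: float) -> float:
    """Hierarchical equal-weight aggregation from section 1.4."""
    efficiency = (b2 + b3) / 2
    style = (c2 + c3) / 2
    return 0.5 * efficiency + 0.5 * style
```

Because every weight is equal, the composite is also the plain mean of the four sub-scores, so no single sub-dimension can dominate by construction.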

1.5 C3 Implementation Details

The initial C3 implementation used full path matching (e.g., src/utils.py), causing matches to fail when agents referenced utils.py (without path prefix), resulting in a median of only 0.201. After switching to basename matching, the median improved to 0.313. The remaining low utilization reflects a common “read but don’t use” problem among agents, which is itself a noteworthy finding.
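A minimal sketch of the basename fix, where the file-path regex and the step schema are illustrative assumptions (the real extractor also handles error class names):

```python
import os
import re

FILE_RE = re.compile(r"[\w./-]+\.py")   # toy pattern; real extraction is richer

def c3_observation_utilization(steps: list[dict]) -> float:
    """Fraction of file basenames seen in observations that are later
    referenced in actions. Matching by basename means a reference to
    'utils.py' counts even if the observation showed 'src/utils.py'."""
    seen, used = set(), set()
    for step in steps:
        # check the action against files observed in *earlier* steps
        for path in FILE_RE.findall(step.get("action", "")):
            base = os.path.basename(path)
            if base in seen:
                used.add(base)
        # then record files surfaced by this step's observation
        for path in FILE_RE.findall(step.get("observation", "")):
            seen.add(os.path.basename(path))
    return len(used) / len(seen) if seen else 0.0

# Example: utils.py is reused, main.py is not -> utilization 0.5
demo = [
    {"action": "ls src", "observation": "src/utils.py src/main.py"},
    {"action": "open utils.py", "observation": ""},
]
```

Note the ordering inside the loop: an action is matched only against files observed in prior steps, since within a step the action precedes its observation.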


2. Experimental Group Design

2.1 Overview (13 groups + 1 baseline)

| # | Experiment | Sample Pool | Selection Method | Sample Size | Block |
| --- | --- | --- | --- | --- | --- |
| 0 | baseline | — | No fine-tuning | — | — |
| 1 | Random-500 | Full | Random | 500 | Block 1 |
| 2 | Random-1000 | Full | Random | 1000 | Block 1 |
| 3 | TopQ-500 | Resolved | Composite score top | 500 | Block 1 |
| 4 | TopQ-1000 | Resolved | Composite score top | 1000 | Block 1 |
| 5 | ResolvedOnly-500 | Resolved | Random | 500 | Block 1 |
| 6 | ResolvedOnly-1000 | Resolved | Random | 1000 | Block 1 |
| 7 | BottomQ-500 | Resolved | Composite score bottom | 500 | Block 1 |
| 8 | Ablation-NoEfficiency-500 | Resolved | Style only | 500 | Block 2 |
| 9 | Ablation-NoStyle-500 | Resolved | Efficiency only | 500 | Block 2 |
| 10 | Ablation-NoB2-500 | Resolved | Efficiency = B3 only | 500 | Block 3 |
| 11 | Ablation-NoB3-500 | Resolved | Efficiency = B2 only | 500 | Block 3 |
| 12 | Ablation-NoC2-500 | Resolved | Style = C3 only | 500 | Block 3 |
| 13 | Ablation-NoC3-500 | Resolved | Style = C2 only | 500 | Block 3 |
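Assuming each candidate trajectory carries its precomputed composite score, the three selection methods reduce to a small sketch (the `composite` key and the fixed seed are illustrative choices, the seed only for reproducibility):

```python
import random

def select(pool: list[dict], k: int, method: str,
           key: str = "composite", seed: int = 0) -> list[dict]:
    """Draw k samples: 'random', 'top' (highest score), or 'bottom'."""
    if method == "random":
        return random.Random(seed).sample(pool, k)
    reverse = method == "top"                        # descending for TopQ
    return sorted(pool, key=lambda t: t[key], reverse=reverse)[:k]

# Example: TopQ vs. BottomQ from a toy resolved pool
resolved = [{"id": i, "composite": i / 10} for i in range(10)]
top3 = select(resolved, 3, "top")
bottom3 = select(resolved, 3, "bottom")
```

The ablation groups (exp8–13) differ only in which sub-scores feed the `composite` field, not in the selection logic itself.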

2.2 Research Questions by Block

Block 1: Data Volume vs. Strategy Comparison (7 groups)

| Comparison | Research Question |
| --- | --- |
| exp1 vs exp5 | Does the gate help? Full random vs. resolved-pool random |
| exp5 vs exp3 | Does scoring help? Resolved random vs. resolved ranked |
| exp1→exp2, exp3→exp4, exp5→exp6 | Data scaling effect: improvement from 500→1000 across three strategies |
| exp3 vs exp7 | Sanity check: best vs. worst to validate the scoring system |

Block 2: High-level Dimension Ablation (2 groups)

| Comparison | Research Question |
| --- | --- |
| exp8 vs exp9 vs exp3 | Efficiency vs. Style — which matters more? Individual vs. combined |

Block 3: Sub-dimension Ablation (4 groups)

| Comparison | Research Question |
| --- | --- |
| exp10 vs exp11 vs exp3 | Within Efficiency: Error-Retry Cycles vs. Step Count Ratio |
| exp12 vs exp13 vs exp3 | Within Style: Action Diversity vs. Observation Utilization |

2.3 Reusable Prior Experiments

| Experiment | Reusable? | Reason |
| --- | --- | --- |
| baseline | ✅ Yes | No fine-tuning |
| Random-500 (exp1) | ✅ Yes | Random sampling is independent of the scoring system |
| Random-1000 (exp2) | ✅ Yes | Same reason |
| All others | ❌ Retrain | Changes in the scoring formula lead to different selected samples |

Actual new training needed: 11 groups


3. Evaluation Protocol

3.1 Perplexity Evaluation (Cross-Entropy Loss)

Compute average cross-entropy loss on assistant tokens across three independent test sets:

| Test Set | Samples | Source |
| --- | --- | --- |
| Gold | 200 | Trajectories with highest new composite score |
| Random | 200 | Randomly sampled |
| Low-Q | 200 | Trajectories with lowest new composite score |

Note: Test sets must also be reconstructed based on the new scoring, so that Gold and Low-Q reflect the quality ranking under the new system.
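Conceptually, the per-set metric is an average negative log-likelihood restricted to assistant tokens. A framework-agnostic sketch, where the token-level log-probs would come from the fine-tuned model and the mask marks assistant positions (both are assumed inputs):

```python
import math

def assistant_ce(logprobs: list[float], assistant_mask: list[int]) -> float:
    """Mean cross-entropy over assistant positions only.

    logprobs[i] is log p(token_i | prefix) from the model; assistant_mask[i]
    is 1 where token_i belongs to an assistant turn, 0 elsewhere."""
    losses = [-lp for lp, m in zip(logprobs, assistant_mask) if m]
    return sum(losses) / len(losses) if losses else float("nan")

# Example: two assistant tokens each predicted with p = 0.5 -> loss = ln 2
demo = assistant_ce([math.log(0.9), math.log(0.5), math.log(0.5)], [0, 1, 1])
```

In a typical Hugging Face setup the same effect is achieved by setting non-assistant label positions to the ignore index (-100) so the built-in loss averages over assistant tokens only.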

3.2 Expected Validation

All models should exhibit a Gold < Random < Low-Q loss gradient to validate the new scoring system’s soundness.


4. Paper Narrative Arc

Layer 1: Does fine-tuning help at all?         baseline vs. any fine-tuned model
Layer 2: Does the gate help?                   Random-500 vs. ResolvedOnly-500
Layer 3: Does scoring help?                    ResolvedOnly-500 vs. TopQ-500
Layer 4: Data volume vs. quality?              500→1000 scaling curve (three strategies)
Layer 5: Which high-level dimension matters?   EfficiencyOnly vs. StyleOnly vs. TopQ
Layer 6: Which sub-dimension matters?          Sub-dimension ablation (B2/B3/C2/C3)
Validation: Is the scoring system valid?       TopQ vs. BottomQ + test set quality gradient

5. Design Decision Log

| Decision | Choice | Alternative | Rationale |
| --- | --- | --- | --- |
| Truncation Ratio handling | Gate (filter) | As a scoring dimension | std≈0, no discriminative power |
| Outcome Success handling | Gate (stratification) | As a continuous scoring dimension | Binary variables should not be weighted-averaged with continuous variables |
| B1/C1 exclusion | Removed from composite | Rank normalization to force spread | Rank normalization amplifies noise; removal is more honest |
| C3 file matching | Basename matching | Full path matching | Agents often omit path prefixes; full matching causes systematic underscoring of C3 |
| Aggregation method | Hierarchical mean + equal weights | Weighted average / learned weights | Equal weights as default; weight differences can be revealed indirectly through ablation |
| Data scale | 500 / 1000 | 2000 / 5000 | GPU resource constraints |