Trajectory Quality-Aware Data Selection — Experiment Design v3


Base Model: Qwen2.5-Coder-7B-Instruct
Fine-tuning Method: LoRA
Data Source: SWE-trajectory dataset
Date: February 2026


1. Scoring System Design

1.1 Design Principles

The scoring system answers one core question: “If a human expert were doing the same task, how would they evaluate this trajectory?”

A human expert judging a developer’s debugging process focuses on two things: how smart the approach was (Efficiency) and how clean the process was (Style). Whether the task was “solved correctly” (Correctness) and whether the trajectory is “complete” (Completeness) serve as pre-filtering gates and do not participate in continuous scoring.

1.2 Gate Conditions (Pre-filtering)

The following conditions do not participate in scoring — they only determine whether a trajectory enters the scoring pool:

| Gate | Condition | Rationale |
| --- | --- | --- |
| Completeness Gate | Truncation Ratio ≥ 0.9 | Nearly constant in the dataset (med=1.0, std≈0), no discriminative power; only cleans a small amount of corrupted data |
| Correctness Gate | Outcome Success = 1 (Resolved) | Binary variables should not be mixed with continuous variables in a weighted average; used as a stratification condition to rank within the resolved pool |
| Format Gate | Trajectory can be parsed into thought-action-observation structure | Malformed trajectories cannot be reliably scored |

Why not include Correctness as a scoring dimension?
Outcome Success is binary (0/1). When combined with continuous dimensions via weighted average, it dominates the score. By making it a gate, all trajectories in the scoring pool are resolved, and scoring focuses on “how well the task was completed.”
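The three gates compose into a single pool filter. A minimal sketch, assuming a trajectory schema with `truncation_ratio`, `outcome_success`, and `steps` keys (these field names are assumptions, not the dataset's actual keys):

```python
def passes_gates(traj: dict) -> bool:
    """Return True only if a trajectory enters the scoring pool."""
    # Completeness gate: drop truncated or corrupted trajectories
    if traj.get("truncation_ratio", 0.0) < 0.9:
        return False
    # Correctness gate: keep only resolved trajectories
    if traj.get("outcome_success") != 1:
        return False
    # Format gate: every step must parse into thought-action-observation
    steps = traj.get("steps") or []
    if not steps:
        return False
    return all({"thought", "action", "observation"} <= set(step) for step in steps)

# Example usage (hypothetical records)
step = {"thought": "inspect file", "action": "cat a.py", "observation": "..."}
resolved = {"truncation_ratio": 1.0, "outcome_success": 1, "steps": [step]}
truncated = {"truncation_ratio": 0.5, "outcome_success": 1, "steps": [step]}
```

Ordering the gates cheapest-first also keeps the filter fast: the format check, which requires parsing, only runs on trajectories that already passed the two scalar checks.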

1.3 Continuous Scoring Dimensions (4 dimensions)

After filtering, 32,161 resolved trajectories remain. They are scored along 4 dimensions:

Efficiency — Is the path to the goal concise?

| Sub-dimension | Definition | Scoring Method | Distribution |
| --- | --- | --- | --- |
| B2: Error-Retry Cycles | Cost of repeated retries after errors | Count “action→error observation→similar action” cycles, normalize and invert | std=0.286, med=0.300, strongest discriminator |
| B3: Step Count Ratio | Reasonableness of step count | This trajectory’s steps / median steps across all resolved trajectories for the same task, clipped and inverted | std=0.063, med=0.800 |
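As one concrete example, the B3 sub-score might look like the following sketch. The clip bound of 2× the task median is an assumption — the design only specifies “clipped and inverted”:

```python
def b3_step_ratio(n_steps: int, task_median_steps: float, clip_hi: float = 2.0) -> float:
    """B3 sketch: step count relative to the task's median, clipped and
    inverted so shorter-than-typical trajectories score closer to 1."""
    ratio = n_steps / task_median_steps     # 1.0 == typical length for this task
    ratio = min(max(ratio, 0.0), clip_hi)   # clip extreme outliers
    return 1.0 - ratio / clip_hi            # invert into [0, 1]
```

Under these assumptions a trajectory at exactly the task median scores 0.5, and anything at or beyond twice the median scores 0.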

Style — How clean is the trajectory as training data?

| Sub-dimension | Definition | Scoring Method | Distribution |
| --- | --- | --- | --- |
| C2: Action Diversity | Whether tool usage is reasonably diverse | Entropy of action types, normalized to [0,1] | std=0.046, med=0.655 |
| C3: Observation Utilization | Whether observation information is effectively used | Proportion of filenames (basename) / error class names from observations that are referenced in subsequent actions | std=0.118, med=0.313 |
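The C2 entropy normalization can be sketched as follows. How action types are extracted from raw trajectories is an assumption; the normalization divides by the maximum possible entropy for the observed number of distinct types:

```python
import math
from collections import Counter

def c2_action_diversity(action_types: list[str]) -> float:
    """Normalized Shannon entropy of the action-type distribution, in [0, 1]."""
    counts = Counter(action_types)
    if len(counts) <= 1:
        return 0.0                            # a single action type: zero diversity
    n = len(action_types)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))   # normalize by max possible entropy
```

A trajectory that alternates evenly between two tools scores 1.0; one that repeats the same tool throughout scores 0.0.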

Excluded Dimensions and Reasons

| Dimension | Reason for Exclusion |
| --- | --- |
| B1: Redundant Commands | std=0.033, med=0.962; almost no discriminative power. Agents in the dataset rarely execute fully duplicate commands |
| C1: Observation Cleanliness | std=0.043, med=0.967; almost no discriminative power. The vast majority of observations are clean |

Paper framing: We designed 6 candidate sub-dimensions and found through variance analysis that B1 and C1 lack discriminative power on this dataset (std < 0.05), so they were excluded. This itself is a finding — agents in the SWE-trajectory dataset are already highly homogeneous in command redundancy and observation cleanliness.

1.4 Score Aggregation

Efficiency = mean(B2, B3)           # std=0.152, med=0.529
Style      = mean(C2, C3)           # std=0.063, med=0.485
Composite  = 0.5 × Efficiency + 0.5 × Style   # std=0.083, med=0.507
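The aggregation above is a plain hierarchical mean with equal weights at both levels; as a sketch:

```python
def composite_score(b2: float, b3: float, c2: float, c3: float) -> float:
    """Hierarchical equal-weight aggregation from section 1.4."""
    efficiency = (b2 + b3) / 2
    style = (c2 + c3) / 2
    return 0.5 * efficiency + 0.5 * style
```

Because every weight is equal, the composite is also the plain mean of the four sub-scores, so no single sub-dimension can dominate by construction.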

1.5 C3 Implementation Details

The initial C3 implementation used full path matching (e.g., src/utils.py), causing matches to fail when agents referenced utils.py (without path prefix), resulting in a median of only 0.201. After switching to basename matching, the median improved to 0.313. The remaining low utilization reflects a common “read but don’t use” problem among agents, which is itself a noteworthy finding.
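A minimal sketch of the basename fix, where the file-path regex and the step schema are illustrative assumptions (the real extractor also handles error class names):

```python
import os
import re

FILE_RE = re.compile(r"[\w./-]+\.py")   # toy pattern; real extraction is richer

def c3_observation_utilization(steps: list[dict]) -> float:
    """Fraction of file basenames seen in observations that are later
    referenced in actions. Matching by basename means a reference to
    'utils.py' counts even if the observation showed 'src/utils.py'."""
    seen, used = set(), set()
    for step in steps:
        # check the action against files observed in *earlier* steps
        for path in FILE_RE.findall(step.get("action", "")):
            base = os.path.basename(path)
            if base in seen:
                used.add(base)
        # then record files surfaced by this step's observation
        for path in FILE_RE.findall(step.get("observation", "")):
            seen.add(os.path.basename(path))
    return len(used) / len(seen) if seen else 0.0

# Example: utils.py is reused, main.py is not -> utilization 0.5
demo = [
    {"action": "ls src", "observation": "src/utils.py src/main.py"},
    {"action": "open utils.py", "observation": ""},
]
```

Note the ordering inside the loop: an action is matched only against files observed in prior steps, since within a step the action precedes its observation.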


2. Experimental Group Design

2.1 Overview (13 groups + 1 baseline)

| # | Experiment | Sample Pool | Selection Method | Sample Size | Block |
| --- | --- | --- | --- | --- | --- |
| 0 | baseline | — | No fine-tuning | — | — |
| 1 | Random-500 | Full | Random | 500 | Block 1 |
| 2 | Random-1000 | Full | Random | 1000 | Block 1 |
| 3 | TopQ-500 | Resolved | Composite score top | 500 | Block 1 |
| 4 | TopQ-1000 | Resolved | Composite score top | 1000 | Block 1 |
| 5 | ResolvedOnly-500 | Resolved | Random | 500 | Block 1 |
| 6 | ResolvedOnly-1000 | Resolved | Random | 1000 | Block 1 |
| 7 | BottomQ-500 | Resolved | Composite score bottom | 500 | Block 1 |
| 8 | Ablation-NoEfficiency-500 | Resolved | Style only | 500 | Block 2 |
| 9 | Ablation-NoStyle-500 | Resolved | Efficiency only | 500 | Block 2 |
| 10 | Ablation-NoB2-500 | Resolved | Efficiency = B3 only | 500 | Block 3 |
| 11 | Ablation-NoB3-500 | Resolved | Efficiency = B2 only | 500 | Block 3 |
| 12 | Ablation-NoC2-500 | Resolved | Style = C3 only | 500 | Block 3 |
| 13 | Ablation-NoC3-500 | Resolved | Style = C2 only | 500 | Block 3 |
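Assuming each candidate trajectory carries its precomputed composite score, the three selection methods reduce to a small sketch (the `composite` key and the fixed seed are illustrative choices, the seed only for reproducibility):

```python
import random

def select(pool: list[dict], k: int, method: str,
           key: str = "composite", seed: int = 0) -> list[dict]:
    """Draw k samples: 'random', 'top' (highest score), or 'bottom'."""
    if method == "random":
        return random.Random(seed).sample(pool, k)
    reverse = method == "top"                        # descending for TopQ
    return sorted(pool, key=lambda t: t[key], reverse=reverse)[:k]

# Example: TopQ vs. BottomQ from a toy resolved pool
resolved = [{"id": i, "composite": i / 10} for i in range(10)]
top3 = select(resolved, 3, "top")
bottom3 = select(resolved, 3, "bottom")
```

The ablation groups (exp8–13) differ only in which sub-scores feed the `composite` field, not in the selection logic itself.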

2.2 Research Questions by Block

Block 1: Data Volume vs. Strategy Comparison (7 groups)

| Comparison | Research Question |
| --- | --- |
| exp1 vs exp5 | Does the gate help? Full random vs. resolved-pool random |
| exp5 vs exp3 | Does scoring help? Resolved random vs. resolved ranked |
| exp1→exp2, exp3→exp4, exp5→exp6 | Data scaling effect: improvement from 500→1000 across three strategies |
| exp3 vs exp7 | Sanity check: best vs. worst to validate the scoring system |

Block 2: High-level Dimension Ablation (2 groups)

| Comparison | Research Question |
| --- | --- |
| exp8 vs exp9 vs exp3 | Efficiency vs. Style — which matters more? Individual vs. combined |

Block 3: Sub-dimension Ablation (4 groups)

| Comparison | Research Question |
| --- | --- |
| exp10 vs exp11 vs exp3 | Within Efficiency: Error-Retry Cycles vs. Step Count Ratio |
| exp12 vs exp13 vs exp3 | Within Style: Action Diversity vs. Observation Utilization |

2.3 Reusable Prior Experiments

| Experiment | Reusable? | Reason |
| --- | --- | --- |
| baseline | ✅ Yes | No fine-tuning |
| Random-500 (exp1) | ✅ Yes | Random sampling is independent of the scoring system |
| Random-1000 (exp2) | ✅ Yes | Same reason |
| All others | ❌ Retrain | Changes in the scoring formula lead to different selected samples |

Actual new training needed: 11 groups


3. Evaluation Protocol

3.1 Perplexity Evaluation (Cross-Entropy Loss)

Compute average cross-entropy loss on assistant tokens across three independent test sets:

| Test Set | Samples | Source |
| --- | --- | --- |
| Gold | 200 | Trajectories with highest new composite score |
| Random | 200 | Randomly sampled |
| Low-Q | 200 | Trajectories with lowest new composite score |

Note: Test sets must also be reconstructed based on the new scoring, so that Gold and Low-Q reflect the quality ranking under the new system.
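Conceptually, the per-set metric is an average negative log-likelihood restricted to assistant tokens. A framework-agnostic sketch, where the token-level log-probs would come from the fine-tuned model and the mask marks assistant positions (both are assumed inputs):

```python
import math

def assistant_ce(logprobs: list[float], assistant_mask: list[int]) -> float:
    """Mean cross-entropy over assistant positions only.

    logprobs[i] is log p(token_i | prefix) from the model; assistant_mask[i]
    is 1 where token_i belongs to an assistant turn, 0 elsewhere."""
    losses = [-lp for lp, m in zip(logprobs, assistant_mask) if m]
    return sum(losses) / len(losses) if losses else float("nan")

# Example: two assistant tokens each predicted with p = 0.5 -> loss = ln 2
demo = assistant_ce([math.log(0.9), math.log(0.5), math.log(0.5)], [0, 1, 1])
```

In a typical Hugging Face setup the same effect is achieved by setting non-assistant label positions to the ignore index (-100) so the built-in loss averages over assistant tokens only.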

3.2 Expected Validation

All models should exhibit a Gold < Random < Low-Q loss gradient to validate the new scoring system’s soundness.


4. Paper Narrative Arc

Layer 1: Does fine-tuning help at all?         baseline vs. any fine-tuned model
Layer 2: Does the gate help?                   Random-500 vs. ResolvedOnly-500
Layer 3: Does scoring help?                    ResolvedOnly-500 vs. TopQ-500
Layer 4: Data volume vs. quality?              500→1000 scaling curve (three strategies)
Layer 5: Which high-level dimension matters?   EfficiencyOnly vs. StyleOnly vs. TopQ
Layer 6: Which sub-dimension matters?          Sub-dimension ablation (B2/B3/C2/C3)
Validation: Is the scoring system valid?       TopQ vs. BottomQ + test set quality gradient

5. Design Decision Log

| Decision | Choice | Alternative | Rationale |
| --- | --- | --- | --- |
| Truncation Ratio handling | Gate (filter) | As a scoring dimension | std≈0, no discriminative power |
| Outcome Success handling | Gate (stratification) | As a continuous scoring dimension | Binary variables should not be weighted-averaged with continuous variables |
| B1/C1 exclusion | Removed from composite | Rank normalization to force spread | Rank normalization amplifies noise; removal is more honest |
| C3 file matching | Basename matching | Full path matching | Agents often omit path prefixes; full matching causes systematic underscoring of C3 |
| Aggregation method | Hierarchical mean + equal weights | Weighted average / learned weights | Equal weights as default; weight differences can be revealed indirectly through ablation |
| Data scale | 500 / 1000 | 2000 / 5000 | GPU resource constraints |