Daily Report: Reverse-Engineering Scoring Flaws from Experimental Results — Rebuilding the Agent Trajectory Quality Framework — Feb 24, 2026


Today’s Context

Today’s core work was reviewing the first round of experimental results and restructuring the scoring system accordingly. We previously completed 11 sets of LoRA fine-tuning experiments on Qwen2.5-Coder-7B-Instruct (covering Random / TopQ / ShortHQ / SuccessOnly / Ablation strategies), and the perplexity evaluation results are in. The data tells a very straightforward story: quality filtering is not statistically significant at the current scale, and one root cause lies in the scoring system itself.


1. First Round: Key Findings

Data Volume >> Quality Strategy

The most significant performance gap comes from doubling the data volume, not from careful selection:

  • Random-500 → Random-1000: Gold Loss dropped from 0.469 to 0.407 (↓ 13.2%)
  • Random-500 → TopQ-500: Gold Loss dropped from 0.469 to 0.466 (only ↓ 0.6%)

Statistical testing confirms this: the Mann-Whitney U test for TopQ-500 vs Random-500 yields p=0.27 — not significant. At the 500-sample scale, the benefits of quality filtering are virtually drowned out by noise.
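For reference, the comparison can be run with SciPy's `mannwhitneyu`; the two arrays below are synthetic placeholders standing in for the per-sample gold losses, not the actual experiment data:

```python
# Sketch: Mann-Whitney U test of per-sample gold losses between two runs.
# The two arrays are hypothetical placeholders, NOT the real experiment data.
from scipy.stats import mannwhitneyu

topq_500_losses   = [0.45, 0.48, 0.47, 0.46, 0.49]  # hypothetical per-sample losses
random_500_losses = [0.46, 0.47, 0.48, 0.47, 0.50]  # hypothetical per-sample losses

stat, p = mannwhitneyu(topq_500_losses, random_500_losses, alternative="two-sided")
print(f"U={stat}, p={p:.3f}")  # heavily overlapping samples, so p is far above 0.05
```

With distributions this close, the test cannot reject the null hypothesis, which mirrors the p=0.27 outcome above.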

Ablation: Step Efficiency Matters Most, Truncation Ratio Matters Least

Using TopQ-500 as baseline, Gold Loss change after removing each dimension:

Removed Dimension    Gold Loss    Δ
-----------------    ---------    -------
Step Efficiency      0.4679       +0.0022
Outcome Success      0.4677       +0.0020
Observation Noise    0.4672       +0.0015
Action Diversity     0.4671       +0.0014
Truncation Ratio     0.4664       +0.0007

While the differences are small (on the order of 0.001–0.002), the ranking is stable and points to a deeper problem — these five dimensions are not at the same level of abstraction.


2. Diagnosis: Why the Current Scoring System Falls Short

The existing 5-dimension scoring (Truncation Ratio / Outcome Success / Step Efficiency / Observation Noise / Action Diversity) has a taxonomic flaw:

  • Truncation Ratio measures data completeness — it’s metadata-level information and should not be scored alongside behavioral quality.
  • Outcome Success is a result-level metric (whether the task ultimately succeeded), while Step Efficiency and Observation Noise are process-level metrics.
  • Mixing these into a single weighted formula invites legitimate reviewer criticism about the theoretical basis for classification.

This explains why Truncation Ratio has the smallest ablation impact — it was never meant to be a scoring dimension; it should be a pre-filtering gate.


3. Restructured Framework: Gate Conditions + Three-Dimensional Layering

Layer 1: Gate Conditions (Pre-filtering)

Metrics that don’t belong to “behavioral quality” are elevated to entry gates, excluded from the final score:

  • Completeness Gate: Trajectories with a truncation ratio below a threshold (e.g., 0.9) are discarded outright. The vast majority of trajectories in the original data sit at 1.0, so this step mainly cleans up edge cases.
  • Format Validity Gate: Validates the thought-action-observation alternating structure; severely malformed trajectories are removed.
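A minimal sketch of the two gates, assuming each trajectory is a dict with a precomputed `truncation_ratio` and a list of typed steps (the field names, the 0.9 threshold, and strict thought→action→observation alternation as the definition of "well-formed" are all illustrative choices, not the production schema):

```python
# Sketch: Layer-1 gate filtering. Field names (truncation_ratio, steps) and the
# strict alternation check are illustrative assumptions, not the real schema.

def passes_gates(traj, min_truncation_ratio=0.9):
    # Completeness Gate: discard trajectories below the truncation threshold.
    if traj.get("truncation_ratio", 0.0) < min_truncation_ratio:
        return False
    # Format Validity Gate: require strict thought -> action -> observation cycles.
    steps = [s["type"] for s in traj.get("steps", [])]
    expected = ["thought", "action", "observation"]
    if not steps or len(steps) % 3 != 0:
        return False
    return all(t == expected[i % 3] for i, t in enumerate(steps))

trajs = [
    {"truncation_ratio": 1.0,
     "steps": [{"type": "thought"}, {"type": "action"}, {"type": "observation"}]},
    {"truncation_ratio": 0.7,  # fails the Completeness Gate
     "steps": [{"type": "thought"}, {"type": "action"}, {"type": "observation"}]},
]
kept = [t for t in trajs if passes_gates(t)]
print(len(kept))  # -> 1
```

Keeping the gates as boolean filters (rather than scored dimensions) is exactly what removes Truncation Ratio from the weighted formula.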

Layer 2: Three Evaluation Dimensions

Among trajectories that pass the gates, scoring happens at a unified level of abstraction:

A. Correctness — 30%

  • Resolution Outcome: Whether the task was ultimately resolved (binary)
  • Test Awareness: Whether test suites were executed for verification
  • Patch Precision: How precise the code changes are (normalized by diff line/file count)

B. Efficiency — 40%

The first round of ablation already showed this is the most critical dimension, so it gets the finest decomposition:

  • Redundant command detection (repeated execution of equivalent instructions)
  • Wasted exploration identification (whether ls/cat operations contribute to the subsequent patch)
  • Error recovery cost (number of syntax error → fix cycles)
  • Step count ratio (relative to median step count for the same task)
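The first sub-dimension (redundant command detection) can be sketched by normalizing commands and counting repeats. Treating "equivalent" as whitespace-normalized string equality is a deliberate simplification here; the real notion of command equivalence is still an open design choice:

```python
# Sketch: redundant-command detection for the Efficiency dimension.
# "Equivalent" is approximated by whitespace-normalized string equality --
# a simplification of whatever equivalence the final scorer will use.
from collections import Counter

def redundancy_score(commands):
    """Return the fraction of executions that repeat an earlier command (0 = none)."""
    if not commands:
        return 0.0
    normalized = [" ".join(c.split()) for c in commands]
    counts = Counter(normalized)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(normalized)

print(redundancy_score(["ls -la", "cat a.py", "ls  -la", "pytest"]))  # -> 0.25
```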

C. Style Quality — 30%

  • Observation cleanliness (traceback/warning token proportion)
  • Reasoning coherence (cosine similarity anomaly detection between consecutive thoughts)
  • Information utilization (whether key clues from observations are referenced in subsequent actions)

This architecture ensures all dimensions operate at the unified abstraction level of “Agent behavioral quality,” preempting reviewer concerns about taxonomic coherence.
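The 30/40/30 aggregation over the three Layer-2 dimensions can be sketched as follows; each dimension score is assumed to be pre-normalized to [0, 1], and the sub-metric aggregation inside each dimension is omitted:

```python
# Sketch of the 30/40/30 Layer-2 aggregation. Each dimension score is assumed
# pre-normalized to [0, 1]; sub-metric aggregation within a dimension is omitted.

WEIGHTS = {"correctness": 0.30, "efficiency": 0.40, "style": 0.30}

def trajectory_score(dims):
    """Weighted sum over the three Layer-2 dimensions (each in [0, 1])."""
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)

score = trajectory_score({"correctness": 0.9, "efficiency": 0.6, "style": 0.8})
print(round(score, 2))  # -> 0.75
```

Because the gates run first, this score is only ever computed over complete, well-formed trajectories.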


4. Next Experimental Design

Based on the new scoring system, the second round will significantly expand in both scale and depth:

  1. Scaling Curve: Random / TopQ each covering 500 / 1000 / 2000 / 5000 data points to plot a complete scaling curve and observe whether quality filtering benefits emerge at larger scales.
  2. Dimension Ablation: Systematically removing / exclusively using Correctness, Efficiency, or Style Quality — testing in both directions.
  3. Strategy Comparison: Adding Curriculum Learning (starting with TopQ data, gradually mixing in medium-quality data) and BottomQ as a sanity check.
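One way to realize the Curriculum Learning strategy is a linear mixing schedule that starts on TopQ data only and gradually raises the share of medium-quality samples; the linear shape and the 50% cap below are hypothetical design parameters, not the finalized plan:

```python
# Sketch: linear curriculum schedule -- begin with TopQ only, then linearly mix
# in medium-quality data. Shape and cap are hypothetical, not the final design.

def medium_fraction(epoch, total_epochs, max_fraction=0.5):
    """Fraction of medium-quality samples in the training mix at a given epoch."""
    if total_epochs <= 1:
        return max_fraction
    return max_fraction * epoch / (total_epochs - 1)

print([round(medium_fraction(e, 5), 2) for e in range(5)])  # -> [0.0, 0.12, 0.25, 0.38, 0.5]
```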

This implies 20+ fine-tuning experiments with significant training costs, but it’s essential for the robustness of the thesis conclusions.


5. Product Update: KitaApp MVP 1.0 Finished

KitaApp MVP 1.0 has reached its milestone with a complete business loop:

  • Core Features: Invitation-based joining, real-time announcements (with Ja-Nein yes/no voting), secure media sharing, and leave management.
  • Compliance by Design: Independent Docker deployment per kindergarten, SecureStore token management, and memory-only image caching, in line with the German DSGVO (GDPR).
  • Internationalization: German (default) and English.

Due to the intensive research phase ahead, app deployment and further testing will be temporarily paused.


Today’s Summary

The greatest value of the first round of experiments is not what it proved, but what it exposed — the structural flaws in the scoring system. When your measurement tool itself lacks theoretical consistency, no amount of experimental data will escape the noise. Fix the ruler before you measure length.

Next Steps:

  • Implement the new scoring system in code (priority: forward-checking algorithm for Efficiency sub-dimensions).
  • Prepare dataset sampling for the second round’s Scaling Curve experiments.