Research Log: Model Debugging & OpenHands Migration — Feb 13, 2026


1. Focus of the Day: Model Debugging & Framework Migration

Today’s work focused on model-side testing and OpenHands framework refactoring.

On the model side, I reconfigured the tool-loading logic and aligned it with the evaluation server settings; the full pipeline now runs end to end. One hard requirement under the current configuration:

ToolCall must be enabled.

I also identified structural issues in the previously used OpenHands v1.0.0 architecture. It is not suitable as a stable evaluation baseline, so I decided to migrate to a more stable release.


2. Debugging & Fixes

1. Workflow Cleanup

  • Removed legacy scripts
  • Cleaned outdated workflow notes
  • Reorganized the main evaluation pipeline

2. Freeze & Timeout Issue

Symptoms:

  • Inference process stalled
  • Timeout did not properly release resources

Fix:

  • Enabled streaming output (stream=True)

Result:

  • Significant improvement in stability
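The streaming fix above can be sketched as a chunk-by-chunk consumer with a wall-clock deadline, so a stalled inference process is detected and aborted instead of hanging and holding resources. This is my illustration, not the harness's actual code; `chunks` stands in for whatever stream object the inference client returns.

```python
import time
from typing import Iterable

def consume_stream(chunks: Iterable[str], deadline_s: float = 300.0) -> str:
    """Accumulate streamed text, aborting if the overall deadline is exceeded."""
    start = time.monotonic()
    parts = []
    for chunk in chunks:
        if time.monotonic() - start > deadline_s:
            # Abort instead of blocking forever on a stalled stream,
            # so the caller can release the worker and retry.
            raise TimeoutError(f"generation exceeded {deadline_s}s")
        parts.append(chunk)
    return "".join(parts)
```

With `stream=True`, each chunk arrival gives the loop a chance to check the deadline, which is why streaming improved stability over a single blocking call.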

3. Missing Output Limit in Evaluation

Discovered that no maximum generation limit was set, leading to:

  • Token accumulation
  • Performance degradation
  • Slow execution speed

Planned improvement:

  • Add max token limit
  • Add output truncation safeguard
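A minimal sketch of the planned truncation safeguard, under stated assumptions: the 4096 cap and the whitespace-based token count are placeholders for illustration, not the harness's real tokenizer or limit.

```python
MAX_NEW_TOKENS = 4096  # assumed cap; tune per benchmark

def truncate_output(text: str, max_tokens: int = MAX_NEW_TOKENS) -> str:
    """Keep at most max_tokens whitespace-delimited tokens of text."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    # Truncate and mark the cut so downstream parsing can detect it.
    return " ".join(tokens[:max_tokens]) + " [truncated]"
```

Applying this after every generation step keeps oversized outputs from accumulating in the context and degrading speed.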

4. Patch Matching Issue

During SWE-bench Verified evaluation:

  • Patch format mismatches occurred
  • ToolCall formatting inconsistencies caused parsing failures

Conclusion:

Tool call formatting critically affects OpenHands behavior.
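One way to catch the format mismatches earlier (my illustration, not the evaluation harness's actual parser) is a cheap pre-check that a model-emitted patch at least looks like a unified diff before it is handed to evaluation:

```python
def looks_like_unified_diff(patch: str) -> bool:
    """Heuristic check for unified-diff structure: file headers and a hunk."""
    lines = patch.splitlines()
    has_old = any(line.startswith("--- ") for line in lines)
    has_new = any(line.startswith("+++ ") for line in lines)
    has_hunk = any(line.startswith("@@") for line in lines)
    return has_old and has_new and has_hunk
```

Rejecting malformed patches at this stage surfaces ToolCall formatting problems as explicit failures rather than silent parsing errors deep in the pipeline.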


3. Recommendation: native_tool_calling Configuration

For Qwen2.5-Coder-7B-Instruct + OpenHands + SWE-bench Verified:

Explicitly set:

native_tool_calling: true

Reasoning:

  • Avoid unstable automatic heuristics (default None)
  • Use structured tool formatting
  • Improve tool-call stability
  • Fully leverage Qwen’s native tool support
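An illustrative config.toml fragment with this setting; the section name and model string reflect my local setup and are assumptions, not prescribed values.

```toml
[llm]
model = "Qwen2.5-Coder-7B-Instruct"
# Explicit, rather than the default of None (auto-detection heuristics):
native_tool_calling = true
```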

4. Version Decision: OpenHands v0.54.0

Reasons for locking to v0.54.0:

  • Dataset generation version consistency
  • Fixed evaluation baseline
  • v0.55.0 introduces a security_risk parameter that requires extra environment configuration

Key evaluation settings:

enable_history_truncation = false
enable_default_condenser = true
condenser.type = noop
enable_condensation_request = false

5. Migration Completed

Architecture Comparison

| Item            | Old            | New (v0.54.0)          |
| --------------- | -------------- | ---------------------- |
| Package Manager | uv             | poetry                 |
| Inference       | swebench-infer | run_infer.sh           |
| Docker          | Manual build   | Per-instance auto pull |
| LLM Config      | JSON           | config.toml            |
| Evaluation      | swebench-eval  | eval_infer.sh          |

6. Current Status

✅ Architecture migration completed

✅ ToolCall pipeline verified

✅ Inference & evaluation chain fully connected

7. Plan for Tomorrow

Run full end-to-end SWE-bench Verified evaluation under the new architecture to validate stability.