Research Log: Model Debugging & OpenHands Migration — Feb 13, 2026


1. Focus of the Day: Model Debugging & Framework Migration

Today’s work focused on model-side testing and OpenHands framework refactoring.

On the model side, I reconfigured the tool-loading logic and aligned it with the evaluation server settings; the full pipeline now runs end to end. One hard requirement under the current configuration:

ToolCall must be enabled.

I also identified structural issues in the previously used OpenHands v1.0.0 architecture. It is not suitable as a stable evaluation baseline, so I decided to migrate to a more stable release.


2. Debugging & Fixes

1. Workflow Cleanup

  • Removed legacy scripts
  • Cleaned outdated workflow notes
  • Reorganized the main evaluation pipeline

2. Freeze & Timeout Issue

Symptoms:

  • Inference process stalled
  • Timeout did not properly release resources

Fix:

  • Enabled streaming output (stream=True)

Result:

  • Significant improvement in stability
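The streaming fix above can be sketched as a chunk-by-chunk consumer with a wall-clock deadline, so a stalled inference process is detected and aborted instead of hanging and holding resources. This is my illustration, not the harness's actual code; `chunks` stands in for whatever stream object the inference client returns.

```python
import time
from typing import Iterable

def consume_stream(chunks: Iterable[str], deadline_s: float = 300.0) -> str:
    """Accumulate streamed text, aborting if the overall deadline is exceeded."""
    start = time.monotonic()
    parts = []
    for chunk in chunks:
        if time.monotonic() - start > deadline_s:
            # Abort instead of blocking forever on a stalled stream,
            # so the caller can release the worker and retry.
            raise TimeoutError(f"generation exceeded {deadline_s}s")
        parts.append(chunk)
    return "".join(parts)
```

With `stream=True`, each chunk arrival gives the loop a chance to check the deadline, which is why streaming improved stability over a single blocking call.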

3. Missing Output Limit in Evaluation

Discovered that no maximum generation limit was set, leading to:

  • Token accumulation
  • Performance degradation
  • Slow execution speed

Planned improvement:

  • Add max token limit
  • Add output truncation safeguard
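A minimal sketch of the planned truncation safeguard, under stated assumptions: the 4096 cap and the whitespace-based token count are placeholders for illustration, not the harness's real tokenizer or limit.

```python
MAX_NEW_TOKENS = 4096  # assumed cap; tune per benchmark

def truncate_output(text: str, max_tokens: int = MAX_NEW_TOKENS) -> str:
    """Keep at most max_tokens whitespace-delimited tokens of text."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    # Truncate and mark the cut so downstream parsing can detect it.
    return " ".join(tokens[:max_tokens]) + " [truncated]"
```

Applying this after every generation step keeps oversized outputs from accumulating in the context and degrading speed.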

4. Patch Matching Issue

During SWE-bench Verified evaluation:

  • Patch format mismatches occurred
  • ToolCall formatting inconsistencies caused parsing failures

Conclusion:

Tool call formatting critically affects OpenHands behavior.
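One way to catch the format mismatches earlier (my illustration, not the evaluation harness's actual parser) is a cheap pre-check that a model-emitted patch at least looks like a unified diff before it is handed to evaluation:

```python
def looks_like_unified_diff(patch: str) -> bool:
    """Heuristic check for unified-diff structure: file headers and a hunk."""
    lines = patch.splitlines()
    has_old = any(line.startswith("--- ") for line in lines)
    has_new = any(line.startswith("+++ ") for line in lines)
    has_hunk = any(line.startswith("@@") for line in lines)
    return has_old and has_new and has_hunk
```

Rejecting malformed patches at this stage surfaces ToolCall formatting problems as explicit failures rather than silent parsing errors deep in the pipeline.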


3. Recommendation: native_tool_calling Configuration

For Qwen2.5-Coder-7B-Instruct + OpenHands + SWE-bench Verified:

Explicitly set:

native_tool_calling: true

Reasoning:

  • Avoid unstable automatic heuristics (default None)
  • Use structured tool formatting
  • Improve tool-call stability
  • Fully leverage Qwen’s native tool support
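An illustrative config.toml fragment with this setting; the section name and model string reflect my local setup and are assumptions, not prescribed values.

```toml
[llm]
model = "Qwen2.5-Coder-7B-Instruct"
# Explicit, rather than the default of None (auto-detection heuristics):
native_tool_calling = true
```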

4. Version Decision: OpenHands v0.54.0

Reasons for locking to v0.54.0:

  • Dataset generation version consistency
  • Fixed evaluation baseline
  • v0.55.0 introduces a security_risk parameter that requires extra environment configuration

Key evaluation settings:

enable_history_truncation = false
enable_default_condenser = true
condenser.type = noop
enable_condensation_request = false

5. Migration Completed

Architecture Comparison

| Item            | Old            | New (v0.54.0)          |
| --------------- | -------------- | ---------------------- |
| Package Manager | uv             | poetry                 |
| Inference       | swebench-infer | run_infer.sh           |
| Docker          | Manual build   | Per-instance auto pull |
| LLM Config      | JSON           | config.toml            |
| Evaluation      | swebench-eval  | eval_infer.sh          |

6. Current Status

✅ Architecture migration completed

✅ ToolCall pipeline verified

✅ Inference & evaluation chain fully connected

7. Plan for Tomorrow

Run full end-to-end SWE-bench Verified evaluation under the new architecture to validate stability.