Research Log: Model Debugging & OpenHands Migration — Feb 13, 2026
1. Focus of the Day: Model Debugging & Framework Migration
Today’s work focused on model-side testing and OpenHands framework refactoring.
On the model side, I reconfigured the tool-loading logic and aligned it with the evaluation server settings. The full pipeline now runs successfully, with one constraint under the current configuration: ToolCall must be enabled.
I also identified structural issues in the previously used OpenHands v1.0.0 architecture. It is not suitable as a stable evaluation baseline, so I decided to migrate to a more stable version (v0.54.0; see section 4).
2. Debugging & Fixes
1. Workflow Cleanup
- Removed legacy scripts
- Cleaned outdated workflow notes
- Reorganized the main evaluation pipeline
2. Freeze & Timeout Issue
Symptoms:
- Inference process stalled
- Timeout did not properly release resources
Fix:
- Enabled streaming output (stream=True)
Result:
- Significant improvement in stability
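One reason streaming helps here is that a stalled generation becomes observable: if no chunk arrives for a while, the caller can give up instead of blocking indefinitely. The sketch below illustrates that idea with an idle timeout over any iterator of text chunks. The names `collect_stream` and `StreamStalled` are hypothetical and not part of OpenHands or the evaluation server. The log does not record how the timeout was actually handled, so this is one possible approach, not the fix that was applied.

```python
import queue
import threading
from typing import Iterable

_DONE = object()  # sentinel marking the normal end of the stream


class StreamStalled(TimeoutError):
    """Raised when no chunk arrives within the idle window."""


def collect_stream(chunks: Iterable[str], max_idle_seconds: float) -> str:
    """Join streamed chunks, failing fast if the producer goes quiet.

    `chunks` stands in for whatever iterator a stream=True call returns
    (an assumption; adapt it to the real client object).
    """
    q: queue.Queue = queue.Queue()

    def pump() -> None:
        # Drain the stream in a background thread so the caller can time out.
        try:
            for chunk in chunks:
                q.put(chunk)
            q.put(_DONE)
        except Exception as exc:  # forward producer errors to the caller
            q.put(exc)

    threading.Thread(target=pump, daemon=True).start()

    parts = []
    while True:
        try:
            item = q.get(timeout=max_idle_seconds)
        except queue.Empty:
            raise StreamStalled(f"no chunk for {max_idle_seconds}s") from None
        if item is _DONE:
            return "".join(parts)
        if isinstance(item, Exception):
            raise item
        parts.append(item)
```

Note that this only stops waiting. It does not close the underlying connection, so the caller still has to release that resource itself when `StreamStalled` is raised.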
3. Missing Output Limit in Evaluation
Discovered that no maximum generation limit was set, leading to:
- Token accumulation
- Performance degradation
- Slow execution speed
Planned improvement:
- Add max token limit
- Add output truncation safeguard
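A minimal sketch of the two planned safeguards follows. Both helper names are hypothetical. The `max_tokens` key assumes the evaluation server accepts OpenAI-style request options, which the log does not confirm. The truncation helper works on characters rather than tokens to stay tokenizer-independent.

```python
def generation_kwargs(max_new_tokens: int) -> dict:
    """Build request options with an explicit output cap (hypothetical helper)."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    # "max_tokens" is an assumed parameter name (OpenAI-compatible servers);
    # stream=True mirrors the freeze fix described earlier in this log.
    return {"max_tokens": max_new_tokens, "stream": True}


def truncate_output(
    text: str, max_chars: int, marker: str = "\n[output truncated]"
) -> str:
    """Cap `text` at `max_chars` characters, ending with a visible marker."""
    if max_chars < len(marker):
        raise ValueError("max_chars is too small to hold the marker")
    if len(text) <= max_chars:
        return text
    # Reserve room for the marker so the result never exceeds max_chars.
    return text[: max_chars - len(marker)] + marker
```

The server-side cap bounds generation cost, while the client-side truncation protects downstream parsing and logging if the cap is missing or ignored.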
4. Patch Matching Issue
During SWE-bench Verified evaluation:
- Patch format mismatches occurred
- ToolCall formatting inconsistencies caused parsing failures
Conclusion:
Tool call formatting critically affects OpenHands behavior.
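To surface formatting problems before they reach the agent loop, a strict validator can reject malformed tool calls early. The sketch below assumes an OpenAI-style payload: a JSON object with a `name` string and an `arguments` field that is either an object or a JSON-encoded string. That shape is an assumption, not the format recorded in this log, and `parse_tool_call` is a hypothetical helper.

```python
import json
from typing import Any, Dict, Tuple


def parse_tool_call(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Return (name, arguments) or raise ValueError with a specific reason."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call is not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")

    name = call.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("tool call is missing a non-empty string 'name'")

    args = call.get("arguments", {})
    if isinstance(args, str):
        # Some servers send arguments as a JSON-encoded string (assumed case).
        try:
            args = json.loads(args)
        except json.JSONDecodeError as exc:
            raise ValueError(f"'arguments' is not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("'arguments' must decode to a JSON object")
    return name, args
```

Logging the `ValueError` message per instance makes it easier to tell model formatting errors apart from genuine patch failures.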
3. Recommendation: native_tool_calling Configuration
For Qwen2.5-Coder-7B-Instruct + OpenHands + SWE-bench Verified:
Explicitly set:
native_tool_calling = true
Reasoning:
- Avoid the unstable automatic heuristics that apply when the option is left unset (default None)
- Use structured tool formatting
- Improve tool-call stability
- Fully leverage Qwen’s native tool support
4. Version Decision: OpenHands v0.54.0
Reasons for locking to v0.54.0:
- Consistency with the version used for dataset generation
- Fixed evaluation baseline
- v0.55.0 introduces a security_risk parameter that requires extra environment configuration
Key evaluation settings:
enable_history_truncation = false
enable_default_condenser = true
condenser.type = noop
enable_condensation_request = false
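Because these four settings define the baseline, a small pre-run check can catch accidental drift. The sketch below compares a flat dictionary of loaded settings against the values listed above. The function name and the flat dotted-key layout are assumptions for illustration; the real config.toml nests these keys under sections that this log does not spell out.

```python
from typing import Any, Dict, Tuple

# Pinned evaluation settings, copied from the list above.
EXPECTED_SETTINGS: Dict[str, Any] = {
    "enable_history_truncation": False,
    "enable_default_condenser": True,
    "condenser.type": "noop",
    "enable_condensation_request": False,
}

MISSING = "<missing>"  # placeholder reported for absent keys


def find_setting_drift(actual: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each drifted key to (expected, actual); empty dict means no drift."""
    drift = {}
    for key, expected in EXPECTED_SETTINGS.items():
        value = actual.get(key, MISSING)
        if value != expected:
            drift[key] = (expected, value)
    return drift
```

Running this before each evaluation and aborting on a non-empty result keeps results comparable across runs.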
5. Migration Completed
Architecture Comparison
| Item | Old | New (v0.54.0) |
|---|---|---|
| Package Manager | uv | poetry |
| Inference | swebench-infer | run_infer.sh |
| Docker | Manual build | Per-instance auto pull |
| LLM Config | JSON | config.toml |
| Evaluation | swebench-eval | eval_infer.sh |
6. Current Status
✅ Architecture migration completed
✅ ToolCall pipeline verified
✅ Inference & evaluation chain fully connected
7. Plan for Tomorrow
Run full end-to-end SWE-bench Verified evaluation under the new architecture to validate stability.