Research Log: OOM Debugging, H200 Migration, and Evaluation Troubleshooting — Feb 11, 2026
1. Training Progress
1. Completed Tasks
All previous training tasks have been completed successfully.
The overall pipeline is running stably.
2. Experiment 7 — OOM Investigation
Today’s main focus was Experiment 7.
Training repeatedly failed with Out-Of-Memory (OOM) errors.
Runs were attempted on both H100 and A100 80G GPUs, and neither setup completed successfully.
3. Root Cause Analysis
Initially, the issue was suspected to be related to truncation during data preprocessing.
Further investigation revealed that the core problem was:
- Extremely long average sequence lengths
- Some samples approaching the maximum token limit
- Dynamic padding aligning batches to the longest sequence
- Memory demand exceeding the 80G capacity of the GPU
- Backpropagation further amplifying memory consumption
Conclusion:
The issue was not caused by incorrect code logic, but by a hard constraint conflict between sequence length distribution and GPU memory limits.
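The effect of dynamic padding described above can be illustrated with a minimal sketch. The function names and the sequence lengths below are hypothetical and are not taken from Experiment 7. The sketch only counts allocated tokens. Activation memory grows at least linearly with this count, so it gives a rough lower bound on the effect rather than a real memory estimate.

```python
from typing import List


def padded_batch_tokens(lengths: List[int], batch_size: int) -> List[int]:
    """Tokens allocated per batch when each batch is padded to its longest sequence."""
    totals = []
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Every sample in the batch is padded up to the longest one.
        totals.append(max(batch) * len(batch))
    return totals


def padding_overhead(lengths: List[int], batch_size: int) -> float:
    """Ratio of allocated (padded) tokens to real tokens across all batches."""
    real = sum(lengths)
    padded = sum(padded_batch_tokens(lengths, batch_size))
    return padded / real


# Hypothetical lengths: one near-limit sample dominates the whole batch.
example_lengths = [512, 600, 8000, 700]
print(padded_batch_tokens(example_lengths, batch_size=4))
print(round(padding_overhead(example_lengths, batch_size=4), 2))
```

In this illustration a single long sample sets the padded length for the whole batch. This is why a few samples near the token limit can push a batch over the memory limit even when most samples are short.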
4. Final Strategy & Result
Final decision:
- Keep the original code structure unchanged
- Switch to H200 for training
The H200's larger memory capacity (141 GB versus 80 GB) and higher memory bandwidth were enough to fit the workload.
Training completed smoothly in approximately 2.5 hours, and all results have been successfully generated.
2. Model & Evaluation Progress
1. Model Side
Model loading and execution logic are largely complete.
Some interface and configuration details still require refinement.
All necessary updates have been documented.
2. Tool Calling Issue (OpenHands Architecture)
When evaluating the base model using the OpenHands framework, the model exhibits repeated outputs (looping behavior).
The suspected root cause is a tool-call orchestration issue.
Further isolation testing is required.
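One possible starting point for the isolation testing is to flag looping automatically in saved model outputs. The sketch below is an assumption-laden helper, not part of OpenHands: it takes a plain list of output strings (how they are extracted from a run is left open) and uses a hypothetical threshold of three identical consecutive outputs.

```python
from typing import List, Optional


def find_repeated_run(outputs: List[str], min_repeats: int = 3) -> Optional[int]:
    """Return the start index of the first run of at least `min_repeats`
    identical consecutive outputs, or None if there is no such run."""
    run_start = 0
    for i in range(1, len(outputs) + 1):
        # A run ends at the end of the list or when the output changes.
        if i == len(outputs) or outputs[i].strip() != outputs[run_start].strip():
            if i - run_start >= min_repeats:
                return run_start
            run_start = i
    return None


# Hypothetical trajectory: the same step is emitted three times in a row.
steps = ["ls", "cat a.txt", "cat a.txt", "cat a.txt", "finish"]
print(find_repeated_run(steps))
```

This only catches exact consecutive repeats. Alternating patterns (A, B, A, B) would need a different check, so it is a first filter rather than a full diagnosis.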
3. Next Steps
Although the inference flow is fully operational, the evaluation-side logic may still require adjustments.
Planned focus:
- Verify that the scaffold is set up correctly
3. Development Reflection & Improvement
1. Importance of Documentation & Version Tracking
During project review, several issues became evident:
- Key materials were scattered
- Experiment versions lacked clear tracking
- Call logic was difficult to trace
As project complexity increases, relying on memory is no longer sustainable.
2. Standardization Strategy
Starting today:
- Add version descriptions to folders
- Write detailed commit messages
- Maintain structured experiment records
The goal is to build a traceable, reproducible, and auditable development workflow.
3. Summary
As project scale and experiment count grow, detailed documentation is no longer optional; it is infrastructure.
This marks an important step forward in personal development and engineering discipline.