Research Log: OOM Debugging, H200 Migration, and Evaluation Troubleshooting — Feb 11, 2026


1. Training Progress

1. Completed Tasks

All previous training tasks have been completed successfully.
The overall pipeline is running stably.

2. Experiment 7 — OOM Investigation

Today’s main focus was Experiment 7.
During training, repeated Out-Of-Memory (OOM) issues occurred.

Multiple runs were attempted on both H100 and A100 80G GPUs, but neither setup completed training.

3. Root Cause Analysis

Initially, the issue was suspected to be related to truncation during data preprocessing.
Further investigation revealed that the core problem was a combination of factors:

  • Extremely long average sequence lengths
  • Some samples approaching the maximum token limit
  • Dynamic padding aligning batches to the longest sequence
  • Peak memory usage exceeding the 80G capacity of the GPU
  • Backpropagation further amplifying memory consumption

Conclusion:
The issue was not caused by incorrect code logic, but by a hard constraint conflict between sequence length distribution and GPU memory limits.
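
The effect of dynamic padding described above can be illustrated with a small sketch. It is not the training code from Experiment 7: the function names and the example lengths are hypothetical, and it only counts tokens rather than measuring GPU memory. The point is that a single very long sample sets the padded size of its whole batch, and activation memory grows at least in proportion to that padded token count.

```python
from typing import List


def padded_tokens_per_batch(lengths: List[int], batch_size: int) -> List[int]:
    """Return the padded token count of each batch when every sample is
    padded to the longest sequence in its batch (dynamic padding)."""
    totals = []
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        totals.append(max(batch) * len(batch))
    return totals


def padding_overhead(lengths: List[int], batch_size: int) -> float:
    """Fraction of all padded tokens that are padding rather than real tokens."""
    padded = sum(padded_tokens_per_batch(lengths, batch_size))
    real = sum(lengths)
    return (padded - real) / padded


# Hypothetical token lengths, NOT taken from the Experiment 7 dataset.
example_lengths = [1000, 1200, 900, 8000, 1100, 1000, 950, 1050]
per_batch = padded_tokens_per_batch(example_lengths, batch_size=4)
worst_batch = max(per_batch)
```

In this made-up example the batch containing the 8000-token sample is padded to 32000 tokens while the other batch needs only 4400, so peak memory is dictated by the outlier rather than by the average length.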

4. Final Strategy & Result

Final decision:

  • Keep the original code structure unchanged
  • Switch to H200 for training

The H200's larger memory capacity (141 GB of HBM3e, versus the 80G cards tested earlier) and higher memory bandwidth removed the constraint.
Training completed smoothly in approximately 2.5 hours, and all results have been successfully generated.


2. Model & Evaluation Progress

1. Model Side

Model loading and execution logic are largely complete.
Some interface and configuration details still require refinement.
All necessary updates have been documented.

2. Tool Calling Issue (OpenHands Architecture)

When evaluating the base model using the OpenHands framework, the model exhibits repeated outputs (looping behavior).

The suspected root cause is a tool-call orchestration issue; this has not yet been confirmed.

Further isolation testing is required.
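
One possible starting point for the isolation testing is a check that flags looping in a recorded trajectory. The sketch below is an assumption-laden illustration: it works on a plain list of output strings and does not use any OpenHands API, so extracting the model outputs from an evaluation run is left out. The function name and the `min_repeats` threshold are hypothetical.

```python
from typing import List, Optional


def find_repeated_run(outputs: List[str], min_repeats: int = 3) -> Optional[int]:
    """Return the index where a run of at least `min_repeats` identical
    consecutive outputs starts, or None if there is no such run."""
    run_start = 0
    for i in range(1, len(outputs) + 1):
        # A run ends at the end of the list or when the output changes.
        if i == len(outputs) or outputs[i].strip() != outputs[run_start].strip():
            if i - run_start >= min_repeats:
                return run_start
            run_start = i
    return None


# Hypothetical trajectory, not taken from an actual evaluation log.
trajectory = ["ls", "cat a.py", "cat a.py", "cat a.py", "done"]
loop_start = find_repeated_run(trajectory)
```

Running the same check on trajectories produced with and without tool calling enabled would help separate a model-side repetition problem from an orchestration problem.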

3. Next Steps

Although the inference flow is fully operational, the evaluation-side logic may still require adjustments.

Planned focus:

  • Verify that the scaffold is set up correctly

3. Development Reflection & Improvement

1. Importance of Documentation & Version Tracking

During project review, several issues became evident:

  • Key materials were scattered
  • Experiment versions lacked clear tracking
  • Call logic was difficult to trace

As project complexity increases, relying on memory is no longer sustainable.

2. Standardization Strategy

Starting today:

  • Add version descriptions to folders
  • Write detailed commit messages
  • Maintain structured experiment records

The goal is to build a traceable, reproducible, and auditable development workflow.
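
A structured experiment record could look roughly like the sketch below. The field names, the `ExperimentRecord` class, and the one-JSON-line-per-experiment layout are assumptions for illustration, not an existing convention in this project. The code version is left as a placeholder because the actual commit is not recorded in this log.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Dict


@dataclass
class ExperimentRecord:
    name: str
    code_version: str  # e.g. a git commit hash; placeholder below
    hardware: str
    status: str
    notes: str = ""
    config: Dict[str, object] = field(default_factory=dict)


def to_json_line(record: ExperimentRecord) -> str:
    """Serialise a record as one JSON line, suitable for appending to a log file."""
    return json.dumps(asdict(record), sort_keys=True)


record = ExperimentRecord(
    name="experiment-7",
    code_version="<commit hash>",  # hypothetical placeholder, fill in per run
    hardware="H200",
    status="completed",
    notes="OOM on H100 and A100 80G; moved to H200 without code changes.",
)
line = to_json_line(record)
```

Appending one such line per run keeps the history diffable in version control and easy to load back for later comparison.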

3. Summary

As project scale and experiment count grow, detailed documentation is no longer optional; it is infrastructure.

This marks an important step forward in personal development and engineering discipline.