Research Log: Multi-Subset Preprocessing & Cloud Training Stabilization — Feb 7, 2026


1. Overall Progress Overview

Today’s work primarily focused on local preprocessing of multiple data subsets and stabilizing the cloud-based training environment.
At this stage, the overall workflow has officially transitioned from the data preparation phase to the long-running training and monitoring phase.

Although multiple unexpected bugs were encountered during cloud training, they were gradually diagnosed and mitigated through iterative debugging and parameter adjustments. At present, the training pipeline has entered a relatively stable execution state.


2. Local Data Preprocessing

2.1 Subset Coverage

In the local environment, tokenizer processing and unified preprocessing were completed for all 11 subsets:

ALL_SUBSETS = [
    "Random-500", "Random-1000",
    "TopQ-500", "TopQ-1000",
    "ShortHQ-500", "SuccessOnly-500",
    "DimAblation-no_truncation_ratio",
    "DimAblation-no_outcome_success",
    "DimAblation-no_step_efficiency",
    "DimAblation-no_observation_noise",
    "DimAblation-no_action_diversity",
]

These subsets cover:

  • Random sampling baselines (Random)
  • Quality-based filtering strategies (TopQ / ShortHQ / SuccessOnly)
  • Multi-dimensional ablation settings (Dim Ablation)

Together, they form a structured and controllable experimental input space for subsequent comparative analysis.
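The coverage above can be sanity-checked by grouping subset names by their strategy prefix. The list repeats ALL_SUBSETS from above so the snippet is self-contained; the grouping helper itself is illustrative, not part of the project code.

```python
# Group the 11 subsets by the strategy prefix before the first '-'.
ALL_SUBSETS = [
    "Random-500", "Random-1000",
    "TopQ-500", "TopQ-1000",
    "ShortHQ-500", "SuccessOnly-500",
    "DimAblation-no_truncation_ratio",
    "DimAblation-no_outcome_success",
    "DimAblation-no_step_efficiency",
    "DimAblation-no_observation_noise",
    "DimAblation-no_action_diversity",
]

def group_by_family(subsets):
    """Bucket subset names by their strategy prefix (e.g., 'Random', 'TopQ')."""
    families = {}
    for name in subsets:
        family = name.split("-", 1)[0]
        families.setdefault(family, []).append(name)
    return families
```

Applied to ALL_SUBSETS, this yields two Random baselines, four quality-filtered subsets (TopQ ×2, ShortHQ, SuccessOnly), and five DimAblation settings.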


3. Cloud Training Progress (RunPod / A100)

3.1 Data Migration and Environment Issues

  • All preprocessed datasets were successfully transferred to the RunPod cloud environment.
  • However, multiple environment-level and runtime bugs were encountered during training initialization and execution.
  • The current strategy prioritizes small iterative runs with continuous monitoring, aiming to quickly localize issues while minimizing wasted GPU time.

3.2 Training Time Estimation

  • Under the current configuration, the estimated total training cost is approximately 50 A100 (80GB) GPU-hours.
  • The training job is still running.
  • Periodic manual log inspection is required to guard against long-duration silent failures.

4. Training Configuration and Issue Resolution

4.1 Core Training Setup

  • 2 epochs per subset
  • 500 trajectories per test case (i.e., 1,000 training samples over the 2 epochs)
  • 8 parallel configurations
  • Approximately 124 total training steps at an effective batch size of 8, so each sample is seen twice

This setup represents a compact yet complete experimental scale, balancing coverage and cost.
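The step count in 4.1 can be cross-checked against the batch settings from 4.2. The arithmetic below assumes partial batches at the end of each epoch are dropped, which is what makes the numbers land on ~124.

```python
# Back-of-envelope check of the total step count.
samples_per_subset = 500
epochs = 2
per_device_batch = 2   # after the OOM mitigation in 4.2
grad_accum = 4

effective_batch = per_device_batch * grad_accum          # 2 x 4 = 8
steps_per_epoch = samples_per_subset // effective_batch  # 62 (partial batch dropped)
total_steps = steps_per_epoch * epochs                   # 124
```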


4.2 Memory and Batch Size Optimization

Issue 1: GPU Out-of-Memory on Initial Run

  • The initial configuration triggered GPU OOM on the A100.
  • Mitigation steps:
    • Reduced batch size to 2
    • Increased gradient accumulation steps to 4

As a result, the effective batch size remains 8, preserving experimental comparability.
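The mitigation can be summarized as a before/after config sketch. The key names follow Hugging Face TrainingArguments conventions as an assumption about the stack, and the initial batch size of 8 is itself an assumption; the log only records that the first run hit OOM.

```python
# Before/after sketch of the OOM mitigation: shrink the per-device batch,
# compensate with gradient accumulation, and keep the effective batch at 8.
before = {"per_device_train_batch_size": 8, "gradient_accumulation_steps": 1}
after  = {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 4}

def effective_batch(cfg):
    """Samples contributing to each optimizer step."""
    return cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]
```

Because the effective batch is unchanged, runs before and after the fix remain directly comparable.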


4.3 Optimizer Instability (Critical Issue)

Issue 2: Training Crash at Step 11

  • The second run consistently crashed at step 11.
  • Investigation traced the issue to:
    • The PagedAdamW 8-bit optimizer from bitsandbytes
    • Unstable page-based memory management under the current training scale

Resolution:

  • Replaced PagedAdamW 8-bit with the non-paged variant
  • All optimizer states are now fully resident on GPU memory
  • Given the 80GB VRAM capacity of A100, the additional memory overhead is acceptable
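The swap above can be sketched as a one-line config change. The identifiers follow the Hugging Face `TrainingArguments.optim` naming for the bitsandbytes optimizers; that this project selects its optimizer this way is an assumption.

```python
# Optimizer selection sketch: avoid the paged 8-bit AdamW implicated in the
# step-11 crashes and keep optimizer states fully resident on the GPU.
PAGED_OPTIM = "paged_adamw_8bit"    # bitsandbytes paged variant (crashed at step 11)
NON_PAGED_OPTIM = "adamw_bnb_8bit"  # non-paged 8-bit AdamW, states stay on GPU

def select_optim(avoid_paged=True):
    """Prefer the non-paged optimizer; revert only to deliberately
    reproduce the crash for diagnosis."""
    return NON_PAGED_OPTIM if avoid_paged else PAGED_OPTIM
```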

5. Current Status and Expectations

After applying the above fixes, the training process has been restarted.

Key points under observation:

  • Whether the failure reoccurs around step 11 or the 11th test case
  • If no recurrence is observed, the previous crashes can be confidently attributed to the paged optimizer mechanism

The current priority is training stability, ensuring a reliable foundation for downstream analysis and ablation studies.


6. Next Steps (Brief)

  • Continue monitoring the ongoing training job
  • If the current run completes successfully:
    • Conduct performance comparisons across different subsets
    • Focus on analyzing the impact of quality-based filtering and each ablation dimension on model performance