Optimizing SWE-agent Evaluation Pipelines (2026.02.02)


📝 Technical Daily: SWE-agent Evaluation Post-Mortem & Optimization

Date: February 2, 2026

1. Progress & Root Cause Analysis (RCA)

  • Storage & Environment Bottlenecks:
    • The evaluation endpoint’s 600 GB storage reached capacity due to accumulated, uncleared Docker images, causing the run to crash at instance #298.
    • Lessons Learned: Although an additional 800 GB volume was mounted, Docker’s default data root remained on the primary partition, so images continued to accumulate there. Combined with Docker Hub rate limiting and RunPod node instability, this led to multiple execution failures (a relocation/cleanup sketch follows this list).
  • Compute Resource Tuning:
    • Experimented with both the NVIDIA RTX 4090 and the A6000. Although the A6000 offers more VRAM, improper parameter configuration led to sub-optimal performance, so I have temporarily reverted to the 4090 to maintain a stable baseline.
  • Current Status:
    • After overnight debugging, I completed a full inference pass.
    • Throughput: 29 instances produced submissions during the SWE-agent inference phase (~9.7% submission rate).

2. Key Findings & Evaluation Metrics

  • Evaluation Pipeline: Confirmed that SWE-agent does not trigger automatic evaluation by default (online evaluation requires explicit activation). I have now successfully established a local evaluation pipeline to bypass external dependencies.
  • Preliminary Results:
    • Out of the 29 submissions, only 19 contained non-empty patches.
    • Pass@1 Rate: currently 0%.
  • Failure Analysis: Investigating two primary hypotheses:
    1. Pipeline/Environment Mismatch: The generated patches may not be applying correctly due to configuration errors or script mismatches (a quick apply-check sketch follows this list).
    2. Model Capability Ceiling: The current model may lack the reasoning depth required for complex SWE-bench tasks.

3. Optimization Roadmap & Next Steps

  • Throughput & Concurrency Tuning:
    • Worker Optimization: Calibrate the number of parallel workers to find the “sweet spot” between VRAM utilization and system stability, preventing OOM (out-of-memory) crashes (a simple VRAM monitoring sketch follows this list).
    • Context Window Expansion: Test increasing the token limit from 8k to 16k to evaluate the marginal gain of longer context on problem-solving accuracy.
  • Model Validation (Baseline Alignment):
    • Following the advisor’s recommendation, I will benchmark Qwen2.5-Coder (Flash) with a 32k context window.
    • Benchmark Goal: Target a 20-30% success rate. If achieved, it validates the pipeline; if the run falls far short, that points to a pipeline or environment issue rather than a model capability ceiling.
  • Config & Training:
    • Refine the swe-agent.yaml configuration to reduce inference redundancy.
    • Simultaneously push forward with SFT (Supervised Fine-Tuning) and Dataset Refinement.

4. Resource Utilization

  • Cursor Pro Quota: Consumption is extremely high due to intensive code generation and debugging. The new account reached 1/3 of its weekly quota within the first 24 hours. Usage management is now a priority.

[Figure: SWE-agent evaluation screenshot]