vLLM and SWE Agent—2026.02.01


February 1, 2026

Context

I had previously trained the Qwen/Qwen2.5-Coder-3B model on an H200 and set up vLLM for parallel inference on RunPod for the inference phase of the SWE Benchmark.
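Concretely, that inference phase boils down to batch generation over pre-built prompts, which is where vLLM's parallelism pays off. Here is a minimal sketch of that step using vLLM's offline API; the prompt file, its fields, and the output path are placeholders rather than the real pipeline layout:

```python
# Minimal sketch of the inference phase: batch-generate patches with vLLM.
# "swe_prompts.jsonl", its fields, and "predictions.jsonl" are placeholders.
import json

from vllm import LLM, SamplingParams

with open("swe_prompts.jsonl") as f:
    records = [json.loads(line) for line in f]  # assumed fields: instance_id, prompt

llm = LLM(model="Qwen/Qwen2.5-Coder-3B", dtype="bfloat16", max_model_len=16384)
params = SamplingParams(temperature=0.0, max_tokens=2048)

# vLLM schedules and batches these prompts internally, so one call covers the run.
outputs = llm.generate([r["prompt"] for r in records], params)

with open("predictions.jsonl", "w") as f:
    for rec, out in zip(records, outputs):
        f.write(json.dumps({
            "instance_id": rec["instance_id"],
            "model_name_or_path": "Qwen/Qwen2.5-Coder-3B",
            "model_patch": out.outputs[0].text,
        }) + "\n")
```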

The evaluation phase was run on a 32 vCPU Hetzner server, using the prediction files produced in the inference phase.
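Evaluation runs are long enough that it's worth sanity-checking the prediction file before starting. A small sketch, assuming SWE-bench-style prediction entries (instance_id / model_name_or_path / model_patch) and a placeholder path:

```python
# Quick sanity check on a prediction file before kicking off an evaluation run.
# "predictions.jsonl" is a placeholder path; the key names follow the
# SWE-bench prediction format as I understand it.
import json

REQUIRED = {"instance_id", "model_name_or_path", "model_patch"}

with open("predictions.jsonl") as f:
    preds = [json.loads(line) for line in f]

bad_keys = [p for p in preds if not REQUIRED.issubset(p)]
empty_patch = [p for p in preds if not (p.get("model_patch") or "").strip()]

print(f"{len(preds)} predictions")
print(f"{len(bad_keys)} entries missing required keys")
print(f"{len(empty_patch)} entries with an empty patch (these can never count as submitted)")
```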

I found that most evaluation runs were exiting early for various reasons.

After talking with my advisor, I realized we needed to use the standard SWE Agent for evaluation, not the current setup.


What I did today

  • Simulated the RunPod model-serving flow on a low-end GPU; since vLLM dependencies were already working, it succeeded on the first try
  • Ran into many issues on the SWE Agent evaluation side (the vCPU server), but eventually got the full pipeline running
  • The first evaluation results were very discouraging; after some adjustments, I fixed the RunPod HTTP port connectivity issue (see the connectivity-check sketch after this list) and could continue with the rest of the pipeline
  • Updated the model. After a lot of errors, I found that the base model itself was problematic, so I switched to Qwen/Qwen2.5-Coder-3B-Instruct; for now I’m only testing the pre-SFT model’s pass rate on SWE Lite
    • My expectations for the experiment:
      1. The base model has some pass rate
      2. A small gain after SFT
      3. Further gain after refining on the dataset
  • One aside: I found that SWE Agent evaluation is a multi-turn interaction with the model,
    while the data we use for training (SWE-Gym / OpenHands-SFT-Trajectories) consists of very long tool-call trajectories, often 30k+ tokens (see the token-length sketch at the end of this post).
    That made me wonder:

    Is training on OpenHands data actually effective for improving on the SWE Agent benchmark?
    After going back and forth with an AI assistant, I concluded that the approach is plausible in principle

  • After running, the submitted rate was very low, with various abnormal exits
    • Running ~300 evaluations takes about 1 hour
    • While I was putting the kid to sleep, the terminal got closed by accident, so I never saw the final results
  • I then tried a new SWE Agent YAML config and tuned the number of workers on the Hetzner server to 6
    • In the end I still ran with the old YAML
    • There’s clearly more room to optimize, but the pipeline can now run continuously
    • The submitted rate improved somewhat
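On the RunPod HTTP port issue mentioned above: the quickest way to isolate that kind of failure is to hit the served endpoint directly, separately from the agent. A minimal sketch, assuming vLLM's OpenAI-compatible server behind a placeholder URL; the real RunPod proxy URL and port differ:

```python
# Minimal connectivity check against the vLLM OpenAI-compatible server.
# BASE_URL is a placeholder; on RunPod the HTTP proxy URL and exposed port
# must match what vLLM was started with.
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"  # placeholder endpoint

client = OpenAI(base_url=BASE_URL, api_key="EMPTY")  # vLLM accepts any key unless one is configured

# If this call hangs or errors, the problem is the port/proxy mapping, not the model.
print("served models:", [m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-3B-Instruct",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=5,
)
print("reply:", resp.choices[0].message.content)
```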

SWE Lite evaluation in progress
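Returning to the aside about trajectory length: before deciding whether the OpenHands-style data needs truncation or chunking for SFT, it helps to measure the token lengths directly. A rough sketch, assuming the trajectories have been exported to a local JSONL with a "messages" field; the real SWE-Gym / OpenHands-SFT-Trajectories layout may differ:

```python
# Rough sketch: measure how long the SFT trajectories are in tokens.
# "trajectories.jsonl" and its "messages" field are assumptions about an
# exported layout, not the actual SWE-Gym / OpenHands file format.
import json

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

lengths = []
with open("trajectories.jsonl") as f:
    for line in f:
        messages = json.loads(line)["messages"]
        # apply_chat_template includes the same special tokens an SFT run would see
        token_ids = tok.apply_chat_template(messages, tokenize=True)
        lengths.append(len(token_ids))

lengths.sort()
print(f"n={len(lengths)}  median={lengths[len(lengths) // 2]}  max={lengths[-1]}")
print(f"share over 30k tokens: {sum(n > 30_000 for n in lengths) / len(lengths):.1%}")
```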