Research Log: Kindergarten App & Model Evaluation — Feb 3, 2026
The Trigger: A Broken Kindergarten App
As a member of the kindergarten parents’ committee, I attended a meeting yesterday where one complaint kept resurfacing: the kindergarten’s app is genuinely bad.
What is presented as an “app” is essentially a web page wrapped inside a mobile shell. It is slow, unresponsive, and media content—especially photos and videos—often fails to load.
After the meeting, I outlined a product concept and basic feature plan and reached out to a software outsourcing team in China. My initial approach is intentionally minimal:
👉 design a single main-screen UI to validate visual quality and interaction flow before deciding whether the idea is worth deeper investment.
This issue points to a clear market gap.
Kindergartens in Germany are undergoing digitalization, yet most existing solutions are outdated and built with a “just usable” mindset rather than a user-centric one. I see this as a meaningful opportunity worth exploring.
Research Progress Update
Today’s work focused on model evaluation and inference infrastructure.
I began by deploying a 16k-context model on an RTX 6000 Ada, but soon ran into a subtle networking issue.
Because the RunPod instance was hosted in the US and I was calling it over an HTTP-based API, Cloudflare’s proxy timeout conflicted with my local and evaluation-side timeouts, causing inference to stall indefinitely at step 1.
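The failure mode can be sketched as follows. Cloudflare’s proxy closes a connection after roughly 100 seconds without a response by default, so a generous client-side timeout is effectively ignored: whichever timeout is smaller wins. (The 100 s figure and the helper below are illustrative assumptions, not values taken from my setup.)

```python
# Illustrative sketch of the timeout mismatch. A request only survives
# as long as the SMALLEST timeout anywhere on the path allows, so a
# long-running, non-streaming inference call dies at the proxy first.

PROXY_TIMEOUT_S = 100  # Cloudflare's default proxy timeout (assumption)


def effective_timeout(client_timeout_s: float,
                      proxy_timeout_s: float = PROXY_TIMEOUT_S) -> float:
    """The request actually fails at the smaller of the two timeouts."""
    return min(client_timeout_s, proxy_timeout_s)


# A 600 s client timeout is useless behind a 100 s proxy:
print(effective_timeout(600))  # -> 100
```

Practical mitigations are to keep slow requests below the proxy limit, stream tokens so the connection is never idle, or bypass the proxy entirely (e.g., direct TCP port exposure instead of an HTTP proxy), which is effectively what the A100 setup change accomplished for me.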
After identifying the root cause, I switched to an A100 setup.
Inference speed improved dramatically, the overall pipeline behaved far more reliably, and patch submissions completed as expected instead of stalling.
Issues and Findings
After adjusting the YAML configuration, the run failed halfway due to a simple but costly oversight:
Docker image pulls require authentication, and I had forgotten to log in.
I restarted the process and documented this step carefully to avoid repeating the mistake. However, the more serious problem emerged during result analysis:
- The vast majority of submitted patches were empty
- The few non-empty patches were clearly hallucinated
- Almost none were practically applicable
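A quick tally is enough to make this pattern quantitative. The sketch below assumes SWE-bench-style predictions where each record carries a `model_patch` field; the field name and the sample data are assumptions, so adjust them to your harness’s actual output format.

```python
# Count empty vs. non-empty patches in a list of prediction records.
# "model_patch" follows SWE-bench naming conventions (an assumption here).

def summarize(preds):
    """Return (empty, non_empty) counts over prediction dicts."""
    empty = sum(
        1 for p in preds
        if not (p.get("model_patch") or "").strip()  # None/""/whitespace
    )
    return empty, len(preds) - empty


# Toy records standing in for real harness output:
preds = [
    {"model_patch": ""},                       # empty submission
    {"model_patch": "diff --git a/x b/x\n"},   # non-empty (possibly bogus)
    {"model_patch": None},                     # missing patch
]
print(summarize(preds))  # -> (2, 1)
```

Running this kind of tally over the real results is what surfaced the skew toward empty submissions; whether the non-empty remainder actually applies still has to be checked separately (e.g., with `git apply --check`).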
This led to an unavoidable conclusion:
👉 Qwen2.5 Coder 3B likely lacks the fundamental capability to handle this class of tasks.
Using this model as a baseline for such evaluations is, in practice, unrealistic.
Next Steps
The outcome is disappointing, but the direction is now clearer.
I need to search for a more viable approach—one that balances model capacity, compute cost, and practical effectiveness.
The search continues.
The road is still ahead.