
Question about a model training crash in a small lab

I was running a new vision model on a local server in Austin last Thursday, using about 40GB of VRAM. Halfway through a 72-hour training cycle, the whole system froze. The logs pointed to a memory leak in a custom data loader I'd written. I had to hard reboot, losing nearly a day of progress. I fixed it by forcing garbage collection between epochs and cutting the batch size in half. Has anyone else hit a similar wall with long training jobs on limited hardware?
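For what it's worth, here is a minimal sketch of the kind of cleanup I mean, assuming a PyTorch-style training loop. The dataset, model, and checkpoint names are stand-ins for illustration, not my actual job:

```python
import gc
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Stand-in data and model; the real run uses a custom vision dataset.
    data = TensorDataset(torch.randn(1024, 3, 64, 64),
                         torch.randint(0, 10, (1024,)))
    loader = DataLoader(data, batch_size=16,  # halved from the original 32
                        num_workers=2, pin_memory=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

        # Checkpoint every epoch so a freeze costs one epoch, not a day.
        torch.save(model.state_dict(), f"ckpt_epoch_{epoch}.pt")

        # Explicit cleanup between epochs: collect unreachable Python
        # objects, then release cached GPU blocks back to the driver.
        gc.collect()
        if device == "cuda":
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```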
3 comments

elliot_allen65
Austin's heat probably didn't help your server either.
6
thomas.sean
Ugh, memory leaks are the worst. I always run a short test cycle now to catch that stuff before a long job.
6
jennifer_west
Our last leak crashed the whole cluster.
1