
Question about a model training crash in a small lab

I was running a new vision model on a local server in Austin last Thursday, using about 40GB of VRAM. Halfway through a 72-hour training cycle, the whole system froze. The logs pointed to a memory leak in a custom data loader I'd written. I had to hard reboot, losing nearly a day of progress. I fixed it by forcing garbage collection between epochs and cutting the batch size in half. Has anyone else hit a similar wall with long training jobs on limited hardware?
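For what it's worth, here is a minimal sketch of the kind of cleanup I mean, assuming a PyTorch-style training loop. The dataset, model, and checkpoint names are stand-ins for illustration, not my actual job:

```python
import gc
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Stand-in data and model; the real run uses a custom vision dataset.
    data = TensorDataset(torch.randn(1024, 3, 64, 64),
                         torch.randint(0, 10, (1024,)))
    loader = DataLoader(data, batch_size=16,  # halved from the original 32
                        num_workers=2, pin_memory=True)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(3):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

        # Checkpoint every epoch so a freeze costs one epoch, not a day.
        torch.save(model.state_dict(), f"ckpt_epoch_{epoch}.pt")

        # Explicit cleanup between epochs: collect unreachable Python
        # objects, then release cached GPU blocks back to the driver.
        gc.collect()
        if device == "cuda":
            torch.cuda.empty_cache()

if __name__ == "__main__":
    main()
```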
3 comments

elliot_allen65
Austin's heat probably didn't help your server either.
6
thomas.sean
Ugh, memory leaks are the worst. I always run a short test cycle now to catch that stuff before a long job.
6
jennifer_west
Our last leak crashed the whole cluster.
1