21
Talked to a NOC engineer at a conference last week and he wrecked my whole approach to redundancy
He said most outages come not from a single box failing but from the failover itself. He told me about a client that had dual firewalls, but the secondary had a stale config from 2019. When the primary died, the backup brought down half the VLANs. Checked my own kit that night and yep, same problem. How often do you guys actually test your failover scenarios with production traffic?
2 comments
paige_harris13d agoTop Commenter
Oh man, that NOC engineer was probably running on 2 hours of sleep and stale coffee too. Ran into a similar thing at a previous job, where our backup DNS server had a bad cache file from the last admin, who left in 2017. When the primary DNS took a dirt nap, the secondary started serving up totally wrong IPs for half our internal apps. The whole failover test thing is like going to the dentist: everyone knows they should, but nobody wants to actually do it. We finally started doing quarterly "break stuff on purpose" days where we'd kill a primary system during lunch and see what actually happened. First time we did it, a backup storage array just flat out refused to take over because the replication job had been silently failing for 8 months. Now I check failover configs every time I patch anything, just out of pure paranoia.
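For anyone who wants to automate that paranoia, here's a rough sketch of the kind of check I mean, not a real tool. It flags two things from this thread: a secondary whose config has drifted from the primary, and a replication job whose last good run is too old. Everything here is made up for illustration: the function names, the one-day threshold, and the assumption that config comments start with `#` (plenty of gear uses something else). How you actually pull the configs and the last-sync timestamp depends on your kit.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def config_digest(text: str) -> str:
    """Hash a config after dropping blank lines and '#' comment lines.

    The '#' comment prefix is an assumption; adjust for your device syntax.
    """
    lines = [ln.strip() for ln in text.splitlines()]
    kept = [ln for ln in lines if ln and not ln.startswith("#")]
    return hashlib.sha256("\n".join(kept).encode("utf-8")).hexdigest()


def failover_warnings(primary_cfg, secondary_cfg, last_sync, now,
                      max_age=timedelta(days=1)):
    """Return a list of human-readable warnings (empty list means no findings).

    last_sync is the time of the last *successful* replication or config push.
    max_age is a hypothetical threshold, not a recommendation.
    """
    warnings = []
    if config_digest(primary_cfg) != config_digest(secondary_cfg):
        warnings.append("config drift between primary and secondary")
    age = now - last_sync
    if age > max_age:
        warnings.append(f"last successful sync was {age.days} days ago")
    return warnings


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for w in failover_warnings("vlan 10\nvlan 20\n", "vlan 10\n",
                               now - timedelta(days=240), now):
        print("WARNING:", w)
```

A check like this only tells you the backup looks right on paper. It doesn't replace actually killing the primary and watching what happens.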
3
grant_ross2713d ago
It's funny how that same pattern shows up everywhere, not just in IT. You see it with people who never test their smoke detectors or check their spare tire until something goes wrong. We all get comfortable assuming the backup plan works until reality proves otherwise.
3