I hate when things don’t go my way. One server in our NLB (Network Load Balancing) cluster did not want to join the cluster anymore. When I issued a wlbs start
, it tried for a couple of seconds to join the cluster, but then remained in ‘Converging’ state. A couple of times I saw an entry in the System Log " ... does not have the same number or type of port rules ..."
.
I tried:
- Reboot: worked 1st time, but after that: did not help
- Compare rules 1: compared output of
wlbs display
: changed all load parameters to ‘Equal’ (I normally give the servers weights that take into account the # of processors and #MB RAM) - Compare rules 2: compared output of
wlbs display
: identical - Compare rules 3: compared output of
regedit -e wlbs.reg.%COMPUTERNAME%.reg HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesWLBS
: no significant differences - Recreate rules 1: recreate all 18 port rules on rogue server (so much fun :-/ ): did not help
- Recreate rules 2: change ALL servers to use only 1 rule: did not help
- Curse: did not help
- Restart service: disable NLB on network adapter, press OK, re-enable NLB => Bingo! Server back in the cluster without a blink.
Now I only have to re-create the 18 rules on all servers and I’m done!
Mental note to self: check out if I can build a dedicated load balancing device in Linux, one that (1) takes into account server load (give work to least busy server) (2) response time on individual ports (and automatically disable non-responsive ports) (3) has a web interface, so I can configure from any server in the subnet.