I entered the office and fired up my trusty VI Client to connect to our design environment, only to discover that all machines were shut down. Suddenly the phone started ringing: all production VMs were shut down as well. What happened?
It turned out to be a massive (physical) switch failure: during the weekend, all switches had failed in cascade after some switch intervention (oops). Result: all NICs in all ESX hosts lost their connection. Every ESX host noticed it couldn't ping its default gateway, concluded it was isolated, and started shutting down ("releasing") all its VMs (the default HA behavior). But... there were no other ESX hosts left to take over the VMs :)
After booting the VMs, we noticed something strange: VirtualCenter was completely out of sync with what was actually running on the ESX hosts, and we couldn't VMotion, Edit Settings, etc. on our VMs. Some VMs appeared to be running on a host when they were not! Really strange stuff.
To get VirtualCenter back in sync (if restarting the VirtualCenter service doesn't help), you must restart the vpxa (the VirtualCenter agent) on every ESX host by issuing the following command: