2021-09-19: Power outage
We experienced a power outage today, 2021-09-19. Mains power went down at 17:38 CEST and the UPS shut down at 17:40.
Timeline of events (all times CEST):
- 17:38: power goes down
- 17:40: UPS shutdown
- 17:56: power comes back
- ~18:15: work to re-up the infrastructure starts
- 18:23: under-2 is back up
- 18:24: Ceph is back up
- 18:32: k3s, vault and other under-2 services are back up
- 18:33: work to re-up OpenStack starts
- 20:20: all services are up and the all clear is given
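For reference, the durations implied by this timeline can be worked out with a small sketch like the one below. The timestamps are copied from the list above; the dictionary keys and the `minutes_between` helper are names made up for this example, and the approximate ~18:15 entry is left out on purpose.

```python
from datetime import datetime

# Timestamps copied from the timeline (all CEST, 2021-09-19).
# The keys are labels invented for this sketch, not official names.
TIMELINE = {
    "power_down": "17:38",
    "ups_shutdown": "17:40",
    "power_back": "17:56",
    "openstack_work_starts": "18:33",
    "all_clear": "20:20",
}


def minutes_between(start, end):
    """Return whole minutes between two timeline entries on the same day."""
    fmt = "%H:%M"
    delta = (datetime.strptime(TIMELINE[end], fmt)
             - datetime.strptime(TIMELINE[start], fmt))
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print("mains power out:", minutes_between("power_down", "power_back"), "min")
    print("UPS shutdown to all clear:", minutes_between("ups_shutdown", "all_clear"), "min")
    print("OpenStack recovery:", minutes_between("openstack_work_starts", "all_clear"), "min")
```

So mains power was out for 18 minutes, but it took 160 minutes from the UPS shutdown to the all clear, 107 of which went into bringing OpenStack back.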
What went well:
- all computes came back without any problems
- legacy machines came back without any problems
- kolla-ansible did a really good job of bringing the OpenStack services back up
What didn't:
- sw-baie-d's configuration had not been persisted, so two computes had no networking once it came back up
- nftables on under-2 doesn't start because it depends on designate; this should be resolved once all legacy domains are dropped
- we lost prod-1-k8s-node-4, as its root volume doesn't exist anymore in Ceph
- D-rack is using too much power at boot for the UPSs to restart on their own
Misc:
- while re-upping OpenStack, we decided to run `kolla-ansible pull` and `kolla-ansible deploy` instead of manually checking services. This turned out to be quite effective, and resolved #13 (closed) at the same time
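A minimal sketch of how the pull-then-deploy sequence could be scripted for next time. The inventory path is a placeholder (the real one is not recorded here), the function names are made up for this example, and the `-i <inventory> <command>` argument order is an assumption based on kolla-ansible releases from around this date; newer releases may expect the options after the subcommand.

```python
import subprocess

# Hypothetical inventory path; replace with the real one.
INVENTORY = "/etc/kolla/inventory"


def kolla_commands(inventory=INVENTORY):
    """Build the two commands in the order we ran them: pull, then deploy."""
    return [
        ["kolla-ansible", "-i", inventory, action]
        for action in ("pull", "deploy")
    ]


def redeploy(inventory=INVENTORY, dry_run=True):
    """Run pull then deploy, stopping at the first failure.

    With dry_run=True (the default) the commands are only printed.
    """
    for cmd in kolla_commands(inventory):
        print("+", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit,
            # so deploy is never attempted after a failed pull.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    redeploy(dry_run=True)
```

The dry-run default is deliberate: it prints what would be run, so the sequence can be checked before anything touches the cluster.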