2021-09-19: Power outage

We experienced a power outage today, 2021-09-19: power went down at 17:38 CEST and the UPS shut down at 17:40 CEST.

Timeline of events (all times CEST):

  • 17:38: power goes down
  • 17:40: UPS shutdown
  • 17:56: power comes back
  • ~18:15: work to re-up the infrastructure starts
  • 18:23: under-2 is back up
  • 18:24: Ceph is back up
  • 18:32: k3s, vault and other under-2 services are back up
  • 18:33: work to re-up OpenStack starts
  • 20:20: all services are up and the all clear is given
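
The durations implied by this timeline can be worked out with a short script. This is a minimal sketch: the timestamps are copied from the list above, while the event keys and interval labels are names chosen here for illustration. The approximate 18:15 entry is left out because it is not an exact time.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline above (all CEST, same day).
DAY = "2021-09-19"
events = {
    "power_down": "17:38",
    "ups_shutdown": "17:40",
    "power_back": "17:56",
    "all_clear": "20:20",
}


def at(hhmm: str) -> datetime:
    """Turn an HH:MM timeline entry into a datetime on the day of the outage."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def elapsed(start: str, end: str) -> timedelta:
    """Time between two named timeline events."""
    return at(events[end]) - at(events[start])


durations = {
    "power down to power back": elapsed("power_down", "power_back"),
    "UPS shutdown to all clear": elapsed("ups_shutdown", "all_clear"),
    "power back to all clear": elapsed("power_back", "all_clear"),
}

for label, delta in durations.items():
    minutes = int(delta.total_seconds() // 60)
    print(f"{label}: {minutes // 60}h{minutes % 60:02d}m")
```

By this arithmetic, mains power was out for 18 minutes, and the all clear came 2h24m after power returned.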

What went well:

  • all computes came back without any problem
  • legacy machines came back without any problems
  • kolla-ansible did a really good job of bringing the OpenStack services back up

What didn't:

  • sw-baie-d didn't have its configuration persisted, so two computes lost networking
  • nftables on under-2 doesn't start because it depends on designate; this should be resolved once all legacy domains are dropped
  • we lost prod-1-k8s-node-4, as its root volume doesn't exist anymore in Ceph
  • D-rack draws too much power at boot for the UPSs to restart on their own

Misc:

  • while re-upping OpenStack, we decided to run kolla-ansible pull and kolla-ansible deploy instead of manually checking services. This turned out to be quite effective, and it resolved #13 (closed) at the same time
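
The pull-then-deploy sequence could be scripted roughly as follows. This is only a sketch under stated assumptions: the inventory path is hypothetical (the report does not say which inventory was used), the helper names are made up here, and the argument order shown (`-i` before the subcommand) is the form used by kolla-ansible releases of that period and may differ in newer ones. The script defaults to a dry run that only prints the commands.

```python
import subprocess

# Hypothetical inventory path; the report does not state the real one.
INVENTORY = "/etc/kolla/multinode"


def kolla_commands(inventory: str) -> list[list[str]]:
    """Build the two invocations from the report: pull images, then deploy."""
    return [
        ["kolla-ansible", "-i", inventory, action]
        for action in ("pull", "deploy")
    ]


def run_all(inventory: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in kolla_commands(inventory):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence if pull fails, so deploy is skipped.
            subprocess.run(cmd, check=True)


run_all(INVENTORY)
```

Running pull first means a failed image download stops the sequence before deploy touches any service.
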
Edited Sep 19, 2021 by Marc Schmitt