openstack: rabbitmq timeouts causing services to be slow/unresponsive/buggy

RabbitMQ was found to be the cause of the original problem, thus the updated title/description.

Timeline of events (all times CEST):

  • 2021-09-19 17:38: #20 (closed) happened
  • 2021-09-19 ~18:30-20:00: OpenStack is upgraded as part of #13 (closed) and !19 (merged) as it was an opportunity to do it as services were down anyway and there was no pressure to bring them back up
  • 2021-09-20 - 2021-09-21: not much work is done on OpenStack and thus we didn't see anything wrong
  • 2021-09-22: we start seeing errors on VM creating and created this issue
  • 2021-09-23 01:34: volumes are not being properly detached/re-attached to instances when managed by Cinder CSI. Upgrading Cinder CSI doesn't help. Some logs report timeouts with RabbitMQ. Deleting and re-creating the RabbitMQ cluster doesn't help. Re-creating volumes fixes the issue and we go to sleep.
  • 2021-09-23 ~18:00: terraform apply in overcloud/terraform/ results in Designate erroring with 504 Gateway Time-out. Again, logs show timeouts with RabbitMQ.
  • 2021-09-24 ~22:00 - ~01:40: Ceph is upgraded to 40Gbit/s (4x10Gbit/s), which shouldn't help as monitoring wasn't showing network pressure, and as expected, it didn't. (This maintenance was scheduled to be done before next week anyway)
  • 2021-09-24 03:05: after some testing, Designate still errors with Timed out waiting for a reply to message ID [...] (full logs below)
  • 2021-09-24 ~03:14: RabbitMQ is deployed again with rabbitmq_enable_healthchecks: "no" and Designate is restarted. No more errors.
  • 2021-09-24 ~03:15: all services are being restarted
  • 2021-09:24 ~04:00: the all clear is given.

Logs from services (in this case designate, but we were seing errors coming from a bunch of places):

2021-09-24 01:05:18.016 20 ERROR designate.api.middleware Traceback (most recent call last):
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 433, in get
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self._queues[msg_id].get(block=True, timeout=timeout)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return waiter.wait()
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return get_hub().switch()
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 313, in switch
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.greenlet.switch()
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware _queue.Empty
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware 
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware During handling of the above exception, another exception occurred:
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware 
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware Traceback (most recent call last):
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/designate/api/middleware.py", line 238, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return request.get_response(self.application)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1313, in send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     status, headers, app_iter = self.call_application(
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1278, in call_application
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     app_iter = application(self.environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 129, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     resp = self.call_func(req, *args, **kw)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 193, in call_func
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.func(req, *args, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/designate/api/middleware.py", line 320, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return request.get_response(self.application)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1313, in send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     status, headers, app_iter = self.call_application(
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1278, in call_application
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     app_iter = application(self.environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 129, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     resp = self.call_func(req, *args, **kw)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 193, in call_func
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.func(req, *args, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/keystonemiddleware/auth_token/__init__.py", line 341, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     response = req.get_response(self._app)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1313, in send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     status, headers, app_iter = self.call_application(
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1278, in call_application
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     app_iter = application(self.environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 129, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     resp = self.call_func(req, *args, **kw)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 193, in call_func
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.func(req, *args, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_middleware/base.py", line 124, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     response = req.get_response(self.application)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1313, in send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     status, headers, app_iter = self.call_application(
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1278, in call_application
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     app_iter = application(self.environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 129, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     resp = self.call_func(req, *args, **kw)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 193, in call_func
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.func(req, *args, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_middleware/base.py", line 124, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     response = req.get_response(self.application)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1313, in send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     status, headers, app_iter = self.call_application(
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1278, in call_application
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     app_iter = application(self.environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 129, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     resp = self.call_func(req, *args, **kw)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/dec.py", line 193, in call_func
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.func(req, *args, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/designate/api/middleware.py", line 226, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return request.get_response(self.application)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1313, in send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     status, headers, app_iter = self.call_application(
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/webob/request.py", line 1278, in call_application
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     app_iter = application(self.environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/pecan/middleware/recursive.py", line 56, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.application(environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/pecan/middleware/errordocument.py", line 75, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     app_iter = self.app(environ, replacement_start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/pecan/core.py", line 840, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return super(Pecan, self).__call__(environ, start_response)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/pecan/core.py", line 683, in __call__
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     self.invoke_controller(controller, args, kwargs, state)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/pecan/core.py", line 574, in invoke_controller
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     result = controller(*args, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/designate/utils.py", line 216, in wrapper
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return f(*args, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/designate/api/v2/controllers/zones/recordsets.py", line 52, in get_all
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     recordsets = common.retrieve_matched_rrsets(context, self, zone_id,
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/designate/api/v2/controllers/common.py", line 23, in retrieve_matched_rrsets
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     controller_obj.central_api.get_zone(context, zone_id)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/designate/central/rpcapi.py", line 149, in get_zone
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.client.call(context, 'get_zone', zone_id=zone_id)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 509, in call
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self.prepare().call(ctxt, method, **kwargs)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 175, in call
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     self.transport._send(self.target, msg_ctxt, msg,
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self._driver.send(target, ctxt, message,
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 680, in send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     return self._send(target, ctxt, message, wait_for_reply, timeout,
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 669, in _send
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     result = self._waiter.wait(msg_id, timeout,
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in wait
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     message = self.waiters.get(msg_id, timeout=timeout)
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware   File "/var/lib/kolla/venv/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 435, in get
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware     raise oslo_messaging.MessagingTimeout(
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID de8afcd630ea4c17a4061b183d797d83
2021-09-24 01:05:18.016 20 ERROR designate.api.middleware 
2021-09-24 01:05:18.019 20 INFO eventlet.wsgi.server [req-8845c9f9-c988-4bbd-b071-166b5b6f07b0 b867d59eb3f34e2eab684b74e2ba7e87 28d6433dd9db4e7298c84b1058ea825a - - -] 192.168.210.230,192.168.210.105 "GET /v2/zones/d2c5ce5e-d24c-49fe-8665-0b70f24b26ed/recordsets HTTP/1.1" status: 504  len: 277 time: 60.0090218
2021-09-24 01:05:18.927 20 ERROR designate.api.middleware [req-8845c9f9-c988-4bbd-b071-166b5b6f07b0 b867d59eb3f34e2eab684b74e2ba7e87 28d6433dd9db4e7298c84b1058ea825a - - -] Timed out waiting for a reply to message ID a6b6e0a05c0243e0aa434d792b370240: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID a6b6e0a05c0243e0aa434d792b370240
Root cause

!19 (merged) didn't introduce much changes, so we thought we would get away with it. Plus, there were some nice improvements, such as additional docker healthchecks for a bunch of services. In particular, there was this commit, which introduced the bug we were hitting. Disabling docker healthchecks for RabbitMQ resolved the issue. No further investigation has been made and an issue will be opened upstream.

Original description

Creating a VM ends up in the VM stuck in Scheduling state. Deleting it when in that state works, but re-creating it gives the same result.

Deleting a VM sometimes works, sometimes it stays stuck as Deleting.

There is nothing obvious in Nova logs.

Edited Sep 24, 2021 by Marc Schmitt
Assignee Loading
Time tracking Loading